<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Derek Swingley]]></title><description><![CDATA[Notes to future me.]]></description><link>https://derekswingley.com/</link><generator>Ghost 0.10</generator><lastBuildDate>Wed, 23 Jul 2025 00:53:19 GMT</lastBuildDate><atom:link href="https://derekswingley.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Re-visiting splitting a large .csv]]></title><description><![CDATA[<p>Follow-up on a years-old post titled <a href="https://derekswingley.com/2016/10/02/options-for-splitting-a-large-csv-into-smaller-files/">Options for splitting a large .csv into smaller files</a>:  <code>awk</code>, technically <code>gawk</code>, is the best way to do this. </p>

<p>I wanted to tinker with a 4.5 GB .csv file. It has data from 2012 to 2020 so I wanted to split it by year.</p>]]></description><link>https://derekswingley.com/2020/05/15/re-visiting-splitting-a-large-csv/</link><guid isPermaLink="false">16d7b643-712b-4be5-82bd-d65977b7a603</guid><category><![CDATA[csv]]></category><category><![CDATA[awk]]></category><category><![CDATA[gawk]]></category><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Fri, 15 May 2020 22:29:35 GMT</pubDate><content:encoded><![CDATA[<p>Follow-up on a years-old post titled <a href="https://derekswingley.com/2016/10/02/options-for-splitting-a-large-csv-into-smaller-files/">Options for splitting a large .csv into smaller files</a>:  <code>awk</code>, technically <code>gawk</code>, is the best way to do this. </p>

<p>I wanted to tinker with a 4.5 GB .csv file. It has data from 2012 to 2020 so I wanted to split it by year. There isn't a year column but each row has a date column in the format <code>mm/dd/yyyy</code>. It turns out <code>gawk</code> does make this pretty easy, even if the syntax to do it is funky:</p>

<pre><code class="language-awk">gawk -F ',' 'NR==1{ h=$0 }NR&gt;1{ print (!a[substr($2,7,4)]++? h ORS $0 : $0) &gt; substr($2,7,4)".csv" }' big.csv  
</code></pre>

<p>This mostly came from a <a href="https://unix.stackexchange.com/a/420945/192380">Unix &amp; Linux StackExchange answer</a> and the description there does a good job of explaining what's going on. In addition, my small changes:</p>

<ul>
<li>specify the delimiter with <code>-F</code> (gawk defaults to whitespace)</li>
<li>use <code>substr()</code> to pull out the year from the second column (<code>$2</code>) </li>
<li>add column headers when starting a new file</li>
</ul>

<p>Losing column headers was why I'd avoided awk in the past. By keeping them in a variable, <code>h</code>, and writing them out whenever a new year is encountered (the <code>!a[substr($2,7,4)]</code> check), this command does exactly what you want when splitting one .csv file into many.</p>
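<p>For comparison, the same keep-the-header logic can be sketched in plain JavaScript (a hypothetical helper, not a replacement for the gawk one-liner; it assumes the date is the second column in <code>mm/dd/yyyy</code> form):</p>

```javascript
// Group CSV lines by year, repeating the header for each new year --
// the same trick as h ORS $0 in the gawk command.
function splitByYear(lines) {
  const header = lines[0];
  const groups = {};
  for (const line of lines.slice(1)) {
    const year = line.split(',')[1].slice(6, 10); // yyyy from mm/dd/yyyy
    if (!groups[year]) groups[year] = [header];   // new year: start with the header
    groups[year].push(line);
  }
  return groups;
}
```

Each value in the returned object is the full contents of one per-year file, header included.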

<p>It's simple to get gawk from homebrew, so the bar for using this tool is pretty low. This command gets through that 4.5 GB .csv file in a couple minutes.</p>]]></content:encoded></item><item><title><![CDATA[Put statements in positive form]]></title><description><![CDATA[<p>How often have you seen code where the absence of something is negated? Sounds confusing, because it is. </p>

<p>Here's an example, assuming a config object where the value of <code>disableSomething</code> is a boolean: <br>
<code>!config.disableSomething</code></p>

<p>The result of that line will be true if <code>disableSomething</code> is false. Used</p>]]></description><link>https://derekswingley.com/2020/05/10/put-statements-in-positive-form/</link><guid isPermaLink="false">b1c77e57-a030-4ee9-97d7-ccd9bd4d71e9</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Sun, 10 May 2020 19:53:29 GMT</pubDate><content:encoded><![CDATA[<p>How often have you seen code where the absence of something is negated? Sounds confusing, because it is. </p>

<p>Here's an example, assuming a config object where the value of <code>disableSomething</code> is a boolean: <br>
<code>!config.disableSomething</code></p>

<p>The result of that line will be true if <code>disableSomething</code> is false. Used as a condition, we would be doing something if this config property is not disabled. Wouldn't it be better to say <code>enable</code>? It seems like a minor point, but this type of double negative increases complexity and the likelihood of misunderstanding when a positive equivalent will do. A positive form is simpler to process (for a programmer, not necessarily for a computer):  use <code>enableSomething</code> instead of <code>disableSomething</code>.</p>
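<p>A tiny illustration (the property names here are made up):</p>

```javascript
// Double negative: "not disabled" forces the reader to invert twice.
const config = { disableLogging: false };
const shouldLogNegative = !config.disableLogging;

// Positive form: the same intent, read in one pass.
const positiveConfig = { enableLogging: true };
const shouldLogPositive = positiveConfig.enableLogging;

// Both express "logging is on", but only one is easy to read.
```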

<p>Disable/enable are a common pair, but nearly as common, at least when doing frontend work, is show/hide. Much as I prefer to say something is enabled, I prefer even more to say show instead of hide.</p>

<p>A section from Strunk &amp; White's <a href="https://www.gutenberg.org/files/37134/37134-h/37134-h.htm"><em>The Elements of Style</em></a> is titled <a href="https://www.gutenberg.org/files/37134/37134-h/37134-h.htm#Rule_11">"Put Statements in a Positive Form"</a>. That perfectly summarizes the point I'm making. <em>The Elements of Style</em> applies to programming too!</p>

<p>Much like writing in general, the rule of thumb should be to opt for the positive form rather than the negative. Saying something is "enabled" is positive; it conveys the existence of something. Saying something is "disabled" is negative; we are talking about the absence of something. Please avoid negative forms and double negatives.</p>

<p>That is a perspective to aspire to. Something along these lines, by which I mean</p>]]></description><link>https://derekswingley.com/2019/11/07/being-right-all-the-time/</link><guid isPermaLink="false">17653b43-f94d-49c5-b462-5e12929dedb7</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Thu, 07 Nov 2019 22:41:00 GMT</pubDate><content:encoded><![CDATA[<p>A week or so ago I watched <a href="https://www.youtube.com/watch?v=R16AYZVejj8">Rhea Butcher's</a> talk from this year's <a href="https://xoxofest.com/">XOXO</a>. The whole thing is worth your time but the piece that prompted me to post here <a href="https://www.youtube.com/watch?v=R16AYZVejj8&amp;t=30m26s">starts just past 30 minutes</a>.</p>

<p>That is a perspective to aspire to. Something along these lines, by which I mean being humble and approaching differing views with openness, has been rattling around in my head for a while now. I've had "in an argument, being right and being wrong feels exactly the same" in a note on my phone for a couple years. This is a better articulation of that.</p>]]></content:encoded></item><item><title><![CDATA[How many sold in the last how many hours?]]></title><description><![CDATA[<p>Before today, I'd never heard of pawramp.com. I ended up there because I was considering buying one of their products as a gift. When I landed on a product page, I couldn't help but notice the blinking 🔥 and the impressive stat that "65 sold in last 11 hours". </p>

<p>Hmmm.</p>]]></description><link>https://derekswingley.com/2019/10/31/how-many-sold-in-the-last-how-many-hours/</link><guid isPermaLink="false">32ec610c-7326-4ade-bd1a-ed4b86666700</guid><category><![CDATA[dark patterns]]></category><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Fri, 01 Nov 2019 00:03:15 GMT</pubDate><content:encoded><![CDATA[<p>Before today, I'd never heard of pawramp.com. I ended up there because I was considering buying one of their products as a gift. When I landed on a product page, I couldn't help but notice the blinking 🔥 and the impressive stat that "65 sold in last 11 hours". </p>

<p>Hmmm. Why would they say the last 11 hours? It was roughly 11am, I think, when I visited the site, so maybe it's some kind of count since midnight? That's a stretch, but plausible. <em>Or</em> this is one of those BS counters, a well-worn dark pattern (booking.com is the main offender I'm aware of) used to nudge people into buying something. I hit refresh a couple of times and got completely different numbers each time. <a href="https://pawramp.com/products/buy-now-paw-ramp">Try it yourself</a>. </p>

<p>So these are 💩 numbers, made up to try to conjure some kind of peer pressure effect and convince people that yep, they need a ramp for paws.</p>

<p>Looking closer, the numbers sold in the last however-many hours do have their own placeholders in the page: <br>
<img src="https://derekswingley.com/content/images/2019/11/pawramp-dom.gif" alt=""></p>

<p>I started looking at scripts loaded via dev tools. <code>pawramp.js</code> seemed like an obvious candidate so I took a look at that. </p>

<p>The first function in that <a href="https://cdn.shopify.com/s/files/1/0079/1874/7711/t/10/assets/pawramp.js?6305">file</a> is <code>flashSoldBar</code>, and, it turns out, that's what's populating the sold count and number of hours. Here's the function:</p>

<pre><code class="language-js">function flashSoldBar(prefix) {  
  //if (!nathan_settings.flash_sold) return;
  var minQty = nathan_settings.flash_sold_min;
  var maxQty = nathan_settings.flash_sold_max;
  var minTime = nathan_settings.flash_min_time;
  var maxTime = nathan_settings.flash_max_time;
  minQty = Math.ceil(minQty);
  maxQty = Math.floor(maxQty);
  minTime = Math.ceil(minTime);
  maxTime = Math.floor(maxTime);

  var qty = Math.floor(Math.random() * (maxQty - minQty + 1)) + minQty;
  qty = parseInt(qty);
  if (qty &lt;= minQty) {
    qty = minQty;
  }
  if (qty &gt; maxQty) {
    qty = maxQty;
  }
  jQuery(".nt_flash_total_day" + prefix).html(qty);

  var time = Math.floor(Math.random() * (maxTime - minTime + 1)) + minTime;
  time = parseInt(time);
  if (time &lt;= minTime) {
    time = minTime;
  }
  if (time &gt; maxTime) {
    time = maxTime;
  }
  jQuery(".nt_flash_in_hour" + prefix).html(time);
}
</code></pre>

<p>and the settings object:  </p>

<pre><code class="language-js">{
  "flash_sold": true,
  "flash_sold_min": 13,
  "flash_sold_max": 81,
  "flash_min_time": 6,
  "flash_max_time": 24
}
</code></pre>

<p>From that we can see that the page is going to say anywhere from 13 to 81 items were sold in the last 6 to 24 hours. The fun thing is that the function is defined globally, so we can call it from the dev tools console. If we call it from a timer, it'll repeatedly update the counts. Take a look:</p>

<p><img src="https://derekswingley.com/content/images/2019/11/pawramp-interval.gif" alt=""></p>
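<p>The inclusive random-range expression that <code>flashSoldBar</code> leans on can be checked on its own (a minimal sketch, separate from the site's code):</p>

```javascript
// Math.floor(Math.random() * (max - min + 1)) + min yields an integer
// between min and max, inclusive -- which is why the page always claims
// somewhere between flash_sold_min and flash_sold_max sales.
function randomBetween(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}
```

Note that this makes the extra clamping in <code>flashSoldBar</code> redundant; the formula already stays in range.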

<p>I like supporting small companies like this but I wish they'd quit with this bullshit. I was pretty sure I was going to buy their product without knowing how many people also supposedly bought it in the last 15 hours. How many of those 160 reviews are fake too? </p>

<p>What's that? Oh, yes, I did buy a pawramp. </p>]]></content:encoded></item><item><title><![CDATA[Get total vertex count for a shapefile]]></title><description><![CDATA[<p>I needed a quick way to get the total vertex count in a shapefile. The <code>ogrinfo</code> command includes vertex count in its output. This command:  </p>

<pre><code class="language-sh">$ ogrinfo -dialect SQLite -sql "SELECT sum(ST_NPoints(geometry)) AS vertex_count FROM OK_second_congressional_district" OK_second_congressional_district.shp
</code></pre>

<p>yields:  </p>

<pre><code class="language-sh">INFO: Open</code></pre>]]></description><link>https://derekswingley.com/2019/10/22/get-total-vertex-count-for-a-shapefile/</link><guid isPermaLink="false">95891833-a9d1-46c0-b262-1201e261990f</guid><category><![CDATA[ogr2ogr]]></category><category><![CDATA[shell]]></category><category><![CDATA[ogrinfo]]></category><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Tue, 22 Oct 2019 21:16:39 GMT</pubDate><content:encoded><![CDATA[<p>I needed a quick way to get the total vertex count in a shapefile. The <code>ogrinfo</code> command includes vertex count in its output. This command:  </p>

<pre><code class="language-sh">$ ogrinfo -dialect SQLite -sql "SELECT sum(ST_NPoints(geometry)) AS vertex_count FROM OK_second_congressional_district" OK_second_congressional_district.shp
</code></pre>

<p>yields:  </p>

<pre><code class="language-sh">INFO: Open of `OK_second_congressional_district.shp'  
      using driver `ESRI Shapefile' successful.

Layer name: SELECT  
Geometry: None  
Feature Count: 1  
Layer SRS WKT:  
(unknown)
vertex_count: Integer (0.0)  
OGRFeature(SELECT):0  
  vertex_count (Integer) = 203
</code></pre>

<p>But that is a lot to type (and remember).</p>

<p>Since the only thing that changes between running commands is the name of a shapefile, it's easy to call this from a shell script:</p>

<pre><code class="language-sh">#!/usr/bin/env bash
printf "\nLayer: $1\n"  
ogrinfo -dialect SQLite -sql "SELECT sum(ST_NPoints(geometry)) AS vertex_count FROM $1" $1.shp | grep "vertex_count (Integer) ="  
printf "\n"  
</code></pre>

<p>Using <code>$1</code> takes the first argument passed to the script (in our case, the name of a layer / shapefile) and uses it in the <code>ogrinfo</code> command. </p>

<p>To use this script, save it to a file somewhere (I called it <code>vc</code>), make it executable, and copy it to a directory in <code>$PATH</code>. Now it's much easier to get that count I wanted:</p>

<pre><code class="language-sh">$ vc OK_second_congressional_district

Layer: OK_second_congressional_district  
  vertex_count (Integer) = 203
</code></pre>]]></content:encoded></item><item><title><![CDATA[Using the Census API to get county FIPS codes]]></title><description><![CDATA[<p>I came across a <a href="https://www.propublica.org/datastore/dataset/home-price-impact-of-tax-cuts-and-jobs-act-of-2017">.csv file</a> recently that had some data I wanted to see on a map but the .csv only had county names (formatted as 'some county (state)'). To get it on a map, I'd need to get <a href="https://en.wikipedia.org/wiki/FIPS_county_code">FIPS</a> codes for each county alongside this data.</p>

<p>It'd</p>]]></description><link>https://derekswingley.com/2019/10/13/using-the-census-api-to-get-county-fips-codes/</link><guid isPermaLink="false">32ac22cb-b91a-4990-9aca-f4f057ef1789</guid><category><![CDATA[node]]></category><category><![CDATA[javascript]]></category><category><![CDATA[census]]></category><category><![CDATA[FIPS]]></category><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Mon, 14 Oct 2019 04:16:56 GMT</pubDate><content:encoded><![CDATA[<p>I came across a <a href="https://www.propublica.org/datastore/dataset/home-price-impact-of-tax-cuts-and-jobs-act-of-2017">.csv file</a> recently that had some data I wanted to see on a map but the .csv only had county names (formatted as 'some county (state)'). To get it on a map, I'd need to get <a href="https://en.wikipedia.org/wiki/FIPS_county_code">FIPS</a> codes for each county alongside this data.</p>

<p>It'd been a while since I'd done something like this. I grabbed some shapefiles from the <a href="https://www2.census.gov/geo/tiger/TIGER2019/">Census</a> but got about half-way through before I decided I wanted to see if there was a Census API I could hit for FIPS codes. Luckily, it wasn't too hard to find what I needed. Here's the URL to get FIPS codes for every county in the US:  </p>

<pre><code>https://api.census.gov/data/2010/dec/sf1?get=NAME&amp;for=county:*  
</code></pre>

<p>Note that there are no state names in that response, but there are FIPS codes for states. There's probably a way to get the state name in the response, but I found it easier to go get state names and FIPS codes for each state and then combine that with the counties info. Here's the URL to get info about all US states:  </p>

<pre><code>https://api.census.gov/data/2010/dec/sf1?get=NAME&amp;for=state:*
</code></pre>

<p>With both of those responses saved to disk via a <a href="https://gist.github.com/swingley/968f2d42cff45e087175908a86582192">script</a> I built a <a href="https://gist.github.com/swingley/304d35d40a2c14f579c9625de3c2131c"><code>county (state)</code> to FIPS object</a> to get a FIPS code next to the data in question so I could put it on a map.</p>
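<p>Combining the two responses boils down to a small join; roughly (the row shapes follow the API's JSON, with a header row first, but the sample data below is illustrative):</p>

```javascript
// Build a "County (State)" -> 5-digit FIPS lookup from the two Census
// API responses. State rows are [NAME, stateFips]; county rows are
// [NAME, stateFips, countyFips]. Each response starts with a header row.
function buildFipsLookup(states, counties) {
  const stateNames = {};
  for (const [name, fips] of states.slice(1)) stateNames[fips] = name;
  const lookup = {};
  for (const [name, stateFips, countyFips] of counties.slice(1)) {
    // Full county FIPS is the 2-digit state code plus the 3-digit county code.
    lookup[`${name} (${stateNames[stateFips]})`] = stateFips + countyFips;
  }
  return lookup;
}
```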

<p>Once I had a lookup, I wrote a <a href="https://gist.github.com/swingley/1ee06f28069bb4eec5343f1f37c84c6c">node script to transform my .csv to a new .csv that includes FIPS code for each row</a>. I mapped that file over on <a href="https://www.datawrapper.de/_/s2A24/">Datawrapper</a>.</p>]]></content:encoded></item><item><title><![CDATA[Data visualization 30ish years ago]]></title><description><![CDATA[<p>First-hand account and enjoyable read about how data visualization was done a few decades ago:  <a href="https://medium.economist.com/data-visualisation-from-1987-to-today-65d0609c6017">Data visualisation, from 1987 to today by Graham Douglas</a>.</p>

<p>I knew I'd enjoy this when I read:  </p>

<blockquote>
  <p>Back in the 1980s we weren’t called data visualisers.</p>
</blockquote>]]></description><link>https://derekswingley.com/2018/09/06/data-visualization-30ish-years-ago/</link><guid isPermaLink="false">7fad3307-7242-47f3-aaa1-872ed5c14db8</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Thu, 06 Sep 2018 17:03:14 GMT</pubDate><content:encoded><![CDATA[<p>First-hand account and enjoyable read about how data visualization was done a few decades ago:  <a href="https://medium.economist.com/data-visualisation-from-1987-to-today-65d0609c6017">Data visualisation, from 1987 to today by Graham Douglas</a>.</p>

<p>I knew I'd enjoy this when I read:  </p>

<blockquote>
  <p>Back in the 1980s we weren’t called data visualisers.</p>
</blockquote>]]></content:encoded></item><item><title><![CDATA[Better JavaScript string sorting with collators]]></title><description><![CDATA[<p>Pop quiz:  how do you sort an array of strings so that any numbers (stored as strings) are sorted in the correct order? That is, if you have:</p>

<p><code>const vals = ['1743', '596', '94', '337']</code></p>

<p>After sorting, the expected result is:</p>

<p><code>['94', '337', '596', '1743']</code></p>

<p>JavaScript's default array sorting method doesn't</p>]]></description><link>https://derekswingley.com/2018/05/08/better-javascript-string-sorting-with-collators/</link><guid isPermaLink="false">ec957bd8-7cc3-439b-9250-c5f52674a0c7</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Tue, 08 May 2018 15:30:37 GMT</pubDate><content:encoded><![CDATA[<p>Pop quiz:  how do you sort an array of strings so that any numbers (stored as strings) are sorted in the correct order? That is, if you have:</p>

<p><code>const vals = ['1743', '596', '94', '337']</code></p>

<p>After sorting, the expected result is:</p>

<p><code>['94', '337', '596', '1743']</code></p>

<p>JavaScript's default array sorting method doesn't do what you need or expect. The reason is easy to find, and is the third sentence on the MDN page for <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort">Array.prototype.sort</a>:</p>

<blockquote>
  <p>The default sort order is according to string Unicode code points.</p>
</blockquote>

<p>For years and years, JS devs have been writing their own <a href="https://stackoverflow.com/a/6568100/1934">compare functions</a>. It's just one of those things that you learn to do even though it <em>feels</em> like something that should be baked into the language.</p>
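<p>A typical hand-rolled compare function for this case looks something like:</p>

```javascript
// Compare as numbers instead of as strings of code points.
const vals = ['1743', '596', '94', '337'];
vals.sort((a, b) => Number(a) - Number(b));
// vals: ['94', '337', '596', '1743']
```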

<p>I recently had to deal with sorting some arrays similar to the above example. In my internet travels, I came across <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Collator">Intl.Collator</a> via <a href="https://stackoverflow.com/a/38641281/1934">StackOverflow</a>. </p>

<p>Initially, I was skeptical and confused as to why something related to i18n would have this capability, but it works!</p>

<p>It's a little more code than calling <code>.sort()</code>, but not much:</p>

<pre><code class="language-js">const collator = new Intl.Collator(undefined, {  
  numeric: true,
  sensitivity: 'base'
})
vals.sort(collator.compare)  
</code></pre>

<p>Now <code>vals</code> is in the expected order. No need to mess with various natural sort algorithm implementations; collator is in the language and <a href="https://caniuse.com/#search=collator">well supported</a>.</p>]]></content:encoded></item><item><title><![CDATA[Linear versus logarithmic scales]]></title><description><![CDATA[<p>Stephen Few's recent post titled <a href="http://www.perceptualedge.com/blog/?p=2838">Logarithmic Confusion</a> contains a succinct description of linear vs. logarithmic scales:</p>

<blockquote>
  <p>units along a logarithmic scale increase by rate (e.g., ten times the previous value for a log base 10 scale or two times the previous value for a log base 2 scale), not</p></blockquote>]]></description><link>https://derekswingley.com/2018/03/22/linear-versus-logarithmic-scales/</link><guid isPermaLink="false">76caa923-7b13-4213-ac5a-3d4dd5411989</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Thu, 22 Mar 2018 15:40:10 GMT</pubDate><content:encoded><![CDATA[<p>Stephen Few's recent post titled <a href="http://www.perceptualedge.com/blog/?p=2838">Logarithmic Confusion</a> contains a succinct description of linear vs. logarithmic scales:</p>

<blockquote>
  <p>units along a logarithmic scale increase by rate (e.g., ten times the previous value for a log base 10 scale or two times the previous value for a log base 2 scale), not by amount</p>
</blockquote>

<p>For whatever reason, that description resonates with me. I like it better than what's on <a href="https://en.wikipedia.org/wiki/Logarithmic_scale">wikipedia</a>.</p>

<p>The entire post is worth reading. The gist is someone with a keen eye (Few) suspected a particular chart fit a pleasing narrative a little too well.</p>]]></content:encoded></item><item><title><![CDATA[No more twitter]]></title><description><![CDATA[<p>Back when twitter rolled out 280 characters to everyone, I realized I'd had enough. It wasn't <em>just</em> the doubling of the character limit; that was more of a coincidence. That change was an inflection point for me. It was also right around when the outrage du jour was that some</p>]]></description><link>https://derekswingley.com/2018/03/07/no-more-twitter/</link><guid isPermaLink="false">8c014d2b-3d60-41c9-b325-86e27ac8a3d7</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Wed, 07 Mar 2018 18:30:34 GMT</pubDate><content:encoded><![CDATA[<p>Back when twitter rolled out 280 characters to everyone, I realized I'd had enough. It wasn't <em>just</em> the doubling of the character limit; that was more of a coincidence. That change was an inflection point for me. It was also right around when the outrage du jour was that some nazi got verified. I don't recall the specifics and they don't matter. I told myself that it was time to stop directly going to Twitter. This was back in early November of last year.</p>

<p>It's been a few months, and I still end up on twitter once in a while (primarily from links I click on <a href="http://belong.io/">belong.io</a>) but I don't use any twitter clients and don't go to twitter.com on my own. I don't miss it and my gut tells me I'm better off without it. Without the distraction and time-suck, I have more time to decompress, more time for thoughts to develop as opposed to constantly reaching for and craving more <em>more</em> <em><strong>more</strong></em>. I feel more put together. My current view on twitter, after distancing myself from it, is that if consumed regularly and at length, it is too much to handle in a healthy way, at least with the timeline I'd built. So that's why I quit.</p>

<p>In the months since abstaining from twitter, not a week goes by that I don't see someone else pontificating on the perils of social sites. The latest to get my attention is a <a href="https://www.nytimes.com/2018/03/07/technology/two-months-news-newspapers.html">piece in The New York Times</a>. The salient lines:</p>

<blockquote>
  <p>If something really big happens, you will find out.</p>
</blockquote>

<p>and:</p>

<blockquote>
  <p>I began to see it wasn’t newspapers that were so great, but social media that was so bad.</p>
</blockquote>

<p>I couldn't agree more with those statements. So add me to the pile of people telling you to get off social media:  stop using big social media sites; you'll be better for it in the long run.</p>

<p>Fast-forward to this morning. I saw <a href="https://projects.fivethirtyeight.com/redistricting-maps/">a story I was interested in reading</a> in my feed reader so I clicked the link. New tab opens, and is immediately unusable. I can't scroll, I can't open dev tools, I can't close the tab. This time I was a bit more motivated to figure out what was going on.</p>

<p>Once I got the rogue tab closed, I whitelisted FiveThirtyEight. Then the page loaded and operated as expected. But by this point, I couldn't care less about the actual content—I wanted to know <em>why</em> the page froze when running an ad blocker.</p>

<p>If you've run an ad blocker for any amount of time, you have seen weird, half-rendered pages while cruising around the web. Occasionally sites will only render their content after some combination of advertising and tracking scripts have loaded. Others will show you a pop-up if they detect a blocker and ask to be whitelisted. Sometimes I whitelist, sometimes I give up and forget about whatever previously grabbed my attention.</p>

<p>This FiveThirtyEight experience was different. Missing content and broken pages, I understand that. A locked up browser tab from a reputable site I frequent is something different. I really wanted to figure out what was happening.</p>

<p><img src="https://derekswingley.com/content/images/2018/01/browser-task-manager-ghostery-pegged.png" alt="Google Chrome Task Manager showing Ghostery at greater than 100% utilization"></p>

<p>Ghostery was clearly the culprit, but why? I've been using it for a couple years now, and can't recall having a problem like this in the past.</p>

<p>I started re-blocking each tracker identified by Ghostery. I'd block one, reload. If the page worked, I went to the next one. None of the trackers from the big, recognizable companies (Google, Facebook, Adobe) were causing the problem. I could block those and the page still worked.</p>

<p>I narrowed down the problem to the advertising tracker called "<a href="https://apps.ghostery.com/apps/netratings_sitecensus">NetRatings SiteCensus</a>". I'd never heard of it, but apparently it's something from Nielsen (of TV ratings fame, I think).</p>

<p>Visiting the page in question with dev tools gave me a little more info:</p>

<p><img src="https://derekswingley.com/content/images/2018/01/dev-tools-too-many-requests.png" alt="Google Chrome Developer Tools' Network section showing an inordinate number of requests to a tracking script"></p>

<p>Over 2300 requests, running one request per millisecond based on the timestamp in the URL query string. That's enough to freeze any browser tab. The full URL to the script in question is:  <code>https://cdn-gl.imrworldwide.com/novms/js/2/ggcmb510.js?_=timestamp</code>. The script itself wasn't necessarily problematic; the problem was all the requests to get it.</p>

<p>Looking at the "Initiator" for all these requests, I headed over to <code>https://secure.espn.com/combiner/c?js=jquery-1.10.2.js,plugins/jquery.pubsub.r5.js,analytics/visitorAPI_156.js,analytics/sOmni_161.js&amp;ver=20180126</code>. Digging through there, I found the place where some javascript was making a request to get the js file from imrworldwide.com:</p>

<p><img src="https://derekswingley.com/content/images/2018/01/code-inserting-tracking-script.png" alt="FiveThirtyEight javascript inserting additional javascript from a tracking company"></p>

<p>Now I was getting somewhere:  <code>espn.track.initNielsen</code> was trying to load the script in question, then executing a function to call <code>espn.track.nielsen</code>. Who was calling <code>initNielsen</code>? Well, <code>nielsen</code>, of course. The infinite loop was taking shape.</p>

<p>What about the next level up the stack? Who starts this <code>nielsen</code> → <code>initNielsen</code> → <code>nielsen</code> madness?</p>

<p>After diligently dissecting 8k lines of code... just kidding, I ⌘ + f-ed around until I figured out that <code>espn.track.nielsen</code> is first called by another function called <code>init</code> defined inside of another function called <code>espn.track.trackPage</code>. That one, <code>trackPage</code>, is called by another function called <code>loadOmniture</code> that's defined in a <code>&lt;script&gt;</code> tag in the page's HTML just inside the closing body tag. <em>Whew</em>.</p>

<p>Here's a recap starting with page load, rather than working backward from the two functions that endlessly call each other:</p>

<ol>
<li>Page loads  </li>
<li><code>&lt;script&gt;</code> tags in the page are executed  </li>
<li>One <code>&lt;script&gt;</code> runs <code>loadOmniture</code> if <code>!window.___skipShowAds</code> is true  </li>
<li><code>loadOmniture</code> calls <code>espn.track.trackPage</code> passing an object with a property called <code>enableNielsen</code> that is truthy  </li>
<li><code>trackPage</code> defines a local function called <code>init</code> and runs it  </li>
<li><code>init</code> calls <code>espn.track.nielsen</code> when the aforementioned <code>enableNielsen</code> property is true-ish  </li>
<li>our new friend <code>nielsen</code> checks if a global variable named <code>NOLCMB</code> exists  </li>
<li>when there's no variable named <code>NOLCMB</code>, <code>initNielsen</code> gets called  </li>
<li><code>initNielsen</code> tries to insert the Nielsen NetRatings SiteCensus script and runs <code>nielsen</code> again</li>
</ol>

<p>Ghostery blocks that script, so we get an endless loop where each loop iteration makes a request. </p>

<p>Why doesn't this happen without an ad blocker? Because the Nielsen script in question, <a href="https://cdn-gl.imrworldwide.com/novms/js/2/ggcmb510"><code>https://cdn-gl.imrworldwide.com/novms/js/2/ggcmb510</code></a>, defines a global named <code>NOLCMB</code> when it loads, thereby avoiding the infinite loop caused by <code>initNielsen</code> (steps 7 and 8 above).</p>

<p>The <code>NOLCMB</code> global acts as a flag indicating whether the Nielsen tracker has loaded. The code naively assumes that when the tracker isn't loaded, it can load it. And so it tries over and over.</p>
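<p>The loop described above can be sketched in a few lines. To be clear, this is a reconstruction for illustration, not ESPN's actual code, and the retry cap is something I've added so the sketch halts instead of blowing the call stack like the real page did:</p>

```javascript
// Reconstruction of the pattern described above -- not ESPN's actual code.
const espn = { track: {} };
let requestCount = 0;  // stands in for one blocked network request per pass
const MAX_RETRIES = 5; // not in the real code; the real loop never stops

espn.track.initNielsen = function () {
  requestCount += 1;    // real code: insert the SiteCensus <script> tag here
  espn.track.nielsen(); // ...then immediately check again
};

espn.track.nielsen = function () {
  // With the tracker blocked, NOLCMB is never defined, so the check
  // always fails and we try to load the script yet again
  if (typeof globalThis.NOLCMB === "undefined" && requestCount < MAX_RETRIES) {
    espn.track.initNielsen();
  }
};

espn.track.nielsen();
console.log(requestCount); // 5 -- without the cap, this grows until the tab dies
```

When the real tracker script loads, it defines <code>NOLCMB</code> and the <code>typeof</code> check breaks the cycle; when Ghostery blocks it, nothing ever defines the global and the two functions ping-pong forever.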

<p>In summary:  request blocked? Do it again! Again! And again. Yep, again. Repeat thousands of times. <em>browser tab dies</em></p>

<p>This is one of the few experiences I've had where running an ad-blocker made a page less usable. And since I'd like to avoid that, I've left my whitelisting of FiveThirtyEight in place.</p>]]></content:encoded></item><item><title><![CDATA[How to apologize]]></title><description><![CDATA[<p>This past Sunday, I watched <em>The Sound of Music</em> for the first time. Even when I realized it was nearly three hours, I still went through with it. And what do you know, I enjoyed it. I even recommend making time to watch it if you've somehow avoided all</p>]]></description><link>https://derekswingley.com/2018/01/18/how-to-apologize/</link><guid isPermaLink="false">228d8610-7db0-4dc2-ab05-c882da7f283e</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Thu, 18 Jan 2018 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>This past Sunday, I watched <em>The Sound of Music</em> for the first time. Even when I realized it was nearly three hours, I still went through with it. And what do you know, I enjoyed it. I even recommend making time to watch it if you've somehow avoided it all these years, like I had.</p>

<p>There are a number of memorable scenes, but the one I want to mention is when Captain Von Trapp, after being bluntly told by Maria that he should love his children, evicts Maria and says she must return to the Abbey. Immediately after, he hears his children singing a song that Maria taught them. He quickly realizes he was wrong to come down so hard on Maria. The next thing you know, he's righting his wrong. Which brings us to the dialog that inspired this post:</p>

<blockquote>
  <p>I behaved badly. I apologize. You were right. I don't know my children.</p>
</blockquote>

<p><strong>Wow</strong>. No bullshit there. A complete and total admission of being wrong, unconditional and unqualified. He's accepting responsibility for being an asshole and directly acknowledging his error. Not only that, there are no excuses; no "but I..." or "because...". It's short, to the point, and impossible to misunderstand. He also doesn't grovel or cower. He maintains self respect while being contrite.</p>

<p>In my experience, an apology of this caliber is rare. <a href="https://en.wikiquote.org/wiki/The_Rules_of_the_Game#Octave">Everyone has their reasons</a>. However, reasons (or intentions) do not excuse bad behavior. <em>Actions</em> matter, not <em>intentions</em>. </p>

<p>To satisfy the title of the post, let's break this down into tiny steps. To build an effective apology:</p>

<ol>
<li>Acknowledge you were wrong by <em>saying so</em>. It can be as simple as the Captain's:  <strong>I behaved badly</strong>  </li>
<li>Apologize. Keep it simple:  <strong>I apologize</strong> or <strong>I'm sorry</strong>  </li>
<li>Attempt to make amends by addressing your failure or shortcoming.  </li>
<li>Avoid excuses.</li>
</ol>

<p>What's there is just as important as what's missing:  there are no excuses, no reasons for what was wrong. This is crucial because it means there is no shifting of responsibility or blame. Because there is ownership of wrongdoing, <strong>forgiveness from the other side can follow</strong>. An apology that results in forgiveness means you are likely to avoid having a grudge held against you. Finally, there <em>isn't</em> a promise about future behavior. </p>

<p>This form of apology is appropriate for one-off errors or mistakes. If you're doing something bad habitually, and subsequently need to assure someone that this really is the <em>last time</em>, that <em>you're so, so sorry</em>, you need to have a longer, more substantial conversation. You'll need more than 10-15 words.</p>

<p>When you do screw up, as we all do, apologize. If you don't shift the blame for <em>your</em> actions, you're more likely to learn from the experience. As a result, you will be less likely to repeat the same mistake. To apologize effectively, apologize directly, respectfully, and without excuses.</p>]]></content:encoded></item><item><title><![CDATA[A Goldilocks zone summary of modern javascript]]></title><description><![CDATA[<p>Summarize modern web development and provide historical context for how we ended up where we are today. </p>

<p>That's a difficult task but <a href="https://medium.com/@peterxjang">Peter Jang</a>'s <a href="https://medium.com/the-node-js-collection/modern-javascript-explained-for-dinosaurs-f695e9747b70">Modern JavaScript Explained For Dinosaurs</a> is a home run. It's suitable for greybeards and greenhorns alike.</p>

<p>From a single html file with a <code>&lt;script&gt;</code></p>]]></description><link>https://derekswingley.com/2018/01/16/a-goldilocks-zone-summary-of-modern-javascript/</link><guid isPermaLink="false">4037e10d-aec8-4317-9772-ddf3ee2c2579</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Tue, 16 Jan 2018 19:33:28 GMT</pubDate><content:encoded><![CDATA[<p>Summarize modern web development and provide historical context for how we ended up where we are today. </p>

<p>That's a difficult task but <a href="https://medium.com/@peterxjang">Peter Jang</a>'s <a href="https://medium.com/the-node-js-collection/modern-javascript-explained-for-dinosaurs-f695e9747b70">Modern JavaScript Explained For Dinosaurs</a> is a home run. It's suitable for greybeards and greenhorns alike.</p>

<p>From a single html file with a <code>&lt;script&gt;</code> tag through configuring webpack, babel, and npm, his article is a great overview summarizing the current state of web development and how we got here. There's just enough history (he leaves <em>plenty</em> out, which is probably one reason it's so successful) and just enough detail as to why the current status quo is what it is. Not too much, not too little—it's just right. The panels from <a href="http://www.qwantz.com/">Dinosaur Comics</a> are a nice touch too.</p>]]></content:encoded></item><item><title><![CDATA[Spectre as an analogy for the state of tech in 2017]]></title><description><![CDATA[<p><a href="https://stratechery.com/2018/meltdown-spectre-and-the-state-of-technology/">Ben Thompson writing about Meltdown and Spectre</a>:</p>

<blockquote>
  <p>I ended 2017 without my customary “State of Technology” post, and just as well: Spectre is a far better representation than anything I might have written. Faced with a fundamental imbalance (data fetch slowness versus execution speed), processor engineers devised an ingenious system</p></blockquote>]]></description><link>https://derekswingley.com/2018/01/11/spectre-as-an-analogy-for-the-state-of-tech-in-2017/</link><guid isPermaLink="false">3aa24bcf-8985-46b5-a04b-58326b4ce4e6</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Thu, 11 Jan 2018 17:31:49 GMT</pubDate><content:encoded><![CDATA[<p><a href="https://stratechery.com/2018/meltdown-spectre-and-the-state-of-technology/">Ben Thompson writing about Meltdown and Spectre</a>:</p>

<blockquote>
  <p>I ended 2017 without my customary “State of Technology” post, and just as well: Spectre is a far better representation than anything I might have written. Faced with a fundamental imbalance (data fetch slowness versus execution speed), processor engineers devised an ingenious system optimized for performance, but having failed to envision the possibility of bad actors abusing the system, everyone was left vulnerable.</p>
  
  <p>The analogy is obvious: faced with a fundamental imbalance (the difficulty of gaining and retaining users versus the ease of rapid iteration and optimization), Internet companies devised ingenious systems optimized for engagement, but having failed to envision the possibility of bad actors abusing the system, everyone was left vulnerable.</p>
  
  <p>Spectre, though, helps illustrate why these issues are so vexing:</p>
  
  <ul>
  <li>I don’t believe anyone intended to create this vulnerability</li>
  <li>The vulnerability might be worth it — the gains from faster processors have been absolutely massive!</li>
  <li>Regardless, decisions made in the past are in the past: the best we can do is muddle through</li>
  </ul>
  
  <p>So it is with the effects of Facebook, Google/YouTube, etc., and the Internet broadly. Power comes from giving people what they want — hardly a bad motivation! — and the benefits still may — probably? — outweigh the downsides. Regardless, our only choice is to move forward.</p>
</blockquote>

<p>As Thompson admits, there are some really strained analogies earlier in the piece. However, this final analogy couldn't be more apt. </p>]]></content:encoded></item><item><title><![CDATA[Migrating from Mapzen Search to Mapbox Geocoding]]></title><description><![CDATA[<p><a href="https://mapzen.com/blog/shutdown/">Mapzen's services</a> will go away at the end of the month 😢. I'd only used their services in scratching personal itches and the only public project I have using anything of theirs is my <a href="https://derekswingley.com/lab/trip-cast/">trip cast site</a>. Nevertheless, they were a group I enjoyed interacting with and their work over the</p>]]></description><link>https://derekswingley.com/2018/01/10/migrating-from-mapzen-search-to-mapbox-geocoding/</link><guid isPermaLink="false">8a72b31a-b46f-4496-bf33-766ef1da104a</guid><dc:creator><![CDATA[Derek Swingley]]></dc:creator><pubDate>Wed, 10 Jan 2018 19:12:42 GMT</pubDate><content:encoded><![CDATA[<p><a href="https://mapzen.com/blog/shutdown/">Mapzen's services</a> will go away at the end of the month 😢. I'd only used their services in scratching personal itches and the only public project I have using anything of theirs is my <a href="https://derekswingley.com/lab/trip-cast/">trip cast site</a>. Nevertheless, they were a group I enjoyed interacting with and their work over the past few years had pushed the envelope of general expectations for geospatial web services and projects. Their products worked reliably, smoothly, and easily. Their services were easy to interact with, thoroughly documented, and serve as great examples for anyone who wants to build services for the web. </p>

<p>I love the phrase <a href="http://www.folkloredesign.com/blog/2014/9/success-hides-problems">"success hides problems"</a> (credit to <a href="https://en.wikipedia.org/wiki/Edwin_Catmull">Ed Catmull</a>) and can't help but think there's an equally pithy phrase along the lines of "failure masks achievements" that conveys an equally important lesson. Mapzen is closing shop but their legacy will be great work that will live on because <a href="https://mapzen.com/blog/our-magna-carto/">so much of it is in the open</a>.</p>

<p>Since their search service will be gone by the end of the month, and since I'm not ready to stop using <a href="https://derekswingley.com/lab/trip-cast/">trip cast</a>, I needed an alternative. I was already using a Mapbox map so I decided to see what was involved in dropping in Mapbox's geocoder in place of the Mapzen search service. Turns out, <a href="https://github.com/swingley/trip-cast/commit/0eca25bdc275e33b4fa5c00a1163897f64a0aeee">not much!</a> That link goes to the commit with the changes required to use Mapbox instead of Mapzen. Obviously there's a different URL for the service. Other than that, results from Mapbox hang properties for place info directly off result objects instead of nesting them inside the usual GeoJSON properties object. Not a big deal. The two Mapbox documentation pages that helped me were the <a href="https://www.mapbox.com/api-documentation/?language=JavaScript#request-format">API docs for response objects</a> and the <a href="https://www.mapbox.com/api-playground/#/?_k=h64xf6">API playground page to search for places</a>. In limited testing, the results are what you'd expect from a place search/geocoding service.</p>
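<p>To make the response-shape difference concrete, here's a rough sketch of the same label lookup handling both services. The field names match what I saw from each API (<code>place_name</code> directly on Mapbox features, <code>properties.label</code> on Mapzen/Pelias results), but treat the shapes below as illustrative stand-ins, not complete responses:</p>

```javascript
// Illustrative response shapes only; real results carry many more fields.
function placeLabel(feature) {
  // Mapbox hangs place info directly off the feature...
  if (feature.place_name) {
    return feature.place_name;
  }
  // ...while Mapzen/Pelias nested it inside the standard GeoJSON `properties`.
  return feature.properties.label;
}

const mapzenResult = { properties: { label: "Portland, OR, USA" } };
const mapboxResult = { place_name: "Portland, Oregon, United States" };

console.log(placeLabel(mapzenResult)); // "Portland, OR, USA"
console.log(placeLabel(mapboxResult)); // "Portland, Oregon, United States"
```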

<p>Farewell Mapzen!</p>]]></content:encoded></item></channel></rss>