Options for splitting a large .csv into smaller files
I was recently messing around with a .csv file that weighed in at ~38 MB and had ~750K lines. It was data for the 50 states, so I figured it'd be a good idea to split it by state to make it easier to manage and manipulate.
The first option that came to mind was csvkit's csvgrep.
Here's how I used it:
csvgrep -c State -m CA data.csv > data_california.csv
csvgrep -c State -m CT data.csv > data_connecticut.csv
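If you wanted all 50 states, you could wrap that in a shell loop over the two-letter codes. A rough sketch (only a few codes listed, and the output naming is just an assumed scheme keyed on the abbreviation rather than the full state name):

# one csvgrep pass per state; each pass re-reads the entire file
for st in AL AK AZ CA CT NY TX WY; do
  csvgrep -c State -m "$st" data.csv > "data_${st}.csv"
done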
That works. But it was taking a little time, and it seemed wasteful to re-read the whole file once per state. I figured there'd be a unix-y way to do this faster.
After much googling, stack overflowing and experimenting, I ended up using awk to read and split the file. Like anything shell-scripting-ish, it looks like nonsense:
awk -F ',' '{if (NR!=1) {print >> (substr($2,2,2)".csv"); close(substr($2,2,2)".csv")}}' data.csv
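Here's the same command spread over a few lines with comments, purely for readability (it assumes, like the one-liner, that the second column holds a quoted two-letter state code):

awk -F ',' '
  NR != 1 {                          # skip the header row
    out = substr($2, 2, 2) ".csv"    # e.g. a quoted "CA" becomes CA.csv
    print >> out                     # append the whole line to that state file
    close(out)                       # close it so we never hit "too many open files"
  }
' data.csv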
But it's pretty slick that it works. The -F ',' says columns are delimited by commas. Then, as long as it's not on record one (the first line, with the field names), it prints the whole line to a file named after the value in the second column ($2); the substr($2,2,2) slices the two-letter state code out from between the quotes, so a value like "CA" becomes CA.csv. The close(...) part is there to avoid an error about "too many open files". The whole thing takes less than two minutes and would probably be faster if it weren't constantly closing and reopening files. The downside of this approach (besides the readability of the command) is that the resulting files don't have the field names in the first line. For what I was doing that wasn't a big deal, but it could be annoying in other scenarios.
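If the missing header line does matter, a small variant of the same awk approach can remember line one and write it to each state file the first time that file is created. A sketch, untested against this particular data:

awk -F ',' '
  NR == 1 { header = $0; next }      # stash the line with the field names
  {
    out = substr($2, 2, 2) ".csv"
    if (!(out in seen)) {            # first record for this state:
      print header > out             #   start the file with the header
      seen[out] = 1
    }
    print >> out                     # then append the data line
    close(out)                       # still closing to avoid "too many open files"
  }
' data.csv

It still opens and closes a file for every line, so it shouldn't be any faster than the one-liner, just more complete.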
At this point I was satisfied. But I couldn't help but wonder: is there a node module that will do this? And will it be faster? The answers are yes and yes. That module, csv-split, does the job in about 30 seconds, keeps the line with the column names, and its syntax is easier to understand:
csv-split -i ../data.csv -b State -o out