Revisiting splitting a large .csv

A follow-up to a years-old post titled Options for splitting a large .csv into smaller files: awk (technically gawk) is the best way to do this.

I wanted to tinker with a 4.5 GB .csv file. It has data from 2012 to 2020, so I wanted to split it by year. There isn't a year column, but each row has a date column in the format mm/dd/yyyy. It turns out gawk does make this pretty easy, even if the syntax is funky:

gawk -F ',' 'NR==1{ h=$0 }NR>1{ print (!a[substr($2,7,4)]++? h ORS $0 : $0) > substr($2,7,4)".csv" }' big.csv  

This mostly came from a Unix & Linux Stack Exchange answer, and the description there does a good job of explaining what's going on. On top of that, my small changes (unrolled in the script after this list):

  • specify the delimiter with -F (gawk defaults to whitespace)
  • use substr() to pull out the year from the second column ($2)
  • add column headers when starting a new file
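
For anyone who finds the one-liner hard to parse, here's the same logic unrolled into a standalone script. This is my reading of it, a sketch rather than the original author's formulation:

# split_by_year.awk -- the one-liner, expanded with comments
BEGIN   { FS = "," }              # same as -F ','
NR == 1 { h = $0; next }          # remember the header row
{
    year = substr($2, 7, 4)       # characters 7-10 of "mm/dd/yyyy"
    out  = year ".csv"
    if (!seen[year]++)            # first row for this year?
        print h > out             # write the header first
    print > out                   # then the row itself
}

Run it with gawk -f split_by_year.awk big.csv and you should get the same per-year files.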

Losing column headers was why I'd avoided awk in the past. By keeping the header row in a variable, h, and writing it out whenever a year shows up for the first time (the !a[substr($2,7,4)]++ check), this does exactly what you want when splitting one .csv file into many.
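
The !a[key]++ pattern is a general awk idiom for "first time seeing this key": the array lookup is 0 (false) on the first encounter, and the post-increment makes it truthy for every later one. A tiny illustration with made-up input:

printf '2012\n2013\n2012\n' | gawk '!seen[$0]++ { print "first time seeing", $0 }'

which prints "first time seeing 2012" and "first time seeing 2013", skipping the repeat.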

It's simple to get gawk from Homebrew, so the bar for using this tool is pretty low. This command gets through that 4.5 GB .csv file in a couple of minutes.
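
If you don't already have it, brew install gawk gets you there. And as a quick sanity check afterwards, something like head -n 1 2012.csv (assuming 2012.csv is one of the outputs) should show the header row carried over.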