Sorting a massive file
I want to count the number of unique genotype patterns in a VCF file. First I convert it to text with bcftools query:
bcftools query -f '[%GT]\n' vcf_in.vcf.gz > patterns.txt
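Each output line is the genotypes of all samples at one site, concatenated. As a hedged illustration (fabricated data, not from the real VCF), with three samples two of the three sites below share the same pattern:

```shell
# Hypothetical patterns.txt content: one line per site, the [%GT] format
# concatenates every sample's genotype. Sites 1 and 3 have the same pattern.
printf '0/00/11/1\n0/00/00/0\n0/00/11/1\n' > patterns_example.txt
cat patterns_example.txt
```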
The resulting patterns.txt is about 100 GB. The best way I found to count the unique patterns was with the following command:
LC_ALL=C sort -u --parallel=4 -S 990M -T ~/tmp_sort_files patterns.txt | wc -l
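What each flag buys you, demonstrated on a toy file (the directory and file names here are placeholders):

```shell
# LC_ALL=C forces byte-wise comparison, which is much faster than
# locale-aware collation. -u deduplicates during the sort itself,
# -S caps the in-memory buffer, and -T redirects the temporary spill
# files to a directory with enough free space.
mkdir -p tmp_sort_files
printf 'a\nb\na\nc\nb\n' > toy_patterns.txt
LC_ALL=C sort -u --parallel=2 -S 10M -T tmp_sort_files toy_patterns.txt | wc -l
# prints 3 (the unique lines a, b, c)
```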
This used 1063 MB of RAM, took 1521 s (about 25 minutes), and used a maximum of around 75 GB of temporary space in my home directory (as the /tmp drive on the cluster ran out of space).
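One trade-off worth noting: if the number of *unique* patterns (rather than the total) is small enough to fit in memory, a hash-based approach avoids the temporary-disk cost entirely, at the price of RAM proportional to the unique count. A sketch of that alternative:

```shell
# awk keeps one hash entry per distinct line and prints each line the
# first time it is seen; piping to wc -l yields the unique count.
# RAM use scales with the number of unique lines, not the file size.
printf 'a\nb\na\nc\nb\n' > toy_patterns.txt
awk '!seen[$0]++' toy_patterns.txt | wc -l
# prints 3
```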
With thanks to http://unix.stackexchange.com/questions/120096/how-to-sort-big-files