Command-line Arguments
If you have not already, I encourage you to copy the "new_py.py" script into your $PATH and then execute it with the -a argument to start a new script that uses argparse:
$ ./new_py.py -a test
Done, see new script "test.py."
If you check out the new script, it has a get_args function that shows you how to create named arguments for strings, integers, and Boolean flags, as well as positional arguments:
$ cat -n test.py
 1  #!/usr/bin/env python3
 2
 3  import argparse
 4  import os
 5  import sys
 6
 7  # --------------------------------------------------
 8  def get_args():
 9      parser = argparse.ArgumentParser(description='Argparse Python script')
10      parser.add_argument('positional', metavar='str', help='A positional argument')
11      parser.add_argument('-a', '--arg', help='A named string argument',
12                          metavar='str', type=str, default='')
13      parser.add_argument('-i', '--int', help='A named integer argument',
14                          metavar='int', type=int, default=0)
15      parser.add_argument('-f', '--flag', help='A boolean flag',
16                          action='store_true')
17      return parser.parse_args()
18
19  # --------------------------------------------------
20  def main():
21      args = get_args()
22      str_arg = args.arg
23      int_arg = args.int
24      flag_arg = args.flag
25      pos_arg = args.positional
26
27      print('str_arg = "{}"'.format(str_arg))
28      print('int_arg = "{}"'.format(int_arg))
29      print('flag_arg = "{}"'.format(flag_arg))
30      print('positional = "{}"'.format(pos_arg))
31
32  # --------------------------------------------------
33  if __name__ == '__main__':
34      main()
If you run without any arguments, you get a nice usage statement:
$ ./test.py
usage: test.py [-h] [-a str] [-i int] [-f] str
test.py: error: the following arguments are required: str
$ ./test.py foobar
str_arg = ""
int_arg = "0"
flag_arg = "False"
positional = "foobar"
$ ./test.py -a XYZ -i 42 -f foobar
str_arg = "XYZ"
int_arg = "42"
flag_arg = "True"
Please RTFM (https://docs.python.org/3/library/argparse.html) to find all the other Fine things you can do with argparse.
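For example, beyond what the template generates, argparse can restrict a named argument to a fixed set of values with "choices", accept more than one value with "nargs", and even open files for you with "type=argparse.FileType". A minimal sketch (the argument names here are only for illustration, not something new_py.py produces):

import argparse

parser = argparse.ArgumentParser(description='A few more argparse features')
# Only accept one of two values for --format
parser.add_argument('-f', '--format', choices=['table', 'csv'], default='table')
# Accept one or more file names and open each one for reading
parser.add_argument('files', metavar='FILE', nargs='+',
                    type=argparse.FileType('r'), help='Input file(s)')
args = parser.parse_args()
print(args.format, [fh.name for fh in args.files])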
CSV Files
Delimited text files are a standard way to distribute non- or semi-hierarchical data -- i.e., records that can each be represented on one line. (When you get into data that have relationships, e.g., parents/children, then structures like XML and JSON are more appropriate, which is not to say that people haven't sorely abused this venerable format, e.g., GFF3.) Let's first take a look at the csv module in Python (https://docs.python.org/3/library/csv.html) to parse the output from Centrifuge (http://www.ccb.jhu.edu/software/centrifuge/).
For this, we'll use some data from a study from Yellowstone National Park (https://www.imicrobe.us/sample/view/1378): 1) a file ("YELLOWSTONE_SMPL_20723.sum") showing the taxonomy ID for each read Centrifuge was able to classify, and 2) a file ("YELLOWSTONE_SMPL_20723.tsv") with the complete taxonomy information for each taxonomy ID. One record from the first looks like this:
readID: Yellowstone_READ_00007510
seqID: cid|321327
taxID: 321327
score: 640000
2ndBestScore: 0
hitLength: 815
queryLength: 839
numMatches: 1
One from the second looks like this:
name: synthetic construct
taxID: 32630
taxRank: species
genomeSize: 26537524
numReads: 19
numUniqueReads: 19
abundance: 0.0
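Both files are tab-delimited with a header row, which is exactly the shape csv.DictReader expects. As a quick sketch (assuming the ".sum" file above is in the current directory), reading its first record into a dictionary looks like this:

import csv

with open('YELLOWSTONE_SMPL_20723.sum') as fh:
    reader = csv.DictReader(fh, delimiter='\t')
    print(next(reader))  # a dict keyed by the column headers (readID, seqID, taxID, ...)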
Read Count by taxID
Let's write a program that shows a table of the number of records for each "taxID":
$ cat -n read_count_by_taxid.py
 1  #!/usr/bin/env python3
 2  """Counts by taxID"""
 3
 4  import csv
 5  import os
 6  import sys
 7  from collections import defaultdict
 8
 9  args = sys.argv[1:]
10
11  if len(args) != 1:
12      print('Usage: {} SAMPLE.SUM'.format(os.path.basename(sys.argv[0])))
13      sys.exit(1)
14
15  sum_file = args[0]
16
17  _, ext = os.path.splitext(sum_file)
18  if not ext == '.sum':
19      print('File extension "{}" is not ".sum"'.format(ext))
20      sys.exit(1)
21
22  counts = defaultdict(int)
23  with open(sum_file) as csvfile:
24      reader = csv.DictReader(csvfile, delimiter='\t')
25      for row in reader:
26          taxID = row['taxID']
27          counts[taxID] += 1
28
29  print('\t'.join(['count', 'taxID']))
30  for taxID, count in counts.items():
31      print('\t'.join([str(count), taxID]))
As always, it prints a "usage" statement when run with no arguments. It also uses the os.path.splitext function to get the file extension and make sure that it is ".sum". Finally, if the input looks OK, it uses the csv.DictReader class to parse each record of the file into a dictionary:
$ ./read_count_by_taxid.py
Usage: read_count_by_taxid.py SAMPLE.SUM
$ ./read_count_by_taxid.py YELLOWSTONE_SMPL_20723.tsv
File extention ".tsv" is not ".sum"
$ ./read_count_by_taxid.py YELLOWSTONE_SMPL_20723.sum
count taxID
80 321332
6432 321327
19 32630
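The splitext call is doing the filename work here: it splits off the final extension so we can test it. For example:

import os

print(os.path.splitext('YELLOWSTONE_SMPL_20723.sum'))
# ('YELLOWSTONE_SMPL_20723', '.sum')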
That's a start, but most people would rather see a species name than the NCBI taxonomy ID, so we'll need to look up the taxIDs in the ".tsv" file:
$ cat -n read_count_by_tax_name.py
 1  #!/usr/bin/env python3
 2  """Counts by tax name"""
 3
 4  import csv
 5  import os
 6  import sys
 7  from collections import defaultdict
 8
 9  args = sys.argv[1:]
10
11  if len(args) != 1:
12      print('Usage: {} SAMPLE.SUM'.format(os.path.basename(sys.argv[0])))
13      sys.exit(1)
14
15  sum_file = args[0]
16
17  basename, ext = os.path.splitext(sum_file)
18  if not ext == '.sum':
19      print('File extension "{}" is not ".sum"'.format(ext))
20      sys.exit(1)
21
22  tsv_file = basename + '.tsv'
23  if not os.path.isfile(tsv_file):
24      print('Cannot find expected TSV "{}"'.format(tsv_file))
25      sys.exit(1)
26
27  tax_name = {}
28  with open(tsv_file) as csvfile:
29      reader = csv.DictReader(csvfile, delimiter='\t')
30      for row in reader:
31          tax_name[row['taxID']] = row['name']
32
33  counts = defaultdict(int)
34  with open(sum_file) as csvfile:
35      reader = csv.DictReader(csvfile, delimiter='\t')
36      for row in reader:
37          taxID = row['taxID']
38          counts[taxID] += 1
39
40  print('\t'.join(['count', 'name']))
41  for taxID, count in counts.items():
42      name = tax_name.get(taxID) or 'NA'
43      print('\t'.join([str(count), name]))
$ ./read_count_by_tax_name.py YELLOWSTONE_SMPL_20723.sum
count name
6432 Synechococcus sp. JA-3-3Ab
80 Synechococcus sp. JA-2-3B'a(2-13)
19 synthetic construct
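The lookup uses the dictionary's get method so that an unknown taxID falls back to "NA" instead of raising a KeyError. In isolation (the second ID below is made up for illustration):

tax_name = {'32630': 'synthetic construct'}

print(tax_name.get('32630') or 'NA')   # synthetic construct
print(tax_name.get('99999') or 'NA')   # NA -- get returns None, so the "or" kicks in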
tabchk
A huge chunk of my time is spent doing ETL operations -- extract, transform, load -- meaning someone sends me data (Excel or delimited text, JSON/XML), and I put it into some sort of database. I usually want to inspect the data first, and it's hard to do that when it's in columnar format; I'd rather see it formatted vertically. E.g., here is an export of some iMicrobe sample annotations:
$ tabchk.py -d imicrobe-blast-annots.txt
// ****** Record 1 ****** //
sample_id : 1
project_name : Acid Mine Drainage Metagenome
ontologies : ENVO:01000033 (oceanic pelagic zone biome),ENVO:00002149 (sea water),ENVO:00000015 (ocean)
project_id : 1
source_mat_id : 33
sample_type : metagenome
sample_description : ACID_MINE_02 - 5-Way (CG) Acid Mine Drainage Biofilm
sample_name : ACID_MINE_02
collection_start_time : 2002-03-01 00:00:00.0
collection_stop_time : 2002-03-01 00:00:00.0
site_name : Richmond Mine
latitude : 40.666668
longitude : -122.51667
site_description : Acid Mine Drainage
country : UNITED STATES
region : Iron Mountain, CA
library_acc : JGI_AMD_5WAY_IRNMTN_LIB_20020301
sequencing_method : dideoxysequencing (Sanger)
dna_type : gDNA
num_of_reads : 180713
habitat_name : waste water
temperature : 42
sample_acc : JGI_AMD_5WAY_IRNMTN_SMPL_20020301
If you open that file in a text editor, you will see there are many more column headers than are shown here. The lines are extremely long, and it's a pretty sparse matrix of data. The -d flag tells the program to show a "dense" matrix, i.e., to leave out the empty fields. I find the above format much easier to read. Here is the program to do that:
$ cat -n tabchk.py
 1  #!/usr/bin/env python3
 2
 3  import argparse
 4  import csv
 5  import os
 6  import sys
 7
 8  # --------------------------------------------------
 9  def get_args():
10      parser = argparse.ArgumentParser(description='Check a delimited text file')
11      parser.add_argument('file', metavar='str', help='File')
12      parser.add_argument('-s', '--sep', help='Field separator',
13                          metavar='str', type=str, default='\t')
14      parser.add_argument('-l', '--limit', help='How many records to show',
15                          metavar='int', type=int, default=1)
16      parser.add_argument('-d', '--dense', help='Not sparse (skip empty fields)',
17                          action='store_true')
18      return parser.parse_args()
19
20  # --------------------------------------------------
21  def main():
22      args = get_args()
23      file = args.file
24      limit = args.limit
25      sep = args.sep
26      dense = args.dense
27
28      if not os.path.isfile(file):
29          print('"{}" is not a file'.format(file))
30          sys.exit(1)
31
32      with open(file) as csvfile:
33          reader = csv.DictReader(csvfile, delimiter=sep)
34
35          for i, row in enumerate(reader):
36              vals = dict([x for x in row.items() if x[1] != '']) if dense else row
37              flds = vals.keys()
38              longest = max(map(len, flds))
39              fmt = '{:' + str(longest + 1) + '}: {}'
40              print('// ****** Record {} ****** //'.format(i+1))
41              for key, val in vals.items():
42                  print(fmt.format(key, val))
43
44              if i + 1 == limit:
45                  break
46
47  # --------------------------------------------------
48  if __name__ == '__main__':
49      main()
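The only tricky line is the one that builds "vals": when --dense is in effect, it keeps only the (key, value) pairs whose value is not an empty string. A standalone illustration of that filtering (with a made-up row):

row = {'sample_id': '1', 'region': '', 'country': 'UNITED STATES'}
dense = dict([x for x in row.items() if x[1] != ''])
print(dense)  # {'sample_id': '1', 'country': 'UNITED STATES'}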
FASTA
Now let's finally get into parsing good old FASTA files. The argparse and csv modules are standard in Python, but for FASTA we're going to need to install the BioPython (http://biopython.org/) module. This should work for you:
$ python3 -m pip install biopython
For this exercise, I'll use a few reads from the Global Ocean Sampling Expedition (https://imicrobe.us/sample/view/578):
$ head -5 CAM_SMPL_GS108.fa
>CAM_READ_0231669761 /library_id="CAM_LIB_GOS108XLRVAL-4F-1-400" /sample_id="CAM_SMPL_GS108" raw_id=SRA_ID=SRR066139.70645 raw_id=FG67BMZ02PUFIF
ATTTACAATAATTTAATAAAATTAACTAGAAATAAAATATTGTATGAAAATATGTTAAATAATGAAAGTTTTT
CAGATCGTTTAATAATATTTTTCTTCCATTTTGCTTTTTTCTAAAATTGTTCAAAAACAAACTTCAAAGGAAA
ATCTTCAAAATTTACATGATTTTATATTTAAACAAATAGAGTTAAGTATAAGAGAAATTGGATATGGTGATGC
TTCAATAAATAAAAAAATGAAAGAGTATGTCAATGTGATGTACGCAATAATTGACAAAGTTGATTCATGGGAA
Let's write a script that mimics seqmagick:
$ seqmagick info CAM_SMPL_GS108.fa
name alignment min_len max_len avg_len num_seqs
CAM_SMPL_GS108.fa FALSE 168 440 325.60 5
We'll skip the "alignment" and just do min/max/avg lengths and the number of sequences. You can pretty much copy and paste the example code from http://biopython.org/wiki/SeqIO. Here is the output from our script, "seqmagique.py":
$ ./seqmagique.py *.fa
name min_len max_len avg_len num_seqs
CAM_SMPL_GS108.fa 168 440 325.6 5
CAM_SMPL_GS112.fa 71 517 398.4 5
The code to produce this builds on our earlier skills with lists and dictionaries: we parse each file and save a dictionary of stats into a list, then iterate over that list at the end to show the output.
$ cat -n seqmagique.py
 1  #!/usr/bin/env python3
 2  """Mimic seqmagick, print stats on FASTA sequences"""
 3
 4  import os
 5  import sys
 6  from statistics import mean
 7  from Bio import SeqIO
 8
 9  files = sys.argv[1:]
10
11  if len(files) < 1:
12      print('Usage: {} F1.fa [F2.fa...]'.format(os.path.basename(sys.argv[0])))
13      sys.exit(1)
14
15  info = []
16  for file in files:
17      lengths = []
18      for record in SeqIO.parse(file, "fasta"):
19          lengths.append(len(record.seq))
20
21      info.append({'name': os.path.basename(file),
22                   'min_len': min(lengths),
23                   'max_len': max(lengths),
24                   'avg_len': mean(lengths),
25                   'num_seqs': len(lengths)})
26
27  if len(info):
28      longest_file_name = max([len(f['name']) for f in info])
29      fmt = '{:' + str(longest_file_name) + '} {:>10} {:>10} {:>10} {:>10}'
30      flds = ['name', 'min_len', 'max_len', 'avg_len', 'num_seqs']
31      print(fmt.format(*flds))
32      for rec in info:
33          print(fmt.format(*[rec[fld] for fld in flds]))
34  else:
35      print('I had trouble parsing your data')
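One trick worth noting is that the format string is built at runtime so the first column is exactly as wide as the longest file name. A small illustration of the idea (the width and values here are just examples):

longest = len('CAM_SMPL_GS108.fa')
fmt = '{:' + str(longest) + '} {:>10}'
print(fmt.format('name', 'num_seqs'))
print(fmt.format('CAM_SMPL_GS108.fa', 5))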
FASTA subset
Sometimes you may only want to use part of a FASTA file, e.g., you want the first 1,000 sequences to test some code, or you have samples that vary wildly in size and you want to sub-sample them down to an equal number of reads. Here is a Python program that will write the first N sequences to a given output directory:
$ cat -n fa_subset.py
 1  #!/usr/bin/env python3
 2  """Subset FASTA files"""
 3
 4  import argparse
 5  import os
 6  import sys
 7  from Bio import SeqIO
 8
 9  # --------------------------------------------------
10  def get_args():
11      """get args"""
12      parser = argparse.ArgumentParser(description='Split FASTA files')
13      parser.add_argument('fasta', help='FASTA input file', metavar='FILE')
14      parser.add_argument('-n', '--num', help='Number of records per file',
15                          type=int, metavar='NUM', default=500000)
16      parser.add_argument('-o', '--out_dir', help='Output directory',
17                          type=str, metavar='DIR', default='subset')
18      return parser.parse_args()
19
20  # --------------------------------------------------
21  def main():
22      """main"""
23      args = get_args()
24      fasta = args.fasta
25      out_dir = args.out_dir
26      num_seqs = args.num
27
28      if not os.path.isfile(fasta):
29          print('--fasta "{}" is not valid'.format(fasta))
30          sys.exit(1)
31
32      if os.path.dirname(fasta) == out_dir:
33          print('--out_dir cannot be the same as input files')
34          sys.exit(1)
35
36      if num_seqs < 1:
37          print("--num cannot be less than one")
38          sys.exit(1)
39
40      if not os.path.isdir(out_dir):
41          os.mkdir(out_dir)
42
43      basename = os.path.basename(fasta)
44      out_file = os.path.join(out_dir, basename)
45      out_fh = open(out_file, 'wt')
46      num_written = 0
47
48      for record in SeqIO.parse(fasta, "fasta"):
49          SeqIO.write(record, out_fh, "fasta")
50          num_written += 1
51
52          if num_written == num_seqs:
53              break
54
55      print('Done, wrote {} sequence{} to "{}"'.format(
56          num_written, '' if num_written == 1 else 's', out_file))
57
58  # --------------------------------------------------
59  if __name__ == '__main__':
60      main()
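For instance, using the GOS sample from the seqmagique table above (which has only five sequences), asking for up to 1,000 sequences would simply copy all five. The closing message below is what the program's final print statement would produce, not a captured run:

$ ./fa_subset.py -n 1000 -o gos_subset CAM_SMPL_GS108.fa
Done, wrote 5 sequences to "gos_subset/CAM_SMPL_GS108.fa"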
FASTA splitter
I seem to have implemented my own FASTA splitter a few times in as many languages. Here is one that writes a maximum number of sequences to each output file. It would not be hard to instead write a maximum number of bytes (a sketch of that variant follows the listing below), but, for the short reads I usually handle, this works fine. Again I will use the BioPython SeqIO module to parse the FASTA files:
$ cat -n fasplit.py
 1  #!/usr/bin/env python3
 2  """split FASTA files"""
 3
 4  import argparse
 5  import os
 6  from Bio import SeqIO
 7
 8  # --------------------------------------------------
 9  def main():
10      """main"""
11      args = get_args()
12      fasta = args.fasta
13      out_dir = args.out_dir
14      max_per = args.num
15
16      if not os.path.isfile(fasta):
17          print('--fasta "{}" is not valid'.format(fasta))
18          exit(1)
19
20      if not os.path.isdir(out_dir):
21          os.mkdir(out_dir)
22
23      if max_per < 1:
24          print("--num cannot be less than one")
25          exit(1)
26
27      i = 0
28      nseq = 0
29      nfile = 0
30      out_fh = None
31      basename, ext = os.path.splitext(os.path.basename(fasta))
32
33      for record in SeqIO.parse(fasta, "fasta"):
34          if i == max_per:
35              i = 0
36              if out_fh is not None:
37                  out_fh.close()
38                  out_fh = None
39
40          i += 1
41          nseq += 1
42          if out_fh is None:
43              nfile += 1
44              path = os.path.join(out_dir, basename + '.' + str(nfile) + ext)
45              out_fh = open(path, 'wt')
46
47          SeqIO.write(record, out_fh, "fasta")
48
49      print('Done, wrote {} sequence{} to {} file{}'.format(
50          nseq, '' if nseq == 1 else 's',
51          nfile, '' if nfile == 1 else 's'))
52
53  # --------------------------------------------------
54  def get_args():
55      """get args"""
56      parser = argparse.ArgumentParser(description='Split FASTA files')
57      parser.add_argument('-f', '--fasta', help='FASTA input file',
58                          type=str, metavar='FILE', required=True)
59      parser.add_argument('-n', '--num', help='Number of records per file',
60                          type=int, metavar='NUM', default=50)
61      parser.add_argument('-o', '--out_dir', help='Output directory',
62                          type=str, metavar='DIR', default='fasplit')
63      return parser.parse_args()
64
65  # --------------------------------------------------
66  if __name__ == '__main__':
67      main()
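As mentioned above, capping each output file by size instead of by sequence count is a small change to the same loop. Here is a hedged sketch of just that idea (the variable names, the example input file, and the output directory are mine, not part of fasplit.py; a real version would take max_bytes as an argument):

import os
from Bio import SeqIO

fasta = 'POV_L.Sum.O.1000m_reads.fa'   # example input from the make target above
out_dir = 'pov_bytes'                   # hypothetical output directory
max_bytes = 1000000                     # hypothetical per-file size cap
basename, ext = os.path.splitext(os.path.basename(fasta))

os.makedirs(out_dir, exist_ok=True)
nbytes, nfile, out_fh = 0, 0, None
for record in SeqIO.parse(fasta, "fasta"):
    chunk = record.format("fasta")      # this record rendered as FASTA text
    if out_fh is None or nbytes + len(chunk) > max_bytes:
        if out_fh is not None:
            out_fh.close()
        nfile += 1
        out_fh = open(os.path.join(out_dir, basename + '.' + str(nfile) + ext), 'wt')
        nbytes = 0
    out_fh.write(chunk)
    nbytes += len(chunk)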
If you type make in the "python/fasta-splitter" directory, you should see:
$ make
./fasplit.py -f POV_L.Sum.O.1000m_reads.fa -o pov -n 100
Done, wrote 2061 sequences to 21 files
We can verify that things worked:
$ for file in *; do echo -n $file && grep '^>' $file | wc -l; done
POV_L.Sum.O.1000m_reads.1.fa 100
POV_L.Sum.O.1000m_reads.10.fa 100
POV_L.Sum.O.1000m_reads.11.fa 100
POV_L.Sum.O.1000m_reads.12.fa 100
POV_L.Sum.O.1000m_reads.13.fa 100
POV_L.Sum.O.1000m_reads.14.fa 100
POV_L.Sum.O.1000m_reads.15.fa 100
POV_L.Sum.O.1000m_reads.16.fa 100
POV_L.Sum.O.1000m_reads.17.fa 100
POV_L.Sum.O.1000m_reads.18.fa 100
POV_L.Sum.O.1000m_reads.19.fa 100
POV_L.Sum.O.1000m_reads.2.fa 100
POV_L.Sum.O.1000m_reads.20.fa 100
POV_L.Sum.O.1000m_reads.21.fa 61
POV_L.Sum.O.1000m_reads.3.fa 100
POV_L.Sum.O.1000m_reads.4.fa 100
POV_L.Sum.O.1000m_reads.5.fa 100
POV_L.Sum.O.1000m_reads.6.fa 100
POV_L.Sum.O.1000m_reads.7.fa 100
POV_L.Sum.O.1000m_reads.8.fa 100
POV_L.Sum.O.1000m_reads.9.fa 100
GFF
Two of the most common output formats in bioinformatics, GFF (General Feature Format) and BLAST's tab-delimited/CSV output, do not include headers, so it's up to you to merge in the headers. Additionally, some of the lines may be comments (they start with "#", just as in bash and Python), so you should skip those. Further, the last field in GFF is basically a dumping ground for whatever else the data provider felt like putting there. Usually it's a bunch of "key=value" pairs, but there's no guarantee. Let's take a look at parsing the GFF output from Prodigal:
$ cat -n parse_prodigal_gff.py
 1  #!/usr/bin/env python3
 2
 3  import argparse
 4  import csv
 5  import os
 6  import sys
 7
 8  # --------------------------------------------------
 9  def get_args():
10      parser = argparse.ArgumentParser(description='Parse Prodigal GFF')
11      parser.add_argument('gff', metavar='GFF', help='Prodigal GFF output')
12      parser.add_argument('-m', '--min', help='Minimum score',
13                          metavar='float', type=float, default=0.0)
14      return parser.parse_args()
15
16  # --------------------------------------------------
17  def main():
18      args = get_args()
19      gff_file = args.gff
20      min_score = args.min
21      flds = 'seqname source feature start end score strand frame attribute'.split()
22
23      for line in open(gff_file):
24          if line[0] == '#':
25              continue
26
27          row = dict(zip(flds, line.split('\t')))
28          attrs = {}
29          if 'attribute' in row:
30              for fld in row['attribute'].split(';'):
31                  if '=' in fld:
32                      name, val = fld.split('=')
33                      attrs[name] = val
34
35          if 'score' in attrs and float(attrs['score']) > min_score:
36              print(row)
37              break
38
39  # --------------------------------------------------
40  if __name__ == '__main__':
41      main()
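The dict(zip(...)) call is what stands in for the missing header row: it pairs our own field names with the columns of one line. In isolation (the line below is a shortened, made-up example, not real Prodigal output):

flds = ['seqname', 'source', 'feature', 'start', 'end']
line = 'contig1\tProdigal_v2.6.3\tCDS\t2\t181'
print(dict(zip(flds, line.split('\t'))))
# {'seqname': 'contig1', 'source': 'Prodigal_v2.6.3', 'feature': 'CDS', 'start': '2', 'end': '181'}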