Unix exercises
Note: When you see a "$" given in the example prompts, it is a metacharacter indicating that this is the prompt for a normal (not super-user) account. Your default prompt may be different, and it is highly configurable (search for "PS1 unix prompt" to learn more). Anyway, the point is that you should type (or copy/paste) all the stuff after the "$". If you ever see a prompt with "#," it's indicating a command that should be run as the super-user/root account, e.g., installing some software into a system-wide directory so it can be shared by all users.
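For example, you could set a minimal bash prompt like this (the "\u," "\h," and "\w" escapes stand for your username, hostname, and current directory, and "\$" itself prints "$" for normal users and "#" for root):
$ PS1='\u@\h:\w\$ '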
Number of unique users
Find the number of unique users on a shared system
We know that "w" will tell us the users logged in. Try it now on a system that has many users (i.e., not your laptop) and see the output. We'll connect the output of "w" to "head" using a pipe "|", but we only want the first five lines:
$ w | head -5
14:36:01 up 21 days, 21:51, 176 users, load average: 3.83, 4.31, 4.47
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
antontre pts/1 149.165.156.129 Sat14 3days 0.10s 0.10s /bin/sh -i
huddack pts/3 128.4.131.189 09:38 4:56m 0.15s 0.15s -bash
We want to see the first five users, not the first five lines of output. To skip the first two lines of headers from "w," we first pipe "w" into "awk" and tell it we only want to see output when the Number of Records (NR) is greater than 2:
$ w | awk 'NR>2' | head -5
antontre pts/1 149.165.156.129 Sat14 4days 0.10s 0.10s /bin/sh -i
huddack pts/3 128.4.131.189 09:38 5:13m 0.15s 0.15s -bash
antontre pts/5 149.165.156.129 Sun19 2days 0.14s 0.14s /bin/sh -i
minyard pts/8 129.114.64.18 29Jul16 4:24m 3:46m 3:46m top
antontre pts/11 149.165.156.129 Sun23 2days 0.24s 0.24s /bin/sh -i
Let's "cut" out just the first column of data. The manpage for "cut" says that it defaults to using the tab character to determine columns, so we'll need to tell it to use spaces:
$ w | awk 'NR>2' | head -5 | cut -d ' ' -f 1
antontre
huddack
antontre
minyard
antontre
We can see right away that some users like "antontre" are logged in multiple times, so let's "uniq" that output:
$ w | awk 'NR>2' | head -5 | cut -d ' ' -f 1 | uniq
antontre
huddack
antontre
minyard
antontre
Hmm, that's not right. Remember I said earlier that "uniq" only works on sorted input? That's because it only collapses adjacent duplicate lines. So let's sort those names first:
$ w | awk 'NR>2' | head -5 | cut -d ' ' -f 1 | sort | uniq
antontre
huddack
minyard
OK, that is correct. Now let's remove the "head -5" and use "wc" to count all the lines (-l) of input:
$ w | awk 'NR>2' | cut -d ' ' -f 1 | sort | uniq | wc -l
138
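By the way, "sort" can do the de-duplication itself with the "-u" flag, so we could shorten that last pipeline:
$ w | awk 'NR>2' | cut -d ' ' -f 1 | sort -u | wc -l
138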
So what you see is that we're connecting small, well-defined programs together using pipes to connect the "standard input" (STDIN) and "standard output" (STDOUT) streams. There's a third basic file handle in Unix called "standard error" (STDERR) that we'll come across later. It's a way for programs to report problems without simply dying. You can redirect errors into a file like so:
$ program 2>err
$ program 1>out 2>err
The first example puts STDERR into a file called "err" and lets STDOUT print to the terminal. The second example captures STDOUT into a file called "out" while STDERR goes to "err."
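You can also merge STDERR into STDOUT so both streams land in one file:
$ program > all-output 2>&1
Order matters here: the "2>&1" must come after "> all-output," because it means "send STDERR to wherever STDOUT currently points."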
Protip: Sometimes a program will complain about things that you cannot fix, e.g., "find" may complain about file permissions that you don't care about. In those cases, you can redirect STDERR to a special device called "/dev/null" where the complaints are forgotten forever. Kind of like the "memory hole" in 1984.
$ find / -name my-file.txt 2>/dev/null
Count "oo" words
On almost every Unix system, you can find "/usr/share/dict/words." Let's use "grep" to find how many have the "oo" vowel combination. It's a long list, so I'll pipe it into "head" to see just the first five:
$ grep 'oo' /usr/share/dict/words | head -5
abloom
aboon
aboveproof
abrood
abrook
Yes, that works, so redirect those words into a file and count them. (The "!$" below is a history shortcut that expands to the last argument of the previous command, here "oo-words".)
$ grep 'oo' /usr/share/dict/words > oo-words
$ wc -l !$
10460 oo-words
Let's count them directly out of "grep":
$ grep 'oo' /usr/share/dict/words | wc -l
10460
How many of those words additionally contain the "ow" sequence?
$ grep 'oo' /usr/share/dict/words | grep 'ow' | wc -l
158
How many do not contain the "ow" sequence?
$ grep 'oo' /usr/share/dict/words | grep -v 'ow' | wc -l
10302
Do those numbers add up? (The "<<<" is a bash "here string" that feeds the text on its right to "bc" on STDIN.)
$ bc <<< 158+10302
10460
Excellent. Smithers, massage my brain.
Gapminder
Do the following:
$ git clone https://github.com/kyclark/metagenomics-book
$ cd metagenomics-book/problems/gapminder/data
How many files are in the "data" directory?
$ ls | wc -l
How many lines are in each/all of the files?
$ wc -l *
You can use "cat" to spew the entire contents of a file into your shell, but if you'd just like to see the top of a file, you can use:
$ head Trinidad_and_Tobago.cc.txt
If you only want to see 5 lines, use "-n 5" or "-5".
For our exercise, we'd like to combine all the files into one file we can analyze. That's easy enough with:
$ cat * > all.txt
Let's use "head" to look at the top of the file:
$ head -5 all.txt
Afghanistan 1997 22227415 Asia 41.763 635.341351
Afghanistan 2002 25268405 Asia 42.129 726.7340548
Afghanistan 2007 31889923 Asia 43.828 974.5803384
Afghanistan 1952 8425333 Asia 28.801 779.4453145
Afghanistan 1957 9240934 Asia 30.332 820.8530296
Hmm, there are no column headers. Let's fix that. There's one file that's pretty different in content (it has only one line) and name ("country.cc.txt"):
$ cat country.cc.txt
country year pop continent lifeExp gdpPercap
Those are the headers that you can prepend to all the other files to get named columns, something very important if you want to look at the data in Excel or in R/Python data frames.
$ rm all.txt
$ mv country.cc.txt headers
$ cat headers *.txt > all.txt
$ head -5 all.txt | column -t
country year pop continent lifeExp gdpPercap
Afghanistan 1997 22227415 Asia 41.763 635.341351
Afghanistan 2002 25268405 Asia 42.129 726.7340548
Afghanistan 2007 31889923 Asia 43.828 974.5803384
Afghanistan 1952 8425333 Asia 28.801 779.4453145
Yes, that looks much better. Double-check that the number of lines in "all.txt" matches the total number of lines of input:
$ wc -l *.cc.txt headers
$ wc -l all.txt
How many observations do we have for 1952?
$ grep 1952 all.txt | wc -l
$ cut -f 2 *.cc.txt | grep 1952 | wc -l
Those numbers aren't the same! Why is that?
$ grep 1952 all.txt | cut -f 2 | sort | uniq -c
142 1952
1 1982
1 1987
$ grep 1952 all.txt | grep 198[27]
Lebanon 1982 3086876 Asia 66.983 7640.519521
Mozambique 1987 12891952 Africa 42.861 389.8761846
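A more precise count matches "1952" only in the year column. We'll properly introduce "awk" in a moment, but here is a preview that gives the expected 142:
$ awk -F"\t" '$2 == "1952"' all.txt | wc -l
142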
How many observations for every year?
$ cut -f 2 *.cc.txt | sort | uniq -c
How many observations are present for Africa?
$ grep Africa all.txt | wc -l
How many for each continent?
$ cut -f 4 *.cc.txt | sort | uniq -c
What was the world population in 1952? As we've seen, just using "grep 1952" is not sufficient. We want to take the 3rd column only if the 2nd column is equal to "1952," and "awk" will let us do just that. Normally "awk" splits on whitespace, so we need to use -F"\t" to tell it to split on the tab ("\t") character. Use "man awk" to learn more.
$ awk -F"\t" '$2 == "1952" { print $3 }' *.cc.txt
I'll bet you didn't notice that one of those numbers was in scientific notation. That's going to cause a problem. Here it is:
$ awk -F"\t" '$2 == "1952" { print $3 }' *.cc.txt | grep [a-z]
3.72e+08
We have to throw in a "grep -v" to get rid of that (the "-v" reverses the match), then use the "paste" command to put a "+" in between all the numbers:
$ awk -F"\t" '$2 == "1952" { print $3 }' *.cc.txt | grep -v [a-z] | paste -sd+ -
and then we pipe that to the "bc" calculator:
$ awk -F"\t" '$2 == "1952" { print $3 }' *.cc.txt | grep -v [a-z] | paste -sd+ - | bc
2034957150.999989
It bothers me that it's not an integer, so I'm going to use "printf" in the "awk" command to print each population as a plain integer. Note this does more than trim decimals: it also rewrites the scientific-notation value as ordinary digits, so "grep -v" no longer throws it away, which is why the total jumps up:
$ awk -F"\t" '$2 == "1952" { printf "%d\n", $3 }' *.cc.txt | grep -v [a-z] | paste -sd+ - | bc
2406957150
How did population change over the years? Let's put a list of the unique years into a file called "years" and then loop over that file to run the above for each year. Note that the awk program below is wrapped in double quotes so the shell can interpolate "$year" into it; the "\$2" and "\$3" are escaped so that awk, not the shell, interprets those:
$ cut -f 2 *.cc.txt | sort | uniq > years
$ for year in `cat years`; do echo -n $year ": " && awk -F"\t" "\$2 == $year { printf \"%d\n\", \$3 }" *.cc.txt | grep -v [a-z] | paste -sd+ - | bc; done
1952 : 2406957150
1957 : 2664404580
1962 : 2899782974
1967 : 3217478384
1972 : 3576977158
1977 : 3930045807
1982 : 4289436840
1987 : 4691477418
1992 : 5110710260
1997 : 5515204472
2002 : 5886977579
2007 : 6251013179
That's kind of useful! Here's how I might put that into a script:
$ cat pop-years.sh
#!/bin/bash

set -u

YEARS="years"
cut -f 2 ./*.cc.txt | sort | uniq > "$YEARS"

NUM=$(wc -l "$YEARS" | awk '{print $1}')

if [[ "$NUM" -lt 1 ]]; then
    echo "No years ($NUM)!"
    exit 1
fi

while read -r YEAR; do
    echo -n "$YEAR: "
    awk -F"\t" "\$2 == $YEAR { printf \"%d\\n\", \$3 }" ./*.cc.txt | grep -v "[a-z]" | paste -sd+ - | bc
done < "$YEARS"
$ ./pop-years.sh
1952: 2406957150
1957: 2664404580
1962: 2899782974
1967: 3217478384
1972: 3576977158
1977: 3930045807
1982: 4289436840
1987: 4691477418
1992: 5110710260
1997: 5515204472
2002: 5886977579
2007: 6251013179
How has life expectancy changed over the years? For this we'll need to write a little Python program. I'll "cat" the program so you can see it. You can type it in with "nano" and then do "chmod +x avg.py" to make it executable (or use the one I added):
$ cat avg.py
#!/usr/bin/env python3
import sys

# convert all the command-line arguments to floating-point numbers
args = list(map(float, sys.argv[1:]))

# floor-divide the sum by the count to truncate the average
print(sum(args) // len(args))
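A quick sanity check from the command line (1 + 2 + 3 is 6, and 6 floor-divided by 3 is 2.0):
$ ./avg.py 1 2 3
2.0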
$ for year in `cat years`; do echo -n "$year: " && grep $year *.cc.txt | cut -f 5 | xargs ./avg.py; done
1952: 49.0
1957: 51.0
1962: 53.0
1967: 55.0
1972: 57.0
1977: 59.0
1982: 61.0
1987: 63.0
1992: 64.0
1997: 65.0
2002: 65.0
2007: 66.0
How many observations where the life expectancy ("lifeExp," field #5) is greater than 40? For this, let's use the "awk" tool.
$ awk -F"\t" '$5 > 40' all.txt | wc -l
How many of those are from Africa? We can either use "cut" to get the 4th field or ask "awk" to print the 4th field for us:
$ awk -F"\t" '$5 > 40' all.txt | cut -f 4 | grep Africa | wc -l
$ awk -F"\t" '$5 > 40 {print $4}' all.txt | grep Africa | wc -l
How many countries had a life expectancy greater than 70, grouped by year?
$ awk -F"\t" '$5 > 70 { print $2 }' *.cc.txt | sort | uniq -c
5 1952
9 1957
16 1962
25 1967
30 1972
38 1977
44 1982
49 1987
54 1992
65 1997
75 2002
83 2007
How could we add continent to this?
$ awk -F"\t" '$5 > 70 { print $2 ":" $4 }' *.cc.txt | sort | uniq -c
As you look at the data and want to ask more complicated questions, like how "gdpPercap" affects "lifeExp," you'll find you need more advanced tools like Python or R. Now that the data has been collated and the columns named, that will be much easier.
What if we want to add headers to each of the files?
$ mkdir wheaders
$ for file in *.cc.txt; do cat headers "$file" > "wheaders/$file"; done
Note the "*.cc.txt" glob: a plain "*.txt" would also pick up "all.txt," which already has headers.
$ wc -l wheaders/* | head -5
13 wheaders/Afghanistan.cc.txt
13 wheaders/Albania.cc.txt
13 wheaders/Algeria.cc.txt
13 wheaders/Angola.cc.txt
13 wheaders/Argentina.cc.txt
$ head wheaders/Vietnam.cc.txt
country year pop continent lifeExp gdpPercap
Vietnam 1952 26246839 Asia 40.412 605.0664917
Vietnam 1957 28998543 Asia 42.887 676.2854478
Vietnam 1962 33796140 Asia 45.363 772.0491602
Vietnam 1967 39463910 Asia 47.838 637.1232887
Vietnam 1972 44655014 Asia 50.254 699.5016441
Vietnam 1977 50533506 Asia 55.764 713.5371196
Vietnam 1982 56142181 Asia 58.816 707.2357863
Vietnam 1987 62826491 Asia 62.82 820.7994449
Vietnam 1992 69940728 Asia 67.662 989.0231487
Something with sequences
Now we will get some sequence data from the iMicrobe FTP site. Both "wget" and "ncftpget" will do the trick:
$ mkdir -p ~/work/abe487/contigs
$ cd !$
$ wget ftp://ftp.imicrobe.us/abe487/contigs/contigs.zip
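The "ncftpget" version is nearly identical, since it also accepts full FTP URLs:
$ ncftpget ftp://ftp.imicrobe.us/abe487/contigs/contigs.zip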
Protip: How do we know we got the correct data? Go back and look at that FTP site, and you will see that there is a "contigs.zip.md5" file that we can "less" on the server to view the contents:
$ ncftp ftp://ftp.imicrobe.us/abe487/contigs
NcFTP 3.2.5 (Feb 02, 2011) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 128.196.131.100...
Welcome to the imicrobe.us repository
Logging in...
Login successful.
Logged in to ftp.imicrobe.us.
Current remote directory is /abe487/contigs.
ncftp /abe487/contigs > ls
contigs.zip contigs.zip.md5
ncftp /abe487/contigs > less contigs.zip.md5
1b7e58177edea28e6441843ddc3a68ab contigs.zip
ncftp /abe487/contigs > exit
You can read up on MD5 (https://en.wikipedia.org/wiki/Md5sum) to understand that this is a signature of the file. If we calculate the MD5 of the file we downloaded and it matches what we see on the server, then we can be sure that we have the exact file that is on the FTP site:
$ md5sum contigs.zip
1b7e58177edea28e6441843ddc3a68ab contigs.zip
Yes, those two sums match. Note that sometimes the program is just called "md5."
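Protip: If you also download the ".md5" file, GNU "md5sum" can do the comparison for you with its "-c" (check) flag:
$ wget ftp://ftp.imicrobe.us/abe487/contigs/contigs.zip.md5
$ md5sum -c contigs.zip.md5
contigs.zip: OK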
So, back to the exercise. Let's unpack the contigs:
$ unzip contigs.zip
Archive: contigs.zip
inflating: group12_contigs.fasta
inflating: group20_contigs.fasta
inflating: group24_contigs.fasta
$ rm contigs.zip
These files are in FASTA format (https://en.wikipedia.org/wiki/FASTA_format), which basically looks like this:
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
Header lines start with ">", then the sequence follows. Sequences may be broken up over several lines of 50 or 80 characters, but it's just as common to see the sequences take only one (sometimes very long) line. Sequences may be nucleotides, proteins, very short DNA/RNA, longer contigs (shorter strands assembled into contiguous regions), or entire chromosomes or even genomes.
So, how many sequences are in "group12_contigs.fasta"? To answer, we just need to count how many times we see ">". We can do that with "grep":
$ grep > group12_contigs.fasta
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
What is going on? Remember that when we captured the "oo" words, we used the ">" symbol to tell Unix to redirect the output of "grep" into a file. We need to tell Unix that we mean a literal greater-than sign by placing it in single or double quotes or putting a backslash in front of it:
$ grep '>' group12_contigs.fasta
$ grep \> group12_contigs.fasta
You should actually see nothing, because something quite insidious happened with that first "grep" statement: the shell saw the ">" and truncated "group12_contigs.fasta" to receive grep's output before grep even ran, leaving grep with no pattern to search for (hence the usage message). Our original file is now empty:
$ ls -l group12_contigs.fasta
-rw-rw---- 1 kyclark staff 0 Aug 10 15:08 group12_contigs.fasta
Ugh, OK, I have to go back and "wget" the "contigs.zip" file to restore it. That's OK. Things like this happen all the time.
$ ls -lh group12_contigs.fasta
-rw-rw---- 1 kyclark staff 2.9M Aug 10 14:38 group12_contigs.fasta
Now that I have restored my data, I want to count how many greater-than signs are in the file:
$ grep '>' group12_contigs.fasta | wc -l
132
Hey, I could see doing that often. Maybe we should make this into an "alias" (see above). The problem is that the argument (the filename) falls in the middle of the chain of commands, which makes an alias tricky. Instead, we can create a bash function and add it to our .bashrc:
function countseqs() {
    grep '>' "$1" | wc -l
}
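If you want to count several files at once, a variation using "grep -c" (which counts matching lines and, given multiple files, labels each count with its filename) could look like this:
function countseqs() {
    grep -c '>' "$@"
}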
After you add this, remember to source this file to make it available:
$ source ~/.bashrc
$ countseqs group12_contigs.fasta
132
Same answer. Good. However, someone beat us to the punch. There is a powerful tool called "seqmagick" (https://github.com/fhcrc/seqmagick) that will do this (and much, much more). It's installed into the "hurwitzlab/bin" directory, or you can install it locally:
$ seqmagick info group12_contigs.fasta
name alignment min_len max_len avg_len num_seqs
group12_contigs.fasta FALSE 5136 116409 22974.30 132
Run "seqmagick -h" to see everything it can do.
Moving on, let's find how many contig IDs in "group12_contigs.fasta" contain the number "47":
$ grep 47 group12_contigs.fasta > group12_ids_with_47
$ cat !$
cat group12_ids_with_47
>Contig_247
>Contig_447
>Contig_476
>Contig_1947
>Contig_4764
>Contig_4767
>Contig_13471
Here we did a little "useless use of cat," but it's OK. We also could have used "less" to view the file. Here's another useless use of cat to copy a file:
$ cat group12_ids_with_47 > temp1_ids
Additionally, we want to copy the file again to make duplicates:
$ cp group12_ids_with_47 temp2_ids
How can we be sure these files are the same? Let's use "diff":
$ diff temp1_ids temp2_ids
You should see nothing, which is a case of "no news is good news." They don't differ in any way. We can verify this with "md5sum":
$ md5sum temp*
957390ab4c31db9500d148854f542eee temp1_ids
957390ab4c31db9500d148854f542eee temp2_ids
The two files are identical. If there were even one character of difference, they would generate different hashes.
Now we will create a file with duplicate IDs:
$ cat temp1_ids temp2_ids > duplicate_ids
Check the contents of "duplicate_ids" using "less" or "cat." Now grab all of the contig IDs from "group20_contigs.fasta" that contain the number "51" and append those new IDs to a copy of "duplicate_ids" called "multiple_ids":
$ cp duplicate_ids multiple_ids
$ grep 51 group20_contigs.fasta >> !$
grep 51 group20_contigs.fasta >> multiple_ids
Notice the ">>" arrows to indicate that we are appending to the existing "multiple_ids" file.
Remove the existing "temp" files using a "*" wildcard:
$ rm temp*
Now let's explore more of what "sort" and "uniq" can do for us. We want to find which IDs are unique and which are duplicated. If we read the manpage ("man uniq"), we see that there are "-d" and "-u" flags for doing just that. However, we've already seen that input to "uniq" needs to be sorted, so we need to remember to do that:
$ sort multiple_ids | uniq -d > temp1_ids
$ sort multiple_ids | uniq -u > temp2_ids
$ diff temp*
1,7c1,11
< >Contig_13471
< >Contig_1947
< >Contig_247
< >Contig_447
< >Contig_476
< >Contig_4764
< >Contig_4767
---
> >Contig_10051
> >Contig_1651
> >Contig_4851
> >Contig_5141
> >Contig_5143
> >Contig_5164
> >Contig_5170
> >Contig_5188
> >Contig_6351
> >Contig_9651
> >Contig_9851
Let's remove our temp files again and make a "clean_ids" file:
$ rm temp*
$ sort multiple_ids | uniq > clean_ids
$ wc -l multiple_ids clean_ids
25 multiple_ids
18 clean_ids
43 total
Those numbers make sense: the seven "47" IDs appear twice (14 lines) plus the eleven "51" IDs makes 25 lines, and collapsing the duplicates leaves 7 + 11 = 18 unique IDs.
We can use "sed" to alter the IDs. The "s//" command say to "substitute" the first thing with the second thing, e.g., to replace all occurences of "foo" with "bar", use "s/foo/bar" (http://stackoverflow.com/questions/4868904/what-is-the-origin-of-foo-and-bar).
$ sed 's/C/c/' clean_ids
$ sed 's/_/./' clean_ids
$ sed 's/>//' clean_ids > newclean_ids
That last one removes the ">" that marks the beginning of a FASTA header but is not actually part of the ID. We can use this with "seqmagick" now to extract those sequences and find out how many were found:
$ seqmagick convert --include-from-file newclean_ids group12_contigs.fasta newgroup12_contigs.fasta
$ seqmagick info !$
seqmagick info newgroup12_contigs.fasta
name alignment min_len max_len avg_len num_seqs
newgroup12_contigs.fasta FALSE 5587 30751 16768.14 7
We can get stats on all our files:
$ seqmagick info *fasta > fasta-info
$ cat !$
name alignment min_len max_len avg_len num_seqs
group12_contigs.fasta FALSE 5136 116409 22974.30 132
group20_contigs.fasta FALSE 5029 22601 7624.38 203
group24_contigs.fasta FALSE 5024 81329 12115.70 139
newgroup12_contigs.fasta FALSE 5587 30751 16768.14 7
We can use "cut" to view various columns:
$ cut -f 2 fasta-info
$ cut -f 2,4 fasta-info
$ cut -f 2-4 fasta-info
But it does not line up very nicely. We can use "column" to fix this:
$ cut -f 2-4 fasta-info | column -t
alignment min_len max_len
FALSE 5136 116409
FALSE 5029 22601
FALSE 5024 81329
FALSE 5587 30751