Sequence similarity
As our field deals with the sequences of unknown communities of organisms, we are very interested in comparing our samples to known sequences. There are many programs and algorithms for comparing strings, from the venerable BLAST (Basic Local Alignment Search Tool) to hashing to kmers.
K-mers
A "k-mer" is a k-length mer (from Greek meros "part") of contiguous sequence as in the word "polymer." So the given sequence has 3-mers like so:
AGCTTTTC
AGC
GCT
CTT
TTT
TTT
TTC
Here's a simple Perl script to get kmers from a sequence:
$ cat -n kmer1.pl6
1 #!/usr/bin/env perl6
2
3 sub MAIN (Str $input!, Int :$k=10) {
4 if $k < 0 {
5 note "k ($k) must be positive\n";
6 exit 1;
7 }
8
9 my $seq = $input.IO.f ?? $input.IO.slurp.chomp !! $input;
10 my $n = $seq.chars - $k + 1;
11
12 if $n < 1 {
13 note "Cannot extract {$k}-mers from seq length {$seq.chars}";
14 exit 1;
15 }
16
17 for 0..^$n -> $i {
18 put $seq.substr($i, $k);
19 }
20 }
In lines 4-7, we first want to ensure that we have a positive value for k, so we need to explore this idea briefly. Another way to write this would be to move that condition into the signature:
$ cat -n pos1.pl6
1 #!/usr/bin/env perl6
2
3 sub MAIN (Int :$k where * > 0 = 10) {
4 put "k = $k";
5 }
$ ./pos1.pl6 -k=5
k = 5
$ ./pos1.pl6 -k=-1
Usage:
./pos1.pl6 [-k=<Int>]
Because that sort of thing is so useful, Perl let's us create our own types as a subset
of an existing type. I'll arbitrarily limit k to a positive integer less than 50:
$ cat -n pos2.pl6
1 #!/usr/bin/env perl6
2
3 subset SmallPosInt of Int where 0 < * < 50 ;
4 sub MAIN (SmallPosInt :$k=10) {
5 put "k = $k";
6 }
$ ./pos2.pl6 -k=5
k = 5
$ ./pos2.pl6 -k=-1
Usage:
./pos2.pl6 [-k=<Int>]
$ ./pos2.pl6 -k=51
Usage:
./pos2.pl6 [-k=<Int>]
As it happens, there is a built-in type that will constrain us to a positive integer called UInt
for "unsigned integer":
$ cat -n pos3.pl6
1 #!/usr/bin/env perl6
2
3 sub MAIN (UInt :$k=10) {
4 put "k = $k";
5 }
$ ./pos3.pl6 -k=5
k = 5
$ ./pos3.pl6 -k=-1
Usage:
./pos3.pl6 [-k=<Int>]
Now we can shorten our code and get type checking for free:
$ cat -n kmer2.pl6
1 #!/usr/bin/env perl6
2
3 sub MAIN (Str $input!, UInt :$k=10) {
4 my $seq = $input.IO.f ?? $input.IO.slurp.chomp !! $input;
5 my $n = $seq.chars - $k + 1;
6
7 if $n < 1 {
8 note "Cannot extract {$k}-mers from seq length {$seq.chars}";
9 exit 1;
10 }
11
12 for 0..^$n -> $i {
13 put $seq.substr($i, $k);
14 }
15 }
$ ./kmer2.pl6 -k=20 input.txt
AGCTTTTCATTCTGACTGCA
GCTTTTCATTCTGACTGCAA
CTTTTCATTCTGACTGCAAC
TTTTCATTCTGACTGCAACG
TTTCATTCTGACTGCAACGG
TTCATTCTGACTGCAACGGG
At line 5, use a simple formula to determine the number of k-mers we can extract from the given sequence. At lines 7-10, we decide if we can continue based on the input from the user. Lines 12-14 should look somewhat familiar by now. Since we're going to use the zero-based substr
to extract part of the sequence, we need to use the "..^" range operator to go up to but not including our value for $n
. After that, we just put
the extracted k-mer.
Here's a slightly shorter version that again uses the map
function (https://docs.perl6.org/routine/map) to eschew a for
loop:
$ cat -n kmer3.pl6
1 #!/usr/bin/env perl6
2
3 sub MAIN (Str $input!, UInt :$k=10) {
4 my $seq = $input.IO.f ?? $input.IO.slurp.chomp !! $input;
5 my $n = $seq.chars - $k + 1;
6
7 if $n < 1 {
8 note "Cannot extract {$k}-mers from seq length {$seq.chars}";
9 exit 1;
10 }
11
12 put join "\n", map { $seq.substr($_, $k) }, 0..^$n;
13 }
It's worth spending time to really learn map
as it's a very safe and efficient function. It takes two arguments:
- a code block to run
- a list of things
The code block is applied to each element in the list coming in from the right and is emitted on the left:
new <- map { code } <- original
It's important to remember that a map
will always return the same number of elements it was given as opposed to a grep
which looks very similar:
> map *.uc, <foo bar baz>
(FOO BAR BAZ)
> grep /ba/, <foo bar baz>
(bar baz)
So the "list" argument to map
in line 12 is the range 0..^$n
. Each number in turn goes into the code block as the "topic" $_
where it is used in the substr
call to extract the k-mer which are then returned as a new list to the join
which then puts a newline between them before the call to put
.
I'll present one more version that puts the kmers into an array using gather/take
(https://docs.perl6.org/language/control#gather/take):
$ cat -n kmer4.pl6
1 #!/usr/bin/env perl6
2
3 sub MAIN (Str $input!, Int :$k=10) {
4 my $seq = $input.IO.f ?? $input.IO.slurp.chomp !! $input;
5 my $n = $seq.chars - $k + 1;
6 my @kmers = gather for 0..^$n -> $i {
7 take $seq.substr($i, $k);
8 }
9
10 dd @kmers;
11 }
The dd
is the built-in "data dumper" that shows you a textual representation of a data structure:
$ ./kmer4.pl6 input.txt
Array @kmers = ["AGCTTTTCAT", "GCTTTTCATT",
"CTTTTCATTC", "TTTTCATTCT", "TTTCATTCTG",
"TTCATTCTGA", "TCATTCTGAC", "CATTCTGACT",
"ATTCTGACTG", "TTCTGACTGC", "TCTGACTGCA",
"CTGACTGCAA", "TGACTGCAAC", "GACTGCAACG",
"ACTGCAACGG", "CTGCAACGGG"]
Kmers from FASTA
Now let's combine our parsing of FASTA with extraction of k-mers:
$ cat -n fasta-kmer1.pl6
1 #!/usr/bin/env perl6
2
3 use Bio::SeqIO;
4
5 sub MAIN (Str $file!, UInt :$k=10) {
6 die "Not a file ($file)" unless $file.IO.f;
7 my $seqIO = Bio::SeqIO.new(format => 'fasta', file => $file);
8
9 while (my $seq = $seqIO.next-Seq) {
10 put join "\n", get-kmers($k, $seq.seq);
11 }
12 }
13
14 sub get-kmers(Int $k, Str $str) {
15 my $n = $str.chars - $k + 1;
16 map { $str.substr($_, $k) }, 0..^$n;
17 }
I copied the FASTA script from the previous chapter and introduced a function called get-kmers
. You've already been writing functions, of course, like MAIN
and calling functions like join
and put
, but now we're calling a function we've written. We define that it takes two positional arguments, an integer $k
and a string $string
. Right away the type checking caught an error as I tried to pass in the $seq
object and not the $seq.seq
string of the sequence.
Just like with our MAIN
, we can change our positional arguments into named arguments by placing a :
before the names (line 14 below):
$ cat -n fasta-kmer2.pl6
1 #!/usr/bin/env perl6
2
3 use Bio::SeqIO;
4
5 sub MAIN (Str $file! where *.IO.f, UInt :$k=10) {
6 my $seqIO = Bio::SeqIO.new(format => 'fasta', file => $file);
7
8 while (my $seq = $seqIO.next-Seq) {
9 put join "\n", get-kmers(str => $seq.seq, k => $k);
10 }
11 }
12
13 sub get-kmers(Int :$k, Str :$str) {
14 my $n = $str.chars - $k + 1;
15 map { $str.substr($_, $k) }, 0..^$n;
16 }
This means we have to call get-kmers
with Pairs for arguments (line 10), but it also means the arguments can be specified in any order. Generally, if a function takes more than 2 or 3 arguments, I would recommend using named arguments.
Something else you might have noticed in the above version is that I moved the check of the file-ness of $input
into a constraint in the MAIN
signature. If you needed this contraint more than once, it would behoove you to create a new subset
.
Here is another version that again dips into the function programming world and uses a new string function called rotor
(http://perl6.party/post/Perl-6-.rotor-The-King-of-List-Manipulation):
$ cat -n fasta-kmer3.pl6
1 #!/usr/bin/env perl6
2
3 use Bio::SeqIO;
4
5 sub MAIN (Str $file!, UInt :$k=10) {
6 die "Not a file ($file)" unless $file.IO.f;
7 my $seqIO = Bio::SeqIO.new(format => 'fasta', file => $file);
8
9 my $j = -1 * ($k - 1);
10 while (my $seq = $seqIO.next-Seq) {
11 put $seq.seq.comb.rotor($k => $j).map(*.join).join("\n");
12 }
13 }
It's probably easiest to understand this by going into the REPL:
> my $s = "ACGTACGT";
ACGTACGT
> $s.comb
(A C G T A C G T)
> $s.comb.rotor: 3
((A C G) (T A C))
> $s.comb.rotor: 3 => -2
((A C G) (C G T) (G T A) (T A C) (A C G) (C G T))
> $s.comb.rotor(3 => -2).map(*.join)
(ACG CGT GTA TAC ACG CGT)
> $s.comb.rotor(3 => -2).map(*.join).join("\n")
ACG
CGT
GTA
TAC
ACG
CGT
We've seen before how we can use comb
to explode a string into a list of characters. Next we call rotor
to break the string into sublists of 3 elements each, but for k-mers we want rotor
to back up two (k - 1) steps before making the next list, so we pass the Pair (3 => -2). We then pass each sublist into a map
where we use the *
to mean "whatever" or "the thing" and call the join
method (with no argument) to create a string. That list of strings then gets join
ed on newlines to print all the k-mers in the sequence.
Non-sequence data
You can use k-mers to determine the similarity of non-sequence data. This program will process all the given files in a pair-wise, all-versus-all fashion to determine if any two files are more similar than the overall similarity of all the files as figured with two-sided Student's t-test. Notice that I can use Unicode characters like "μ" (mu) as variable names:
$ cat -n kmer-counter.pl6
1 #!/usr/bin/env perl6
2
3 subset PosInt of Int where * > 0;
4
5 sub MAIN (PosInt :$k=5, PosInt :$max-sd=5, *@files) {
6 die "No files" unless @files;
7 my @bags = map { find-kmers(+$k, $_) }, @files;
8 my %counts;
9 for ([email protected]).combinations(2) -> ($i, $j) {
10 my $bag1 = @bags[$i-1];
11 my $bag2 = @bags[$j-1];
12 my $s1 = $bag1.Set;
13 my $s2 = $bag2.Set;
14 my @union = ($s1 (&) $s2).keys;
15 my $sum = (map { $bag1{ $_ } }, @union)
16 + (map { $bag2{ $_ } }, @union);
17 %counts{"$i-$j"} = $sum;
18 }
19
20 my @n = %counts.values;
21 my $μ = mean @n;
22 my $sd = std-dev @n;
23 my $d = $sd/(@n.elems).sqrt;
24 for %counts.kv -> $pair, $sum {
25 # https://en.wikipedia.org/wiki/Student%27s_t-test
26 my $t = ($sum - $μ) / $d;
27 if $t.abs > $max-sd {
28 my ($i, $j) = $pair.split('-');
29 my $f1 = @files[$i-1].IO.basename;
30 my $f2 = @files[$j-1].IO.basename;
31 put "$pair ($sum) = $t [$f1, $f2]";
32 }
33 }
34 }
35
36 sub find-kmers (Int $k, Str $file) {
37 my $text = $file.IO.lines.lc.join(' ')
38 .subst(/:i <-[a..z\s]>/, '', :g).subst(/\s+/, ' ');
39 $text.comb.rotor($k => -1 * ($k - 1)).map(*.join).Bag;
40 }
41
42 sub mean (*@n) { @n.sum / @n.elems }
43
44 sub std-dev (*@n) {
45 # https://en.wikipedia.org/wiki/Standard_deviation
46 my $mean = mean(@n);
47 my @dev = map { ($_ - $mean)² }, @n;
48 my $var = @dev.sum / @dev.elems;
49 return $var.sqrt;
50 }
Here is the result of running this program on the homework submissions for a class of mine:
1-9 (74) = -6.36237878762838 [s01.pl6, s09.pl6]
1-4 (60) = -8.97194821224158 [s01.pl6, s04.pl6]
5-6 (136) = 5.19428580708723 [s05.pl6, s06.pl6]
3-7 (208) = 18.6149285622408 [s03.pl6, s07.pl6]
2-4 (76) = -5.98958315554078 [s02.pl6, s04.pl6]
6-8 (142) = 6.31267270335003 [s06.pl6, s08.pl6]
4-10 (74) = -6.36237878762838 [s04.pl6, s10.pl6]
7-8 (178) = 13.0229940809268 [s07.pl6, s08.pl6]
3-10 (208) = 18.6149285622408 [s03.pl6, s10.pl6]
3-8 (142) = 6.31267270335003 [s03.pl6, s08.pl6]
6-9 (78) = -5.61678752345318 [s06.pl6, s09.pl6]
4-5 (66) = -7.85356131597878 [s04.pl6, s05.pl6]
7-10 (198) = 16.7509504018028 [s07.pl6, s10.pl6]
4-6 (68) = -7.48076568389118 [s04.pl6, s06.pl6]
4-9 (72) = -6.73517441971598 [s04.pl6, s09.pl6]
5-8 (148) = 7.43105959961283 [s05.pl6, s08.pl6]
2-9 (80) = -5.24399189136558 [s02.pl6, s09.pl6]
The amount of similarity is fairly high, which is not unexpected from running this on Perl code; however, a few of these are really too high to be explained by chance. The extremely high ones can be isolated:
$ ./kmer-counter.pl6 --max-sd=16 ~/homework/*.pl6
3-7 (208) = 18.6149285622408 [s03.pl6, s07.pl6]
3-10 (208) = 18.6149285622408 [s03.pl6, s10.pl6]
7-10 (198) = 16.7509504018028 [s07.pl6, s10.pl6]
And, indeed, I found that "s03.pl6" was submitted with minor changes by two other students as "s07.pl6" and "s10.pl6" (names changed to protect the innocent, of course).