Sequence similarity

As our field deals with the sequences of unknown communities of organisms, we are very interested in comparing our samples to known sequences. There are many programs and algorithms for comparing strings, from the venerable BLAST (Basic Local Alignment Search Tool) to hashing to kmers.

K-mers

A "k-mer" is a k-length mer (from Greek meros "part") of contiguous sequence as in the word "polymer." So the given sequence has 3-mers like so:

AGCTTTTC
AGC
 GCT
  CTT
   TTT
    TTT
     TTC

Here's a simple Perl script to get kmers from a sequence:

$ cat -n kmer1.pl6
     1     #!/usr/bin/env perl6
     2
     3     sub MAIN (Str $input!, Int :$k=10) {
     4         if $k < 0 {
     5             note "k ($k) must be positive\n";
     6             exit 1;
     7         }
     8
     9         my $seq = $input.IO.f ?? $input.IO.slurp.chomp !! $input;
    10         my $n   = $seq.chars - $k + 1;
    11
    12         if $n < 1 {
    13             note "Cannot extract {$k}-mers from seq length {$seq.chars}";
    14             exit 1;
    15         }
    16
    17         for 0..^$n -> $i {
    18             put $seq.substr($i, $k);
    19         }
    20     }

In lines 4-7, we first want to ensure that we have a positive value for k, so we need to explore this idea briefly. Another way to write this would be to move that condition into the signature:

$ cat -n pos1.pl6
     1     #!/usr/bin/env perl6
     2
     3     sub MAIN (Int :$k where * > 0 = 10) {
     4         put "k = $k";
     5     }
$ ./pos1.pl6 -k=5
k = 5
$ ./pos1.pl6 -k=-1
Usage:
  ./pos1.pl6 [-k=<Int>]

Because that sort of thing is so useful, Perl let's us create our own types as a subset of an existing type. I'll arbitrarily limit k to a positive integer less than 50:

$ cat -n pos2.pl6
     1     #!/usr/bin/env perl6
     2
     3     subset SmallPosInt of Int where 0 < * < 50 ;
     4     sub MAIN (SmallPosInt :$k=10) {
     5         put "k = $k";
     6     }
$ ./pos2.pl6 -k=5
k = 5
$ ./pos2.pl6 -k=-1
Usage:
  ./pos2.pl6 [-k=<Int>]
$ ./pos2.pl6 -k=51
Usage:
  ./pos2.pl6 [-k=<Int>]

As it happens, there is a built-in type that will constrain us to a positive integer called UInt for "unsigned integer":

$ cat -n pos3.pl6
     1     #!/usr/bin/env perl6
     2
     3     sub MAIN (UInt :$k=10) {
     4         put "k = $k";
     5     }
$ ./pos3.pl6 -k=5
k = 5
$ ./pos3.pl6 -k=-1
Usage:
  ./pos3.pl6 [-k=<Int>]

Now we can shorten our code and get type checking for free:

$ cat -n kmer2.pl6
     1     #!/usr/bin/env perl6
     2
     3     sub MAIN (Str $input!, UInt :$k=10) {
     4         my $seq = $input.IO.f ?? $input.IO.slurp.chomp !! $input;
     5         my $n   = $seq.chars - $k + 1;
     6
     7         if $n < 1 {
     8             note "Cannot extract {$k}-mers from seq length {$seq.chars}";
     9             exit 1;
    10         }
    11
    12         for 0..^$n -> $i {
    13             put $seq.substr($i, $k);
    14         }
    15     }
$ ./kmer2.pl6 -k=20 input.txt
AGCTTTTCATTCTGACTGCA
GCTTTTCATTCTGACTGCAA
CTTTTCATTCTGACTGCAAC
TTTTCATTCTGACTGCAACG
TTTCATTCTGACTGCAACGG
TTCATTCTGACTGCAACGGG

At line 5, use a simple formula to determine the number of k-mers we can extract from the given sequence. At lines 7-10, we decide if we can continue based on the input from the user. Lines 12-14 should look somewhat familiar by now. Since we're going to use the zero-based substr to extract part of the sequence, we need to use the "..^" range operator to go up to but not including our value for $n. After that, we just put the extracted k-mer.

Here's a slightly shorter version that again uses the map function (https://docs.perl6.org/routine/map) to eschew a for loop:

$ cat -n kmer3.pl6
     1     #!/usr/bin/env perl6
     2
     3     sub MAIN (Str $input!, UInt :$k=10) {
     4         my $seq = $input.IO.f ?? $input.IO.slurp.chomp !! $input;
     5         my $n   = $seq.chars - $k + 1;
     6
     7         if $n < 1 {
     8             note "Cannot extract {$k}-mers from seq length {$seq.chars}";
     9             exit 1;
    10         }
    11
    12         put join "\n", map { $seq.substr($_, $k) }, 0..^$n;
    13     }

It's worth spending time to really learn map as it's a very safe and efficient function. It takes two arguments:

a code block to run
a list of things

The code block is applied to each element in the list coming in from the right and is emitted on the left:

new <- map { code } <- original

It's important to remember that a map will always return the same number of elements it was given as opposed to a grep which looks very similar:

> map *.uc, <foo bar baz>
(FOO BAR BAZ)
> grep /ba/, <foo bar baz>
(bar baz)

So the "list" argument to map in line 12 is the range 0..^$n. Each number in turn goes into the code block as the "topic" $_ where it is used in the substr call to extract the k-mer which are then returned as a new list to the join which then puts a newline between them before the call to put.

I'll present one more version that puts the kmers into an array using gather/take (https://docs.perl6.org/language/control#gather/take):

$ cat -n kmer4.pl6
     1    #!/usr/bin/env perl6
     2
     3    sub MAIN (Str $input!, Int :$k=10) {
     4        my $seq   = $input.IO.f ?? $input.IO.slurp.chomp !! $input;
     5        my $n     = $seq.chars - $k + 1;
     6        my @kmers = gather for 0..^$n -> $i {
     7            take $seq.substr($i, $k);
     8        }
     9
    10        dd @kmers;
    11    }

The dd is the built-in "data dumper" that shows you a textual representation of a data structure:

$ ./kmer4.pl6 input.txt
Array @kmers = ["AGCTTTTCAT", "GCTTTTCATT",
"CTTTTCATTC", "TTTTCATTCT", "TTTCATTCTG",
"TTCATTCTGA", "TCATTCTGAC", "CATTCTGACT",
"ATTCTGACTG", "TTCTGACTGC", "TCTGACTGCA",
"CTGACTGCAA", "TGACTGCAAC", "GACTGCAACG",
"ACTGCAACGG", "CTGCAACGGG"]

Kmers from FASTA

Now let's combine our parsing of FASTA with extraction of k-mers:

$ cat -n fasta-kmer1.pl6
     1     #!/usr/bin/env perl6
     2
     3     use Bio::SeqIO;
     4
     5     sub MAIN (Str $file!, UInt :$k=10) {
     6         die "Not a file ($file)" unless $file.IO.f;
     7         my $seqIO = Bio::SeqIO.new(format => 'fasta', file => $file);
     8
     9         while (my $seq = $seqIO.next-Seq) {
    10             put join "\n", get-kmers($k, $seq.seq);
    11         }
    12     }
    13
    14     sub get-kmers(Int $k, Str $str) {
    15         my $n = $str.chars - $k + 1;
    16         map { $str.substr($_, $k) }, 0..^$n;
    17     }

I copied the FASTA script from the previous chapter and introduced a function called get-kmers. You've already been writing functions, of course, like MAIN and calling functions like join and put, but now we're calling a function we've written. We define that it takes two positional arguments, an integer $k and a string $string. Right away the type checking caught an error as I tried to pass in the $seq object and not the $seq.seq string of the sequence.

Just like with our MAIN, we can change our positional arguments into named arguments by placing a : before the names (line 14 below):

$ cat -n fasta-kmer2.pl6
     1     #!/usr/bin/env perl6
     2
     3     use Bio::SeqIO;
     4
     5     sub MAIN (Str $file! where *.IO.f, UInt :$k=10) {
     6         my $seqIO = Bio::SeqIO.new(format => 'fasta', file => $file);
     7
     8         while (my $seq = $seqIO.next-Seq) {
     9             put join "\n", get-kmers(str => $seq.seq, k => $k);
    10         }
    11     }
    12
    13     sub get-kmers(Int :$k, Str :$str) {
    14         my $n = $str.chars - $k + 1;
    15         map { $str.substr($_, $k) }, 0..^$n;
    16     }

This means we have to call get-kmers with Pairs for arguments (line 10), but it also means the arguments can be specified in any order. Generally, if a function takes more than 2 or 3 arguments, I would recommend using named arguments.

Something else you might have noticed in the above version is that I moved the check of the file-ness of $input into a constraint in the MAIN signature. If you needed this contraint more than once, it would behoove you to create a new subset.

Here is another version that again dips into the function programming world and uses a new string function called rotor (http://perl6.party/post/Perl-6-.rotor-The-King-of-List-Manipulation):

$ cat -n fasta-kmer3.pl6
     1     #!/usr/bin/env perl6
     2
     3     use Bio::SeqIO;
     4
     5     sub MAIN (Str $file!, UInt :$k=10) {
     6         die "Not a file ($file)" unless $file.IO.f;
     7         my $seqIO = Bio::SeqIO.new(format => 'fasta', file => $file);
     8
     9         my $j = -1 * ($k - 1);
    10         while (my $seq = $seqIO.next-Seq) {
    11             put $seq.seq.comb.rotor($k => $j).map(*.join).join("\n");
    12         }
    13     }

It's probably easiest to understand this by going into the REPL:

> my $s = "ACGTACGT";
ACGTACGT
> $s.comb
(A C G T A C G T)
> $s.comb.rotor: 3
((A C G) (T A C))
> $s.comb.rotor: 3 => -2
((A C G) (C G T) (G T A) (T A C) (A C G) (C G T))
> $s.comb.rotor(3 => -2).map(*.join)
(ACG CGT GTA TAC ACG CGT)
> $s.comb.rotor(3 => -2).map(*.join).join("\n")
ACG
CGT
GTA
TAC
ACG
CGT

We've seen before how we can use comb to explode a string into a list of characters. Next we call rotor to break the string into sublists of 3 elements each, but for k-mers we want rotor to back up two (k - 1) steps before making the next list, so we pass the Pair (3 => -2). We then pass each sublist into a map where we use the * to mean "whatever" or "the thing" and call the join method (with no argument) to create a string. That list of strings then gets joined on newlines to print all the k-mers in the sequence.

Non-sequence data

You can use k-mers to determine the similarity of non-sequence data. This program will process all the given files in a pair-wise, all-versus-all fashion to determine if any two files are more similar than the overall similarity of all the files as figured with two-sided Student's t-test. Notice that I can use Unicode characters like "μ" (mu) as variable names:

$ cat -n kmer-counter.pl6
     1    #!/usr/bin/env perl6
     2    
     3    subset PosInt of Int where * > 0;
     4    
     5    sub MAIN (PosInt :$k=5, PosInt :$max-sd=5, *@files) {
     6        die "No files" unless @files;
     7        my @bags = map { find-kmers(+$k, $_) }, @files;
     8        my %counts;
     9        for ([email protected]).combinations(2) -> ($i, $j) {
    10            my $bag1  = @bags[$i-1];
    11            my $bag2  = @bags[$j-1];
    12            my $s1    = $bag1.Set;
    13            my $s2    = $bag2.Set;
    14            my @union = ($s1 (&) $s2).keys;
    15            my $sum   = (map { $bag1{ $_ } }, @union)
    16                      + (map { $bag2{ $_ } }, @union);
    17            %counts{"$i-$j"} = $sum;
    18        }
    19    
    20        my @n  = %counts.values;
    21        my $μ  = mean @n;
    22        my $sd = std-dev @n;
    23        my $d  = $sd/(@n.elems).sqrt;
    24        for %counts.kv -> $pair, $sum {
    25            # https://en.wikipedia.org/wiki/Student%27s_t-test
    26            my $t = ($sum - $μ) / $d;
    27            if $t.abs > $max-sd {
    28                my ($i, $j) = $pair.split('-');
    29                my $f1 = @files[$i-1].IO.basename;
    30                my $f2 = @files[$j-1].IO.basename;
    31                put "$pair ($sum) = $t [$f1, $f2]";
    32            }
    33        }
    34    }
    35    
    36    sub find-kmers (Int $k, Str $file) {
    37        my $text = $file.IO.lines.lc.join(' ')
    38                   .subst(/:i <-[a..z\s]>/, '', :g).subst(/\s+/, ' ');
    39        $text.comb.rotor($k => -1 * ($k - 1)).map(*.join).Bag;
    40    }
    41    
    42    sub mean (*@n) { @n.sum / @n.elems }
    43    
    44    sub std-dev (*@n) {
    45        # https://en.wikipedia.org/wiki/Standard_deviation
    46        my $mean = mean(@n);
    47        my @dev  = map { ($_ - $mean)² }, @n;
    48        my $var  = @dev.sum / @dev.elems;
    49        return $var.sqrt;
    50    }

Here is the result of running this program on the homework submissions for a class of mine:

1-9 (74) = -6.36237878762838 [s01.pl6, s09.pl6]
1-4 (60) = -8.97194821224158 [s01.pl6, s04.pl6]
5-6 (136) = 5.19428580708723 [s05.pl6, s06.pl6]
3-7 (208) = 18.6149285622408 [s03.pl6, s07.pl6]
2-4 (76) = -5.98958315554078 [s02.pl6, s04.pl6]
6-8 (142) = 6.31267270335003 [s06.pl6, s08.pl6]
4-10 (74) = -6.36237878762838 [s04.pl6, s10.pl6]
7-8 (178) = 13.0229940809268 [s07.pl6, s08.pl6]
3-10 (208) = 18.6149285622408 [s03.pl6, s10.pl6]
3-8 (142) = 6.31267270335003 [s03.pl6, s08.pl6]
6-9 (78) = -5.61678752345318 [s06.pl6, s09.pl6]
4-5 (66) = -7.85356131597878 [s04.pl6, s05.pl6]
7-10 (198) = 16.7509504018028 [s07.pl6, s10.pl6]
4-6 (68) = -7.48076568389118 [s04.pl6, s06.pl6]
4-9 (72) = -6.73517441971598 [s04.pl6, s09.pl6]
5-8 (148) = 7.43105959961283 [s05.pl6, s08.pl6]
2-9 (80) = -5.24399189136558 [s02.pl6, s09.pl6]

The amount of similarity is fairly high, which is not unexpected from running this on Perl code; however, a few of these are really too high to be explained by chance. The extremely high ones can be isolated:

$ ./kmer-counter.pl6 --max-sd=16 ~/homework/*.pl6
3-7 (208) = 18.6149285622408 [s03.pl6, s07.pl6]
3-10 (208) = 18.6149285622408 [s03.pl6, s10.pl6]
7-10 (198) = 16.7509504018028 [s07.pl6, s10.pl6]

And, indeed, I found that "s03.pl6" was submitted with minor changes by two other students as "s07.pl6" and "s10.pl6" (names changed to protect the innocent, of course).

Sequence similarity

Sequence similarity

K-mers

Kmers from FASTA

Non-sequence data

results matching ""

No results matching ""