In-Depth Review, Part 1 of 5: “Beginning Perl for Bioinformatics” by James Tisdall

As a specialized field, Bioinformatics is rather young. It can be difficult to find universities which teach bioinformatics. Bioinformatics can refer to many different types of tasks — from using programs and data without any computer science knowledge, to implementing database or web software, to writing data conversion programs which modify file formats between database storage methods, to writing algorithms for modeling and visualizing research problems. Most of the work is described best as “computational biology”.

In the context of Perl (the famous computer language which runs underneath most web pages), Bioinformatics means computing text data retreived from biological databases.

The book, Beginning Perl for Bioinformatics, by James Tisdall, is for learning introductory software techniques in Perl, with a very brief biology review. For biologists who have rarely programmed and need a starting language or need to learn Perl, this is a good place to start. For technologists, note the copyright date on the book, to see how dated the information may be; since bioinformatics is still a young field, standards and technology are evolving rapidly.

Tisdall: “A large part of what you, the Perl bioinformatics programmer, will spend your time doing amounts to variations on the same theme as Examples 4-1 and 4-2. You’ll get some data, be it DNA, proteins, GenBank entries, or what have you; you’ll manipulate the data; and you’ll print out some results.” (Chapter 4)

For software engineers or computer programmers, the biology field is also a completely new realm which is tough to get a handle on, and has it’s own language: Biology as a field (at least to me) has not yet differentiated itself between “soft, life science” and an engineering science. For example, as a software engineer, the most basic software question is, “I need to write a look-up table for these elements, what are the all the possible strings for the field values?” Yet this simple question can be very difficult to answer by consulting a biology textbook. It is important to keep in mind that data manipulation for biology can involve massive amounts of information: also known as, very, very large strings; the strings represent DNA sequences which may range in practical usage from 10k to 100k.

Perl Bioinformatics Introductory Examples

The author states,

Tisdall: “How do you start from scratch and come up with a program that counts the regulatory elements in some DNA? Read on.”

In chapter 4, there are the first simple Perl examples: convert the DNA sequence to the corresponding RNA sequence. In biology, the DNA uses A, T, G, C (representing the chemical names, of course); whereas RNA uses U instead of T. Simple string manipulation provides the answer: s/T/U/g;

The next example is another simple biological problem with practical database-lookup application. Given one strand of DNA (say, the 5′ to 3′ direction), compute the corresponding strand. In later database lookup operations, both may be used to find a match. DNA uses matching pairs of bonds, and is written in reverse (3′ to 5′ direction), so this is a straightforward computational problem. The second strand is reversed and translated: $strand2 = reverse $_; $strand2 =~ tr/ACGT/TGCA/; print “$_\n$strand2\n”;

A few other simple exercises include writing the DNA given the RNA (which is the string operation substituting U to T, s/U/T/g;) and translating between uppercase and lowercase for strings (using tr/// or \U and \L).

The author introduces some simple Perl file & data operations using a protein; the proteins use different character encoding to represent the physical structure. As an example encoding, the human zinc finger protein NM_021964 is given:

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

Motifs

The author states,

Tisdall: “One of the most common things we do in bioinformatics is to look for motifs, short segments of DNA or protein that are of particular interest.”

Although as of this writing, I haven’t come across “motif” before. This is a simple example of how the book proves useful: while biologists have probably heard this term in an undergrad life science class, and biology problems might be well understood, for myself, as a technologist, it’s a new term to me, in the complex language of biology. In the software engineering realm, I can now equate “motif” to: “a substring of data” and program a meaningful line of code.

Exercise: Attempt to find the motif SVLQ in the previous human protein and print the resulting match:

# Solution by JonathanCline
if ($protein =~ /$motif/) { print "Found $motif.\n"; } else { print "Not found: $motif.\n"; }

The author uses the “slurp a file” Perl method to read the protein data from a file into an array @protein and/or to a whitespace-filtered string $protein. So the above can be re-written as a loop:

# Solution by JonathanCline
open(IN, "<" . $ARGV[0]);
@lines = <IN>;
close(IN);
$protein = join("", @lines);
$protein =~ s/[\s\W]//g;

The author offers examples of other common bioinformatics operations, as practical examples:

Tisdall: “There are many things you might want to know about a piece of DNA. Is it coding or noncoding? Does it contain a regulatory element? Is it related to some other known DNA, and if so, how? How many of each of the four nucleotides does the DNA contain? In fact, in some species the coding regions have a specific nucleotide bias, so this last question can be important in finding the genes. Also, different species have different patterns of nucleotide usage. So counting nucleotides can be interesting and useful.”

As an aspiring bioinformaticist, take a moment to ponder those questions.. which are easy computational problems if translated into computer science terminology. Nucleotide bias can be found by counting characters in the string. Perl provides many mechanisms for counting. The author implements a “for..” loop and counts each A, C, G, or T encountered, which contains logic and would likely run a little slow. He uses this as an opportunity to explain logic and looping in Perl.

However, Perl has powerful methods (meaning: computationally efficient) for instantly returning count, and these can be explored as well: Perl is fastest at string operations, so one solution is to replace each DNA character and count the resulting string length (my Solution A). Perl offers a fast pack/unpack operation, so another solution is to treat the DNA string as a data encoding, unpack the DNA as “compressed data”, and count the number of bytes (nucleotides) unpacked (my Solution B). Perl can also count replacements directly (my Solution C). And, Perl has a built-in count for array length (my Solution D). I’m sure there are more methods (because Perl excels at offering multiple solutions to any given problem), though these are a few quick ideas after ten minutes of working at it.

Thus, my Solution C is considerably more compact (and fast):

# Solution by JonathanCline
$dna =~ s/[\s\W]//g;   # $dna is "acgtacgtacgt" for example.
$a = ($dna =~ s/A//ig);   # replace all 'a' and return number of replacements.
$c = ($dna =~ s/C//ig);
$g = ($dna =~ s/G//ig);
$t = ($dna =~ s/T//ig);
print "A\tC\tG\tT\n";
print "$a\t$c\t$g\t$t\n";

My Solution A is a little longer and likely a bit slower (but probably still faster than employing logic, as in the author’s example):

# Solution by JonathanCline
$dna =~ s/[\s\W]//g;
$a = $c = $g = $t = $dna;  # make temporary copies; might use a lot of memory.
$a =~ s/[cgt]//ig;  # replace all non-'a' with null, leaving only 'a'
$c =~ s/[agt]//ig;
$g =~ s/[act]//ig;
$t =~ s/[acg]//ig;
print "A\tC\tG\tT\n";
print length($a)."\t".length($c)."\t".length($g)."\t".length($t)."\n";

My Solution D failed, because using split( ) on the DNA string using the nucleotides as seperators in the string also created null field entries (meaning, @a = split(/[cgt]/, $dna) will include nulls, to be equal to (‘a’, ”, ”, ‘a’, ”, …) ), thus the count of the array wasn’t immediately useful, since a correct result would mean using join( ) and length( ) as in Solution A (or pack), with more complexity in the middle.

The author later discusses a similar solution to those I wrote above, using tr/// :


# Solution by Tisdall
$a = ($dna =~ tr/Aa//);
$c = ($dna =~ tr/Cc//);
$g = ($dna =~ tr/Gg//);
$t = ($dna =~ tr/Tt//);


After several sections on how to write subroutines, how to debug, how to process command line arguments, etc, the author offers several more potential exercises.  He poses the following:
Tisdall: “Write a module that contains subroutines that report various statistics on DNA sequences, for instance length, GC content, presence or absence of poly-T sequences (long stretches of mostly T’s at the 5′ (left) end of many $DNA sequences), or other measures of interest.”
As above, the lengths and GC content have already been covered.  Poly-T sequences are a new concept so far: Identify strings of T’s, ranging in length from 5 characters to 12 characters, and so on, in a DNA string.  Specific sequences and rare repeated patterns are important for DNA data analysis.
Mutations and DNA Strand Comparison
Chapter 7 introduces methods for mutating DNA strings using the random function and generating DNA.
Let’s generate some random DNA:
# Solution by JonathanCline
srand(time|$$);
@nt = ('a', 'c', 'g', 't');
$nb = 20;
for (1 .. $nb) {
  push(@dna, $nt[rand(4)]);
}
print join("", @dna)."\n";
Now let’s create a random mutation within that DNA by inserting and/or replacing a nucleotide.  These create different types of errors in the DNA.  However, if mutation is definitely desired (and not just random chance), the code must also check to see that a new, randomly chosen base results in a different base than the current one.
#!/usr/bin/perl
# Solution by JonathanCline
&Main;
exit;

sub mutate_insert {
  my $dna = $_[0];

  my $pos = rand(length($dna));
  printf "insert mutation at index %d\n", $pos;
  $new = $nt[rand(4)];
  my $mutate = substr($dna, 0, $pos);
  $mutate .= $new;
  $mutate .= substr($dna, $pos, length($dna));
  return $mutate;
}
sub mutate_replace {
  my $dna = $_[0];

  my $pos = rand(length($dna));
  printf "replace mutation at index %d\n", $pos;
  my $current = substr($dna, $pos, 1);
  do {
    $new = $nt[rand(4)];
  } while ($current eq $new);
  substr($dna, $pos, 1, $new);
  return $dna;
}

sub Main {
  srand(time|$$);
  @nt = ('a', 'c', 'g', 't');

  $nb = 6;  # Create some DNA with this number of nucleotides
  @dna = ( );
  for (1 .. $nb) {
    push(@dna, $nt[rand(4)]);
  }
  $DNA = join("", @dna);
  print "$DNA\n";
  print mutate_insert($DNA)."\n";
  print mutate_replace($DNA)."\n";
  return;
}
__END__
Analysis of DNA Strands
The next suggested step is to generate random DNA sequences (a randomized string of nucleotide characters) between a randomized minimum and maximum length, and provide analysis.
Tisdall: “You’ll collect some statistics on DNA in order to answer the question: on average, what percentage of bases are the same between two random DNA sequences? Although some simple mathematics can answer the question for you, the point of the program is to show that you now have the necessary programming ability to ask and answer questions about your DNA sequences. (If you were using real DNA, say a collection of some particular gene as it appears in several organisms in slightly different forms, the answer would be somewhat more interesting. You may want to try that later.) So let’s generate a set of random DNA, all the same length, then ask the following question about the set. What’s the average percentage of positions that are the same between pairs of DNA sequences in this set?“
The answer to this must be 25%.  However, given a small set of DNA strands, it could be lower.  My solution is below.  It differs from the author’s.  In fact, if I run it to compare either 100 or 1,000 strands of randomized DNA between themselves, the result is very close to 25% (“1000 strands are 24.990060 % average of alike.”).
#!/usr/bin/perl
# Solution by JonathanCline
use strict;
srand(time|$$);
Main(@ARGV);
exit;

sub rand_nt {
  my @nt = ('a', 't', 'c', 'g');
  return $nt[rand(4)];
}

sub rand_dna {
  my $len = $_[0];
  my $dna;
  while ($len-- > 0) {
    $dna .= rand_nt();
  }
  return $dna;
}

sub calc_same_pos {
  my $dna1 = $_[0];
  my $dna2 = $_[1];
  my $len = length($dna1);
  my $pos;
  my $count;
  if ($len != length($dna2)) { warn 'Unequal length strands'; }
  for $pos (0 .. $len-1) {
    if (substr($dna1, $pos, 1) eq substr($dna2, $pos, 1)) {
        # print "  match '".substr($dna1, $pos, 1)."' at $pos\n";
        $count++;
    }
  }
  # print "$dna1 vs $dna2 -> $count match\n";
  return ($count / $len);
}

sub Main {
  my $min = $_[0] || 10;
  my $max = $_[1] || 20;
  my $numstrands = $_[2] || 10;
  my $dna;
  my $len;
  my @strands;
  my $s2;
  if ($max < $min) { $max = $min * 1.5; }
  for (0 .. $numstrands-1) {
    if ($max != $min) { do { $len = rand($max); } until ($len >= $min); }
    else { $len = $min; }
    $dna = rand_dna($len);
    print "$dna\n";
    push(@strands, $dna);
  }
  for my $strand1 (0 .. $numstrands-1) {
    for my $strand2 (0 .. $numstrands-1) {
      next if ($strand1 == $strand2);
      my $s1 = calc_same_pos($strands[$strand1], $strands[$strand2]);
      # printf "Strands %d, %d are %d %% alike.\n", $strand1, $strand2, $s1*100;
      $s2 = $s2 + $s1;
    }
  }
  printf "$numstrands strands are %f %% average of alike.\n", ($s2 / $numstrands / $numstrands) * 100;
}
__END__
Warning!  Readers are Taught a Poor Programming Practice!
The significant software design error in the author’s code is also a big mistake many computer scientists make.  In order to force the random nucleotide to be different than the current nucleotide, the author uses a loop around his function:
# Solution by Tisdall
sub randomelement {
    my(@array) = @_;
    return $array[rand @array];
}

sub randomnucleotide {
    my(@nucleotides) = ('A', 'C', 'G', 'T');
    return randomelement(@nucleotides);
}

...

do {
$newbase = randomnucleotide(@nucleotides);

# Make sure it's different than the nucleotide we're mutating
} until ( $newbase ne substr($dna, $position,1) );
This is a poor design choice for the following reason.

Always avoid repetitive looping, if at all possible, especially around system calls.  This can also be stated as:  “Always avoid busy loops.”  This is a very frequent mistake for beginning and intermediate programmers.

Notice the method of generating a random nucleotide: it repetitively calls a system call, until the return value is the desired value.  This generates significant system overhead (meaning, it wastes CPU power, and thus takes longer to compute).  A small change in the code can completely eliminate this loop.
(I include a better method in my solutions in Part 2 of this review.)
Generation of Weighted Nucleotide DNA Strands
The obvious question is: how to generate DNA strands with a given nucleotide weighting.  In prior examples, each base (a, c, t, g) was chosen with equal probabilities, so on average, each is represented 25% of the time.  However, most real DNA has weighting.  The author poses this as a more difficult exercise,
Tisdall: “Sometimes not all choices are will be picked in a random selection. Write a subroutine that randomly returns a nucleotide, in which the probability of each nucleotide can be specified. Pass the subroutine four numbers as arguments, representing the probabilities of each nucleotide; if each probability is 0.25, the subroutine is equally likely to pick each nucleotide. As error checking, have the subroutine ensure that the sum of the four probabilities is 1. Hint: one way to accomplish this is to divide the range between 0 and 1 into four intervals with lengths corresponding to the probability of the respective nucleotides. Then, simply pick a random number between 0 and 1, see in which interval it falls, and return the corresponding nucleotide.” 

 
 
 

James Tisdall’s book provides a useful starting introduction to computational biology problems which can be solved in Perl.  For an intermediate or advanced Perl programmer, the entire book can be worked through & the exercises programmed in a single day.   Biologists who don’t know how to program or don’t know how to program in Perl, might take a couple days to a week or more to completely absorb this book.  Interested Perl users should also check out BioPerl which uses object-oriented Perl to implement a large library of programs and methods for computational biology.
I’m sure that biologists will enjoy Perl, and Perl is well suited towards quickly solving the string-related problems found in biological data.  Biology experiments are typically time consuming and finding a result might take hours, days or weeks, whereas, in contrast, writing Perl is very fast and iterative.
There are many topics in bioinformatics which are not covered in Tisdall’s book.  However, as an introduction, it’s worth a read.
Look for the rest of the detailed review in “Part 2 of 5”.

88 Proof Synth Bio Blog

Genetically Engineered Organisms, Systems Biology, and Synthetic Biology from an Engineer's Viewpoint