In-Depth Review, Part 3 of 5: “Beginning Perl for Bioinformatics” by James Tisdall

In my previous write-ups of Part 1 and Part 2, I traced the Perl code and examples in the first half of the book, Beginning Perl for Bioinformatics, by James Tisdall, highlighting different approaches to bioinformatics in Perl. As I mentioned before, Perl provides many different (and often stylistic) methods to solving a software problem. The different methods usually differ in execution speed, code size, code scalability, readability / maintainability, simplicity, and advanced Perl symantics. Since this is a beginning text, the advanced Perl isn’t covered.. that means templates, which could be useful for parsing bioinformatics data, are one of the topics not included here.

Often, the fastest code is the smallest code, and contains subtle code tricks for optimization. This is a perfect setup, because, in Chapter 8, Tisdall starts parsing FASTA files. With Perl’s parsing engine, the subtly of the tricks leaves a lot of room for optimizing software.

FASTA & Sequence Translation

Tisdall offers a software problem based on the FASTA data, so time to solve it:

Tisdall: When you try to print the “raw” sequence data, it can be a problem if the data is much longer than the width of the page. For most practical purposes, 80 characters is about the maximum length you should try to fit across a page. Let’s write a print_sequence subroutine that takes as its arguments some sequence and a line length and prints out the sequence, breaking it up into lines of that length.

Compare his solution to mine:

# Solution by Tisdall
# print_sequence
#
# A subroutine to format and print sequence data

sub print_sequence {

    my($sequence, $length) = @_;

    use strict;
    use warnings;

    # Print sequence in lines of $length
    for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
        print substr($sequence, $pos, $length), "\n";
    }
}

The above is a straightforward, strings-based approach. I chose a regex approach, which took a couple minutes to work out, though should be faster during run-time:

sub dna_print {
  my $str = $_[0];
  do {
    $str =~ s/^([\w]{0,25})//;
    print "$1\n";
  } until (!length($str));
}

The above relies on the following method:

Copy the string to a temporary value, then loop over it’s contents, replacing the first 25 characters with null, then print the match.

My method has the drawback that the line length is fixed in software; I set it to 25. Although it should be possible to create self-modifying code (generate the code as a string, and evaluate the string as code, to recurse through the string data), I didn’t get that to work in under 10 minutes, so it’s a problem for another day.

Tisdall gets mad props for publishing the run-time output of his example scripts, allowing readers to diff custom solution output against his output. My output matches his, so my implementation should be bug-free.

His next problem is to print the translated proteins instead of printing the DNA. Refer to the central dogma. The script should read the FASTA DNA, translate, and print the resultant protein. My entire solution is below. Write your own & compare.

# Solution by JonathanCline
#!/usr/bin/perl

Main(@ARGV);
exit;

sub codon2aa {
  my $codon = $_[0];
  $codon =~ tr/a-z/A-Z/;
  if (defined $TRANSLATE{$codon}) {
    return $TRANSLATE{$codon};
  }
  else {
    warn "can't translate ".$codon;
  }
  return;
}

sub fasta_read {
  my $filenamein = $_[0];
  open (IN, "<".$filenamein) || die "cant open $filenamein";
  $_ = ;
  chomp;
  if (/^>(.*)/) {
    my $header = $_;

  }
  else {
    close (IN);
    die "not fasta format: $filenamein";
  }
  my $seq;
  while () {
    chomp;
    next if /^\s*$/;
    next if /^\s*#/;
    s/\s*//g;
    $seq .= $_;
  }
  return $seq;
}

sub dna_print {
  my $str = $_[0];
  do {
    $str =~ s/^([\w]{0,25})//;
    print "$1\n";
  } until (!length($str));
}

sub Main {
  # Read translation data from __DATA__
  while () {
    chomp;
    last if (/__END__/);
    if (/^([\w_]+)\s([\w_]+)/) {
      $TRANSLATE{$1} = $2;
    }
    else { die "parse error $_"; }
  }

  my $codon;
  my $protein;
  my $dna = fasta_read("dna5.fasta.txt");

# Translate each three-base codon into an amino acid, and append to a protein
  for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
      $codon = substr($dna,$i,3);
      $protein .= codon2aa($codon);
  }

  print "DNA:\n";
  dna_print($dna);
  print "Protein:\n";
  dna_print($protein);

  return;
}

__DATA__
TCA S # Serine
TCC S # Serine
TCG S # Serine
TCT S # Serine
TTC F # Phenylalanine
TTT F # Phenylalanine
TTA L # Leucine
TTG L # Leucine
TAC Y # Tyrosine
TAT Y # Tyrosine
TAA _ # Stop
TAG _ # Stop
TGC C # Cysteine
TGT C # Cysteine
TGA _ # Stop
TGG W # Tryptophan
CTA L # Leucine
CTC L # Leucine
CTG L # Leucine
CTT L # Leucine
CCA P # Proline
CCC P # Proline
CCG P # Proline
CCT P # Proline
CAC H # Histidine
CAT H # Histidine
CAA Q # Glutamine
CAG Q # Glutamine
CGA R # Arginine
CGC R # Arginine
CGG R # Arginine
CGT R # Arginine
ATA I # Isoleucine
ATC I # Isoleucine
ATT I # Isoleucine
ATG M # Methionine
ACA T # Threonine
ACC T # Threonine
ACG T # Threonine
ACT T # Threonine
AAC N # Asparagine
AAT N # Asparagine
AAA K # Lysine
AAG K # Lysine
AGC S # Serine
AGT S # Serine
AGA R # Arginine
AGG R # Arginine
GTA V # Valine
GTC V # Valine
GTG V # Valine
GTT V # Valine
GCA A # Alanine
GCC A # Alanine
GCG A # Alanine
GCT A # Alanine
GAC D # Aspartic Acid
GAT D # Aspartic Acid
GAA E # Glutamic Acid
GAG E # Glutamic Acid
GGA G # Glycine
GGC G # Glycine
GGG G # Glycine
GGT G # Glycine
__END__

The input file:

> sample dna | (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta

The script output matches the solution. (Generally, a good thing.)

# Output
DNA:
agatggcggcgctgaggggtcttgg
gggctctaggccggccacctactgg
tttgcagcggagacgacgcatgggg
cctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagt
agttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgc
aggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcgga
gatatggagggagatggttcagacc
cagagcctccagatgccggggagga
cagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgca
aaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtgg
ttccatggggactgcatccggatca
ctgagaagatggccaaggccatccg
ggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattc
gctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagc
agtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatcc
agacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttg
ctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggcc
acacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagc
ccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtc
actgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatc
cggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgta
caagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgc
caaggccccgccggccactgcccac
ccaacagcagccacagccatcacag
aagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagt
caaggagcctcctgaggctacagcc
acacctgagccactctcagatgagg
accta
Protein:
RWRR_GVLGALGRPPTGLQRRRRMG
PAQ_EYAAWEA_LEAEVVVGAFATA
WDAAEWSVQVRGSLAGVVRECAGSG
DMEGDGSDPEPPDAGEDSKSENGEN
APIYCICRKPDINCFMIGCDNCNEW
FHGDCIRITEKMAKAIREWYCRECR
EKDPKLEIRYRHKKSRERDGNERDS
SEPRDEGGGRKRPVPDPDLQRRAGS
GTGVGAMLARGSASPHKSSPQPLVA
TPSQHHQQQQQQIKRSARMCGECEA
CRRTEDCGHCDFCRDMKKFGGPNKI
RQKCRLRQCQLRARESYKYFPSSLS
PVTPSESLPRPRRPLPTQQQPQPSQ
KLGRIREDEGAVASSTVKEPPEATA
TPEPLSDEDL

A very important simplification to these problems has been made: the reading frame is assumed to be aligned at the start of the data. Reference wikipedia (not the ideal source, I know; though, it’s a freely quotable source):

In biology, a reading frame is a contiguous and non-overlapping set of three-nucleotide codons in DNA or RNA. There are 3 possible reading frames in a mRNA strand and six in a double stranded DNA molecule due to the two strands from which transcription is possible. This leads to the possibility of overlapping genes and there may be many of these in bacteria.^[1] Some viruses e.g. HBV and BYDV use several overlapping genes in different reading frames.

In rare cases a translating ribosome may shift from one frame to another, a translational frameshift. It is distinct from a frameshift mutation as the nucleotide sequence (DNA or RNA) is not altered only the frame in which it is read.

An open reading frame (ORF) is a reading frame that contains a start codon, the subsequent region which usually has a length which is a multiple of 3 nucleotides, and end with a stop codon.

Given the fact that translation might not occur at the start of the sequence data, Tisdall poses the problem of translating all 6 reading frames: 1, 2, and 3 are presumably from the given strand and offset by 0, 1, and 2 characters, respectively; the frames 4, 5, and 6, are presumably from the given strand’s compliment, and offset by 0, 1, and 2 characters, respectively. This means creating the compliment subroutine, and using offsets in the translation. I will use Tisdall’s subroutine names to reduce confusion.. and include the minor modifications made to the original script above. The solution is basically identical to Tisdall’s:

# Solution by JonathanCline

# Reverse compliment DNA strand
#  $_[0] = dna sequence string
#  returns: dna sequence string
sub revcom {
  my $strand1 = $_[0];
  my $strand2;
  $strand2 = reverse $strand1;
  $strand2 =~ tr/ACGTacgt/TGCAtgca/;
  return $strand2;
}

# Translate each three-base codon into an amino acid, and append to a protein
#   $_[0] = dna sequence string
#   $_[1] = offset in string for reading frame
sub dna2peptide {
  my $dna = $_[0];
  my $offset = $_[1];
  my $pos;
  my $codon;
  my $protein;
  for(my $pos = $offset; $pos < (length($dna) - 2) ; $pos += 3) {
      $codon = substr($dna, $pos, 3);
      $protein .= codon2aa($codon);
  }
  return $protein;
}

sub Main {
  # Read translation data from __DATA__
  while () {
    chomp;
    last if (/__END__/);
    if (/^([\w_]+)\s([\w_]+)/) {
      $TRANSLATE{$1} = $2;
    }
    else { die "parse error $_"; }
  }

  my $codon;
  my $protein;
  my $dna = fasta_read("dna5.fasta.txt");

  print "DNA:\n";
  dna_print($dna);
  for $frame (0 .. 2) {
    $protein = dna2peptide($dna, $frame);
    print "Protein (Frame $frame):\n";
    dna_print($protein);
  }

  # Do reverse compliments
  my $dna2 = revcom($dna);
  for $frame (0 .. 2) {
    $protein = dna2peptide($dna2, $frame);
    print "Protein (Frame ".($frame + 4)."):\n";
    dna_print($protein);
  }

  return;
}

That concludes Chapter 8, which means: onto some interesting exercise problems.

Tisdall: Exercise 8.1: Write a subroutine that checks a string and returns true if it’s a DNA sequence. Write another that checks for protein sequence data.

Here’s my solution. Seems to work.

# Solution by JonathanCline
#!/usr/bin/perl

&Main(@ARGV);
exit 0;

sub verify_dna {
  my $dna = $_[0];
  $dna =~ s/[ACGT]+//gix;
  $dna =~ s/\s*//gix;
  if (length($dna) > 0) {
    return false;
  }
  return true;

}

sub Main {
  @data = ;
  if (verify_dna(join("", @data)) eq true) {
    print "DNA!\n";
  }
  else {
    print "not DNA.\n";
  }
  return;
}

I believe this verifies for proteins by changing the main regex line above, to :

$dna =~ s/[SFLY_CWPHQRIMTNKVDEG]+//gix;

The later exercises are more challenging, and I’ll save them for a rainy and highly-paid day:

Tisdall: For each codon, make note of what effect single nucleotide mutations have on the codon: does the same amino acid result, or does the codon now encode a different amino acid? Which one? Write a subroutine that, given a codon, returns a list of all the amino acids that may result from any single mutation in the codon.

Write a subroutine that, given an amino acid, randomly changes it to one of the amino acids calculated in the previous subroutine. Write a program that randomly mutates the amino acids in a protein but restricts the possibilities to those that can occur due to a single mutation in the original codons, as in the previous subroutines. Some codons are more likely than others to occur in random DNA. For instance, there are 6 of the 64 possible codons that code for the amino acid serine, but only 2 of the 64 codes for phenylalanine.

Write a subroutine that, given an amino acid, returns the probability that it’s coded by a randomly generated codon.

Write a program that, given two amino acids, returns the probability that a single mutation in their underlying (but unspecified) codons results in the codon of one amino acid mutating to the codon of the other amino acid.

Restriction Enzymes & REBASE Format Translation

At this point, the author introduces more complex regular expressions, and suggests the reader look at the perlre documentation. I’m sure that looking at the regex docs is bound to scare most biologists, kind of like memorizing taxonomy scares most computer scientists.

First, some biology background. (Quoted from wikipedia)

A restriction enzyme (or restriction endonuclease) is an enzyme that cuts double-stranded DNA at specific recognition nucleotide sequences known as restriction sites. Such enzymes, found in bacteria and archaea, are thought to have evolved to provide a defense mechanism against invading viruses. Inside a bacterial host, the restriction enzymes selectively cut up foreign DNA in a process called restriction; host DNA is methylated by a modification enzyme (a methylase) to protect it from the restriction enzyme’s activity. Collectively, these two processes form the restriction modification system. To cut the DNA, a restriction enzyme makes two incisions, once through each sugar-phosphate backbone (i.e. each strand) of the DNA double helix.

A restriction map is a map of known restriction sites within a sequence of DNA. Restriction mapping requires the use of restriction enzymes. In molecular biology, restriction maps—along with DNA-DNA hybridization, and DNA or RNA sequence analysis—are used to determine the genetic relationships between two or more subjects, often different species, at the molecular level.

For readers from a non-biology background, Tisdall sums this up very simply: restriction enzymes cut the sequence of DNA at specific string sequences: “HindIII cuts at AAGCTT". Thus these could form regular expressions for processing DNA sequence strings. Based on the data at REBASE, http://rebase.neb.com/rebase/rebase.html the author suggests collecting restriction enzyme data for conversion into regular expressions for matching. Of course, this book was written several years ago, so his example is using bionet.104 from 2001, while the current version is

REBASE version 811                                              bionet.811

This is a nice reminder that biological data is being updated rapidly. (In fact, the URL in the book was no longer valid. Use google as necessary.)

Tisdall: The plan is to write a program that includes restriction enzyme data translated into regular expressions, stored as the values of the keys of the restriction enzyme names. DNA sequence data will be used from the file, and the user will be prompted for names of restriction enzymes. The appropriate regular expression will be retrieved from the hash, and we’ll search for all instances of that regular expression, plus their locations. Finally, the list of locations found will be returned.

In his program, Tisdall beats my solution in two nifty ways. My solution is below.

# Solution by JonathanCline
#!/usr/bin/perl

$Rebase_filename = "rebase.bionet.811.txt";
&Main(@ARGV);
exit;

# Read rebase 'bionet' file into memory
sub rebase_read {
  my @data;
  open(FILEIN, "<".$Rebase_filename) || die "cant read file";
  my $start = 0;
  while () {
    if (/^Rich /) { $start = 1; next; }
    next if (!$start);
    next if (/^\s/);
    push(@data, $_);
  }
  close FILEIN;
  return @data;
}

# Parse text lines of rebase 'bionet' file format into hash
#
# These are the IUB ambiguity codes
# (Eur. J. Biochem. 150: 1-5, 1985):
# R = G or A
# Y = C or T
# M = A or C
# K = G or T
# S = G or C
# W = A or T
# B = not A (C or G or T)
# D = not C (A or G or T)
# H = not G (A or C or T)
# V = not T (A or C or G)
# N = A or C or G or T

sub rebase_parse_to_regex {
  my %names;
  my $name;
  my $seq;
  my @lines = @_;
  foreach $line (@lines) {
    $line =~ s/[\n^]//g;   # Remove \n and ^ prior to match
    $name = "";
    # Match:  BtuI (ClaI)   ATCGAT
    if ($line =~ m/^(\w+)\s+\((\w+)\)\s+(\w+)$/) {
      $name = $1;
      $seq = $3;
      # $sites{$2} = $3. " ";  # Ignore these
    }
    # Match: Cac8I   GCNNNGC
    if ($line =~ m/^(\w+)\s+(\w+)$/) {
      $name = $1;
      $seq = $2;
    }
    if ($name eq "") { next; }
    # Post-process the sequence: replace IUB code with regex
    $seq =~ s/R/[GA]/gix;
    $seq =~ s/Y/[CT]/gix;
    $seq =~ s/M/[AC]/gix;
    $seq =~ s/K/[GT]/gix;
    $seq =~ s/S/[GC]/gix;
    $seq =~ s/W/[AT]/gix;
    $seq =~ s/B/[CGT]/gix;
    $seq =~ s/D/[AGT]/gix;
    $seq =~ s/H/[ACT]/gix;
    $seq =~ s/V/[ACG]/gix;
    $seq =~ s/N/[ACGT]/gix;
    $names{$name} = $seq;
  }
  return %names;
}

sub Main {
  @rebase_raw = rebase_read();
  %rebase = rebase_parse_to_regex(@rebase_raw);
  foreach $key (keys %rebase) {
    print "$key $rebase{$key}\n";
  }
}

__END__

Tisdall uses two techniques which I glossed over:

To ignore header data from the file, I use a temp variable and loop over the lines until the header lines are passed over. Tisdall uses a non-variable regular expression: ( 1 .. /Rich Roberts/ ) and next; which automatically discards the header lines. Alternatively, the solution could use a multi-line regex to match & discard the header.
To parse each line of data, I use a regex with stored expressions, with two expressions necessary for the two types of data lines found in the file format. Tisdall takes a much simpler approach, by splitting the line into an arbitrary number of fields based on whitespace, and then shifting the left side and popping the right side. This saves the two desired fields (the left-most column of data and right-most column of data) without requiring a regex or getting deep into the details of data line format (shown below). A much simpler approach.

# Solution by Tisdall
    # Get and store the name and the recognition site
    $name = shift @fields;
    $site = pop @fields;

The next part of the code searches for the regular expression in the given FASTA file’s DNA sequence, using the previously written fasta subroutine.

# Solution by JonathanCline
sub fasta_read {
  my $filenamein = $_[0];
  open (IN, "<".$filenamein) || die "cant open $filenamein";
  $_ = ;
  chomp;
  if (/^>(.*)/) {
    my $header = $_;
  }
  else {
    close (IN);
    die "not fasta format: $filenamein";
  }
  my $seq;
  while () {
    chomp;
    next if /^\s*$/;
    next if /^\s*#/;
    s/\s*//g;
    $seq .= $_;
  }
  return $seq;
}

sub rebase_match {
  my $strand = $_[0];
  my $regex = $_[1];
  my $index, @positions, $found;
  # Recall that regex has 1 or 2 fields; split out the re
  my @args = split(/\t/, $regex);
  my $re = pop @args;
  do {
    $strand =~ s/(^.*?)($re)//i;
    $found = 0;
    if ($1) {
      $index += length($1) + length($2);
      push(@positions, $index - length($2) + 1);
      $found = 1;
    }
  } until (!$found);
  return (@positions);
}

sub Main {
  my $sitename = $_[0];
  if (!($sitename =~ /\w+/)) {
    die "No such restriction site $sitename\n";
  }

  @rebase_raw = rebase_read($Rebase_filename);
  %rebase = rebase_parse_to_regex(@rebase_raw);

  my $dna = fasta_read("dna5.fasta.txt");
  $site_regex = $rebase{$sitename} || die "No match for $sitename\n";
  print "$site_regex\n";
  my @sites = rebase_match($dna, $site_regex);
  if (@sites) {
    print "Found at position: ".join (" ", @sites)."\n";
  }
  else {
    print "Not found.\n";
  }
}

My solution ended up to be a little more complex than Tisdall’s. Although running it gives the same results:

$ perl jc-example9-3.pl AceI
TseI	GC[AT]GC
Found at position: 54 94 582 660 696 702 840 855 957
$ perl jc-example9-3.pl AaeI
BamHI	GGATCC
Not found.
$ perl jc-example9-3.pl AccII
FnuDII	CGCG
Found at position: 181

There was a little fiddling with the math after the regex in order to get the same string index results. I used the same approach as before: match the regex in the string, replacing the match with null, until there is nothing left of the string or until there is no match. This requires some math to keep track of the index position of the original string while the string becomes continuously shortened. In contrast, Tisdall uses the m// operator with pos( ) which doesn’t modify the string and internally tracks the math using the built-in variable $& . Also of note, is my use of the non-greedy form of regex, which is not mentioned in the text: in order to match the smallest subset of sequence data preceeding the restriction site regex, I use s/(^.*?)($re)//i; with the question mark indicating non-greedy matching. The book doesn’t cover the concept of greedy matching (unless I skimmed over it).

Parsing GenBank Files

The author points out the following links for exploration regarding bioinformatics data and software.

National Center for Biotechnology Information (NCBI)
European Bioinformatics Institute (EBI)
European Molecular Biology Laboratory (EMBL)
GenBank file format ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

I have found the following to be useful as well.

bioinformatics.org Wiki
open-bio.org Open Bioinformatics Foundation
NCBI Free Bookshelf http://www.ncbi.nlm.nih.gov/sites/entrez?db=books

Now back to GenBank. The author cites an example of GenBank data (presumably to be used in later exercises).

LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
      301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
      361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
      421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
      481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
      541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca
      601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg
      661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg
      721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt
      781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg
      841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca
      901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag
      961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca
     1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta
     1081 cctctggatc ctgacctgta tcaggacttc tgtgcagggg cctttgatga ccatggcctg
     1141 ccctggatga gcgacacaga agagtcccca ttcctggacc ccgcgctgcg gaagagggca
     1201 gtgaaagtga agcatgtgaa gcgtcgggag aagaagtctg agaagaagaa ggaggagcga
     1261 tacaagcggc atcggcagaa gcagaagcac aaggataaat ggaaacaccc agagagggct
     1321 gatgccaagg accctgcgtc actgccccag tgcctggggc ccggctgtgt gcgccccgcc
     1381 cagcccagct ccaagtattg ctcagatgac tgtggcatga agctggcagc caaccgcatc
     1441 tacgagatcc tcccccagcg catccagcag tggcagcaga gcccttgcat tgctgaagag
     1501 cacggcaaga agctgctcga acgcattcgc cgagagcagc agagtgcccg cactcgcctt
     1561 caggaaatgg aacgccgatt ccatgagctt gaggccatca ttctacgtgc caagcagcag
     1621 gctgtgcgcg aggatgagga gagcaacgag ggtgacagtg atgacacaga cctgcagatc
     1681 ttctgtgttt cctgtgggca ccccatcaac ccacgtgttg ccttgcgcca catggagcgc
     1741 tgctacgcca agtatgagag ccagacgtcc tttgggtcca tgtaccccac acgcattgaa
     1801 ggggccacac gactcttctg tgatgtgtat aatcctcaga gcaaaacata ctgtaagcgg
     1861 ctccaggtgc tgtgccccga gcactcacgg gaccccaaag tgccagctga cgaggtatgc
     1921 gggtgccccc ttgtacgtga tgtctttgag ctcacgggtg acttctgccg cctgcccaag
     1981 cgccagtgca atcgccatta ctgctgggag aagctgcggc gtgcggaagt ggacttggag
     2041 cgcgtgcgtg tgtggtacaa gctggacgag ctgtttgagc aggagcgcaa tgtgcgcaca
     2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
     2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
     2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
     2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
     2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
     2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

To parse this file into annotation sections (LOCUS to ORIGIN) and sequence data (ORIGIN to end of record), Tisdall shows a few different approaches, the last being the most compact after a discussion of multi-line regex and record separators:

# Solution by Tisdall
# Set input separator to "//\n" and read in a record to a scalar
$/ = "//\n";

$record = ;

# reset input separator
$/ = $save_input_separator;

# Now separate the annotation from the sequence data
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

Instead, I used a longer approach with a state machine, parsing through lines in the file. Since the genbank data files are large, this should may or may not be faster, depending on the differences in memory and drive speed between reading in entire records into memory and accessing sequential lines from disk. Tisdall also doesn’t mention accession numbers until the next exercise, though I chose to save the data into a hash based on accession number and data type. He also doesn’t mention state machines — which is probably good, because that topic tends to bog down introductory texts — instead he later mentions using flags to track the portion of the file being processed.

# Solution by JonathanCline
#!/usr/bin/perl

$Genbank_filename = "record.gb.txt";
&Main(@ARGV);
exit;

# Print dna sequence string, limit max line length to N chars
sub dna_print {
  my $str = $_[0];
  do {
    $str =~ s/^([\w]{0,70})//;
    print "$1\n";
  } until (!length($str));
}

# Read genbank file into hash
sub genbank_read {
  my %data;
  open(FILEIN, "<".$_[0]) || die "cant read file";
  $state = 0;
  while () {
    next if (/^\s*$/);  # skip blank lines
    if (m/^LOCUS\s*(\w+)\s*/i) {
      # start of annotation
      $state = "a";
      $num = $1;  # accession number
      print "Parsing $num\n";
    }
    elsif (/^FEATURES/i) {
      # start of feature data
      $state = "f";
      print "Feature data\n";
      next;
    }
    elsif (/^ORIGIN/i) {
      # start of sequence data
      $state = "s";
      print "Seq data\n";
      next;
    }
    elsif (m!^//!) {
      # end of record
      $state = 0;
      $num = 0;
      next;
    }
    next if (!$state);
    if ($state eq "a") {
     # append data to annotation
      $data{$num . "-an"} .= $_;
    }
    if ($state eq "f") {
      # append data to features
      $data{$num . "-f"} .= $_;
    }
    elsif ($state eq "s") {
      # append data to sequence
      /^\s*(\d+)\s+(.*)$/;
      $data{$num . "-seq"} .= $2;
      $data{$num . "-seq"} =~ s/\s*//g;
    }
  }
  close FILEIN;
  return %data;
}

sub Main {

  %genbank_raw = genbank_read($Genbank_filename);
  print %genbank_raw;
  my $acc = "AB031069";
      # append data to sequence
      /^\s*(\d+)\s+(.*)$/;
      $data{$num . "-seq"} .= $2;
      $data{$num . "-seq"} =~ s/\s*//g;
    }
  }
  close FILEIN;
  return %data;
}

sub Main {

  %genbank_raw = genbank_read($Genbank_filename);
  print %genbank_raw;
  my $acc = "AB031069";
  my $dna = $genbank_raw{$acc."-seq"} || die "Accession not found.\n";

  dna_print($dna);
}

__END__

Later examples show further parsing based on section headings. In which case, the following question is very important and a frequent mistake of beginning-level programmers:

How can this software be made future-proof, so that changes in the database will automatically be parsable by the program?

This is a very important concept which separates the low level programming techs from the higher level R&D types — thinking ahead to future issues with the underlying programming interfaces. One method to approach this particular issue for genbank data, assuming the future database upgrades would add major sections differentiated by /^FIELDNAME/ , is to parse the data into the hash directly based on this name. Then, higher-level software can make simple changes to access the new field data. Some quick modifications change my previous fixed-text state machine into an arbitrary-text state machine, where each state is derived from the database data automatically:

# Solution by JonathanCline
# Read genbank file into hash
sub genbank_read {
  my %data;
  open(FILEIN, "<".$_[0]) || die "cant read file";
  $state = 0;
  while () {
    next if (/^\s*$/);  # skip blank lines
    if (m/^(LOCUS)\s*(\w+)\s*/i) {
      # start of annotation
      $state = $1;
      $num = $2;  # accession number
    }
    elsif (/^(\w+)\s+/i) {
      # start of other data
      $state = $1;
      next;
    }
    elsif (m!^//!) {
      # end of record
      $state = 0;
      $num = 0;
      next;
    }
    next if (!$state);
    if ($state ne "ORIGIN") {
      # append data to section
      $data{$num . "-" . $state} .= $_;
    }
    else {
      # append data to sequence
      /^\s*(\d+)\s+(.*)$/;
      next if !$2;
      my $strand = $2;
      $strand =~ s/\s*//g;
      $data{$num . "-" . $state} .= $strand;
    }
  }
  close FILEIN;
  return %data;
}

The result of the above is that each section is parsed directly into a hash with a key such as AB031069-ORIGIN for sequence data, or AB031069-REFERENCE for reference section data, and so on. Of course, I don’t know if this is an appropriate choice.. it depends on the users of the program (the biologists) — so normally the perl hacker would ask them first.

This blog entry has become pretty long. So it looks like it’s time to break into Part 4.

Look for the rest of the detailed review in “Part 4 of 5″.

2 Comments on In-Depth Review, Part 3 of 5: “Beginning Perl for Bioinformatics” by James Tisdall

Leslie says:

March 18, 2009 at 11:19 pm

I am currently a student using the Tisdall text for my bioinformatics Perl programming class and have found your blog very useful. I was looking for parts 4 and 5 of your review can you point me to them? I would greatly appreciate it!

Thanks,
Leslie
JonathanCline says:

March 25, 2009 at 10:46 pm

Leslie –
They’re in my archives, so I won’t be reposting them for at least a month or so. You can find Tisdall’s solutions online though, I think.

88 Proof Synth Bio Blog

Genetically Engineered Organisms, Systems Biology, and Synthetic Biology from an Engineer's Viewpoint