In my previous write-ups of Part 1 and Part 2, I traced the Perl code and examples in the first half of the book, Beginning Perl for Bioinformatics, by James Tisdall, highlighting different approaches to bioinformatics in Perl. As I mentioned before, Perl provides many different (and often stylistic) methods to solving a software problem. The different methods usually differ in execution speed, code size, code scalability, readability / maintainability, simplicity, and advanced Perl symantics. Since this is a beginning text, the advanced Perl isn’t covered.. that means templates, which could be useful for parsing bioinformatics data, are one of the topics not included here.
Often, the fastest code is the smallest code, and contains subtle code tricks for optimization. This is a perfect setup, because, in Chapter 8, Tisdall starts parsing FASTA files. With Perl’s parsing engine, the subtly of the tricks leaves a lot of room for optimizing software.
FASTA & Sequence Translation
Tisdall offers a software problem based on the FASTA data, so time to solve it:
Tisdall: When you try to print the “raw” sequence data, it can be a problem if the data is much longer than the width of the page. For most practical purposes, 80 characters is about the maximum length you should try to fit across a page. Let’s write a print_sequence subroutine that takes as its arguments some sequence and a line length and prints out the sequence, breaking it up into lines of that length.
Compare his solution to mine:
# Solution by Tisdall # print_sequence # # A subroutine to format and print sequence data sub print_sequence { my($sequence, $length) = @_; use strict; use warnings; # Print sequence in lines of $length for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) { print substr($sequence, $pos, $length), "\n"; } }
The above is a straightforward, strings-based approach. I chose a regex approach, which took a couple minutes to work out, though should be faster during run-time:
sub dna_print { my $str = $_[0]; do { $str =~ s/^([\w]{0,25})//; print "$1\n"; } until (!length($str)); }
The above relies on the following method:
Copy the string to a temporary value, then loop over it’s contents, replacing the first 25 characters with null, then print the match.
My method has the drawback that the line length is fixed in software; I set it to 25. Although it should be possible to create self-modifying code (generate the code as a string, and evaluate the string as code, to recurse through the string data), I didn’t get that to work in under 10 minutes, so it’s a problem for another day.
Tisdall gets mad props for publishing the run-time output of his example scripts, allowing readers to diff custom solution output against his output. My output matches his, so my implementation should be bug-free.
His next problem is to print the translated proteins instead of printing the DNA. Refer to the central dogma. The script should read the FASTA DNA, translate, and print the resultant protein. My entire solution is below. Write your own & compare.
# Solution by JonathanCline #!/usr/bin/perl Main(@ARGV); exit; sub codon2aa { my $codon = $_[0]; $codon =~ tr/a-z/A-Z/; if (defined $TRANSLATE{$codon}) { return $TRANSLATE{$codon}; } else { warn "can't translate ".$codon; } return; } sub fasta_read { my $filenamein = $_[0]; open (IN, "<".$filenamein) || die "cant open $filenamein"; $_ = ; chomp; if (/^>(.*)/) { my $header = $_; } else { close (IN); die "not fasta format: $filenamein"; } my $seq; while () { chomp; next if /^\s*$/; next if /^\s*#/; s/\s*//g; $seq .= $_; } return $seq; } sub dna_print { my $str = $_[0]; do { $str =~ s/^([\w]{0,25})//; print "$1\n"; } until (!length($str)); } sub Main { # Read translation data from __DATA__ while () { chomp; last if (/__END__/); if (/^([\w_]+)\s([\w_]+)/) { $TRANSLATE{$1} = $2; } else { die "parse error $_"; } } my $codon; my $protein; my $dna = fasta_read("dna5.fasta.txt"); # Translate each three-base codon into an amino acid, and append to a protein for(my $i=0; $i < (length($dna) - 2) ; $i += 3) { $codon = substr($dna,$i,3); $protein .= codon2aa($codon); } print "DNA:\n"; dna_print($dna); print "Protein:\n"; dna_print($protein); return; } __DATA__ TCA S # Serine TCC S # Serine TCG S # Serine TCT S # Serine TTC F # Phenylalanine TTT F # Phenylalanine TTA L # Leucine TTG L # Leucine TAC Y # Tyrosine TAT Y # Tyrosine TAA _ # Stop TAG _ # Stop TGC C # Cysteine TGT C # Cysteine TGA _ # Stop TGG W # Tryptophan CTA L # Leucine CTC L # Leucine CTG L # Leucine CTT L # Leucine CCA P # Proline CCC P # Proline CCG P # Proline CCT P # Proline CAC H # Histidine CAT H # Histidine CAA Q # Glutamine CAG Q # Glutamine CGA R # Arginine CGC R # Arginine CGG R # Arginine CGT R # Arginine ATA I # Isoleucine ATC I # Isoleucine ATT I # Isoleucine ATG M # Methionine ACA T # Threonine ACC T # Threonine ACG T # Threonine ACT T # Threonine AAC N # Asparagine AAT N # Asparagine AAA K # Lysine AAG K # Lysine AGC S # Serine AGT S # Serine AGA R # Arginine AGG R # Arginine GTA V # Valine GTC V # Valine GTG V # Valine GTT V # Valine GCA A # Alanine GCC A # Alanine GCG A # Alanine GCT A # Alanine GAC D # Aspartic Acid GAT D # Aspartic Acid GAA E # Glutamic Acid GAG E # Glutamic Acid GGA G # Glycine GGC G # Glycine GGG G # Glycine GGT G # Glycine __END__
The input file:
> sample dna | (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc acacctgagccactctcagatgaggaccta
The script output matches the solution. (Generally, a good thing.)
# Output DNA: agatggcggcgctgaggggtcttgg gggctctaggccggccacctactgg tttgcagcggagacgacgcatgggg cctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagt agttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgc aggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcgga gatatggagggagatggttcagacc cagagcctccagatgccggggagga cagcaagtccgagaatggggagaat gcgcccatctactgcatctgccgca aaccggacatcaactgcttcatgat cgggtgtgacaactgcaatgagtgg ttccatggggactgcatccggatca ctgagaagatggccaaggccatccg ggagtggtactgtcgggagtgcaga gagaaagaccccaagctagagattc gctatcggcacaagaagtcacggga gcgggatggcaatgagcgggacagc agtgagccccgggatgagggtggag ggcgcaagaggcctgtccctgatcc agacctgcagcgccgggcagggtca gggacaggggttggggccatgcttg ctcggggctctgcttcgccccacaa atcctctccgcagcccttggtggcc acacccagccagcatcaccagcagc agcagcagcagatcaaacggtcagc ccgcatgtgtggtgagtgtgaggca tgtcggcgcactgaggactgtggtc actgtgatttctgtcgggacatgaa gaagttcgggggccccaacaagatc cggcagaagtgccggctgcgccagt gccagctgcgggcccgggaatcgta caagtacttcccttcctcgctctca ccagtgacgccctcagagtccctgc caaggccccgccggccactgcccac ccaacagcagccacagccatcacag aagttagggcgcatccgtgaagatg agggggcagtggcgtcatcaacagt caaggagcctcctgaggctacagcc acacctgagccactctcagatgagg accta Protein: RWRR_GVLGALGRPPTGLQRRRRMG PAQ_EYAAWEA_LEAEVVVGAFATA WDAAEWSVQVRGSLAGVVRECAGSG DMEGDGSDPEPPDAGEDSKSENGEN APIYCICRKPDINCFMIGCDNCNEW FHGDCIRITEKMAKAIREWYCRECR EKDPKLEIRYRHKKSRERDGNERDS SEPRDEGGGRKRPVPDPDLQRRAGS GTGVGAMLARGSASPHKSSPQPLVA TPSQHHQQQQQQIKRSARMCGECEA CRRTEDCGHCDFCRDMKKFGGPNKI RQKCRLRQCQLRARESYKYFPSSLS PVTPSESLPRPRRPLPTQQQPQPSQ KLGRIREDEGAVASSTVKEPPEATA TPEPLSDEDL
A very important simplification to these problems has been made: the reading frame is assumed to be aligned at the start of the data. Reference wikipedia (not the ideal source, I know; though, it’s a freely quotable source):
In biology, a reading frame is a contiguous and non-overlapping set of three-nucleotide codons in DNA or RNA. There are 3 possible reading frames in a mRNA strand and six in a double stranded DNA molecule due to the two strands from which transcription is possible. This leads to the possibility of overlapping genes and there may be many of these in bacteria.[1] Some viruses e.g. HBV and BYDV use several overlapping genes in different reading frames.
In rare cases a translating ribosome may shift from one frame to another, a translational frameshift. It is distinct from a frameshift mutation as the nucleotide sequence (DNA or RNA) is not altered only the frame in which it is read.
An open reading frame (ORF) is a reading frame that contains a start codon, the subsequent region which usually has a length which is a multiple of 3 nucleotides, and end with a stop codon.
Given the fact that translation might not occur at the start of the sequence data, Tisdall poses the problem of translating all 6 reading frames: 1, 2, and 3 are presumably from the given strand and offset by 0, 1, and 2 characters, respectively; the frames 4, 5, and 6, are presumably from the given strand’s compliment, and offset by 0, 1, and 2 characters, respectively. This means creating the compliment subroutine, and using offsets in the translation. I will use Tisdall’s subroutine names to reduce confusion.. and include the minor modifications made to the original script above. The solution is basically identical to Tisdall’s:
# Solution by JonathanCline # Reverse compliment DNA strand # $_[0] = dna sequence string # returns: dna sequence string sub revcom { my $strand1 = $_[0]; my $strand2; $strand2 = reverse $strand1; $strand2 =~ tr/ACGTacgt/TGCAtgca/; return $strand2; } # Translate each three-base codon into an amino acid, and append to a protein # $_[0] = dna sequence string # $_[1] = offset in string for reading frame sub dna2peptide { my $dna = $_[0]; my $offset = $_[1]; my $pos; my $codon; my $protein; for(my $pos = $offset; $pos < (length($dna) - 2) ; $pos += 3) { $codon = substr($dna, $pos, 3); $protein .= codon2aa($codon); } return $protein; } sub Main { # Read translation data from __DATA__ while () { chomp; last if (/__END__/); if (/^([\w_]+)\s([\w_]+)/) { $TRANSLATE{$1} = $2; } else { die "parse error $_"; } } my $codon; my $protein; my $dna = fasta_read("dna5.fasta.txt"); print "DNA:\n"; dna_print($dna); for $frame (0 .. 2) { $protein = dna2peptide($dna, $frame); print "Protein (Frame $frame):\n"; dna_print($protein); } # Do reverse compliments my $dna2 = revcom($dna); for $frame (0 .. 2) { $protein = dna2peptide($dna2, $frame); print "Protein (Frame ".($frame + 4)."):\n"; dna_print($protein); } return; }
That concludes Chapter 8, which means: onto some interesting exercise problems.
Tisdall: Exercise 8.1: Write a subroutine that checks a string and returns true if it’s a DNA sequence. Write another that checks for protein sequence data.
Here’s my solution. Seems to work.
# Solution by JonathanCline #!/usr/bin/perl &Main(@ARGV); exit 0; sub verify_dna { my $dna = $_[0]; $dna =~ s/[ACGT]+//gix; $dna =~ s/\s*//gix; if (length($dna) > 0) { return false; } return true; } sub Main { @data = ; if (verify_dna(join("", @data)) eq true) { print "DNA!\n"; } else { print "not DNA.\n"; } return; }
I believe this verifies for proteins by changing the main regex line above, to :
$dna =~ s/[SFLY_CWPHQRIMTNKVDEG]+//gix;
The later exercises are more challenging, and I’ll save them for a rainy and highly-paid day:
Tisdall: For each codon, make note of what effect single nucleotide mutations have on the codon: does the same amino acid result, or does the codon now encode a different amino acid? Which one? Write a subroutine that, given a codon, returns a list of all the amino acids that may result from any single mutation in the codon.
Write a subroutine that, given an amino acid, randomly changes it to one of the amino acids calculated in the previous subroutine. Write a program that randomly mutates the amino acids in a protein but restricts the possibilities to those that can occur due to a single mutation in the original codons, as in the previous subroutines. Some codons are more likely than others to occur in random DNA. For instance, there are 6 of the 64 possible codons that code for the amino acid serine, but only 2 of the 64 codes for phenylalanine.
Write a subroutine that, given an amino acid, returns the probability that it’s coded by a randomly generated codon.
Write a program that, given two amino acids, returns the probability that a single mutation in their underlying (but unspecified) codons results in the codon of one amino acid mutating to the codon of the other amino acid.
Restriction Enzymes & REBASE Format Translation
At this point, the author introduces more complex regular expressions, and suggests the reader look at the perlre documentation. I’m sure that looking at the regex docs is bound to scare most biologists, kind of like memorizing taxonomy scares most computer scientists.
First, some biology background. (Quoted from wikipedia)
A restriction enzyme (or restriction endonuclease) is an enzyme that cuts double-stranded DNA at specific recognition nucleotide sequences known as restriction sites. Such enzymes, found in bacteria and archaea, are thought to have evolved to provide a defense mechanism against invading viruses. Inside a bacterial host, the restriction enzymes selectively cut up foreign DNA in a process called restriction; host DNA is methylated by a modification enzyme (a methylase) to protect it from the restriction enzyme’s activity. Collectively, these two processes form the restriction modification system. To cut the DNA, a restriction enzyme makes two incisions, once through each sugar-phosphate backbone (i.e. each strand) of the DNA double helix.
A restriction map is a map of known restriction sites within a sequence of DNA. Restriction mapping requires the use of restriction enzymes. In molecular biology, restriction maps—along with DNA-DNA hybridization, and DNA or RNA sequence analysis—are used to determine the genetic relationships between two or more subjects, often different species, at the molecular level.
For readers from a non-biology background, Tisdall sums this up very simply: restriction enzymes cut the sequence of DNA at specific string sequences: “HindIII cuts at AAGCTT". Thus these could form regular expressions for processing DNA sequence strings. Based on the data at REBASE, http://rebase.neb.com/rebase/rebase.html the author suggests collecting restriction enzyme data for conversion into regular expressions for matching. Of course, this book was written several years ago, so his example is using bionet.104 from 2001, while the current version is
REBASE version 811 bionet.811
This is a nice reminder that biological data is being updated rapidly. (In fact, the URL in the book was no longer valid. Use google as necessary.)
Tisdall: The plan is to write a program that includes restriction enzyme data translated into regular expressions, stored as the values of the keys of the restriction enzyme names. DNA sequence data will be used from the file, and the user will be prompted for names of restriction enzymes. The appropriate regular expression will be retrieved from the hash, and we’ll search for all instances of that regular expression, plus their locations. Finally, the list of locations found will be returned.
In his program, Tisdall beats my solution in two nifty ways. My solution is below.
# Solution by JonathanCline #!/usr/bin/perl $Rebase_filename = "rebase.bionet.811.txt"; &Main(@ARGV); exit; # Read rebase 'bionet' file into memory sub rebase_read { my @data; open(FILEIN, "<".$Rebase_filename) || die "cant read file"; my $start = 0; while () { if (/^Rich /) { $start = 1; next; } next if (!$start); next if (/^\s/); push(@data, $_); } close FILEIN; return @data; } # Parse text lines of rebase 'bionet' file format into hash # # These are the IUB ambiguity codes # (Eur. J. Biochem. 150: 1-5, 1985): # R = G or A # Y = C or T # M = A or C # K = G or T # S = G or C # W = A or T # B = not A (C or G or T) # D = not C (A or G or T) # H = not G (A or C or T) # V = not T (A or C or G) # N = A or C or G or T sub rebase_parse_to_regex { my %names; my $name; my $seq; my @lines = @_; foreach $line (@lines) { $line =~ s/[\n^]//g; # Remove \n and ^ prior to match $name = ""; # Match: BtuI (ClaI) ATCGAT if ($line =~ m/^(\w+)\s+\((\w+)\)\s+(\w+)$/) { $name = $1; $seq = $3; # $sites{$2} = $3. " "; # Ignore these } # Match: Cac8I GCNNNGC if ($line =~ m/^(\w+)\s+(\w+)$/) { $name = $1; $seq = $2; } if ($name eq "") { next; } # Post-process the sequence: replace IUB code with regex $seq =~ s/R/[GA]/gix; $seq =~ s/Y/[CT]/gix; $seq =~ s/M/[AC]/gix; $seq =~ s/K/[GT]/gix; $seq =~ s/S/[GC]/gix; $seq =~ s/W/[AT]/gix; $seq =~ s/B/[CGT]/gix; $seq =~ s/D/[AGT]/gix; $seq =~ s/H/[ACT]/gix; $seq =~ s/V/[ACG]/gix; $seq =~ s/N/[ACGT]/gix; $names{$name} = $seq; } return %names; } sub Main { @rebase_raw = rebase_read(); %rebase = rebase_parse_to_regex(@rebase_raw); foreach $key (keys %rebase) { print "$key $rebase{$key}\n"; } } __END__
Tisdall uses two techniques which I glossed over:
- To ignore header data from the file, I use a temp variable and loop over the lines until the header lines are passed over. Tisdall uses a non-variable regular expression: ( 1 .. /Rich Roberts/ ) and next; which automatically discards the header lines. Alternatively, the solution could use a multi-line regex to match & discard the header.
- To parse each line of data, I use a regex with stored expressions, with two expressions necessary for the two types of data lines found in the file format. Tisdall takes a much simpler approach, by splitting the line into an arbitrary number of fields based on whitespace, and then shifting the left side and popping the right side. This saves the two desired fields (the left-most column of data and right-most column of data) without requiring a regex or getting deep into the details of data line format (shown below). A much simpler approach.
# Solution by Tisdall # Get and store the name and the recognition site $name = shift @fields; $site = pop @fields;
The next part of the code searches for the regular expression in the given FASTA file’s DNA sequence, using the previously written fasta subroutine.
# Solution by JonathanCline sub fasta_read { my $filenamein = $_[0]; open (IN, "<".$filenamein) || die "cant open $filenamein"; $_ = ; chomp; if (/^>(.*)/) { my $header = $_; } else { close (IN); die "not fasta format: $filenamein"; } my $seq; while () { chomp; next if /^\s*$/; next if /^\s*#/; s/\s*//g; $seq .= $_; } return $seq; } sub rebase_match { my $strand = $_[0]; my $regex = $_[1]; my $index, @positions, $found; # Recall that regex has 1 or 2 fields; split out the re my @args = split(/\t/, $regex); my $re = pop @args; do { $strand =~ s/(^.*?)($re)//i; $found = 0; if ($1) { $index += length($1) + length($2); push(@positions, $index - length($2) + 1); $found = 1; } } until (!$found); return (@positions); } sub Main { my $sitename = $_[0]; if (!($sitename =~ /\w+/)) { die "No such restriction site $sitename\n"; } @rebase_raw = rebase_read($Rebase_filename); %rebase = rebase_parse_to_regex(@rebase_raw); my $dna = fasta_read("dna5.fasta.txt"); $site_regex = $rebase{$sitename} || die "No match for $sitename\n"; print "$site_regex\n"; my @sites = rebase_match($dna, $site_regex); if (@sites) { print "Found at position: ".join (" ", @sites)."\n"; } else { print "Not found.\n"; } }
My solution ended up to be a little more complex than Tisdall’s. Although running it gives the same results:
$ perl jc-example9-3.pl AceI TseI GC[AT]GC Found at position: 54 94 582 660 696 702 840 855 957 $ perl jc-example9-3.pl AaeI BamHI GGATCC Not found. $ perl jc-example9-3.pl AccII FnuDII CGCG Found at position: 181
There was a little fiddling with the math after the regex in order to get the same string index results. I used the same approach as before: match the regex in the string, replacing the match with null, until there is nothing left of the string or until there is no match. This requires some math to keep track of the index position of the original string while the string becomes continuously shortened. In contrast, Tisdall uses the m// operator with pos( ) which doesn’t modify the string and internally tracks the math using the built-in variable $& . Also of note, is my use of the non-greedy form of regex, which is not mentioned in the text: in order to match the smallest subset of sequence data preceeding the restriction site regex, I use s/(^.*?)($re)//i; with the question mark indicating non-greedy matching. The book doesn’t cover the concept of greedy matching (unless I skimmed over it).
Parsing GenBank Files
The author points out the following links for exploration regarding bioinformatics data and software.
- National Center for Biotechnology Information (NCBI)
- European Bioinformatics Institute (EBI)
- European Molecular Biology Laboratory (EMBL)
- GenBank file format ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
I have found the following to be useful as well.
- bioinformatics.org Wiki
- open-bio.org Open Bioinformatics Foundation
- NCBI Free Bookshelf http://www.ncbi.nlm.nih.gov/sites/entrez?db=books
Now back to GenBank. The author cites an example of GenBank data (presumably to be used in later exercises).
LOCUS AB031069 2487 bp mRNA PRI 27-MAY-2000 DEFINITION Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds. ACCESSION AB031069 VERSION AB031069.1 GI:8100074 KEYWORDS . SOURCE Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to mRNA. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (sites) AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and Takano,T. TITLE PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain, is regulated by proteolysis JOURNAL Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000) MEDLINE 20261256 REFERENCE 2 (bases 1 to 2487) AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and Takano,T. TITLE Direct Submission JOURNAL Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases. Tadahiro Fujino, Keio University School of Medicine, Department of Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan (E-mail:fujino@microb.med.keio.ac.jp, Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508) FEATURES Location/Qualifiers source 1..2487 /organism="Homo sapiens" /db_xref="taxon:9606" /sex="male" /cell_line="HuS-L12" /cell_type="lung fibroblast" /dev_stage="embryo" gene 229..2199 /gene="PCCX1" CDS 229..2199 /gene="PCCX1" /note="a nuclear protein carrying a PHD finger and a CXXC domain" /codon_start=1 /product="protein containing CXXC domain 1" /protein_id="BAA96307.1" /db_xref="GI:8100075" /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR" BASE COUNT 564 a 715 c 768 g 440 t ORIGIN 1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg 61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg 121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt 181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat 241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat 301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac 361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc 421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat 481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat 541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca 601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg 661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg 721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt 781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg 841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca 901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag 961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca 1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta 1081 cctctggatc ctgacctgta tcaggacttc tgtgcagggg cctttgatga ccatggcctg 1141 ccctggatga gcgacacaga agagtcccca ttcctggacc ccgcgctgcg gaagagggca 1201 gtgaaagtga agcatgtgaa gcgtcgggag aagaagtctg agaagaagaa ggaggagcga 1261 tacaagcggc atcggcagaa gcagaagcac aaggataaat ggaaacaccc agagagggct 1321 gatgccaagg accctgcgtc actgccccag tgcctggggc ccggctgtgt gcgccccgcc 1381 cagcccagct ccaagtattg ctcagatgac tgtggcatga agctggcagc caaccgcatc 1441 tacgagatcc tcccccagcg catccagcag tggcagcaga gcccttgcat tgctgaagag 1501 cacggcaaga agctgctcga acgcattcgc cgagagcagc agagtgcccg cactcgcctt 1561 caggaaatgg aacgccgatt ccatgagctt gaggccatca ttctacgtgc caagcagcag 1621 gctgtgcgcg aggatgagga gagcaacgag ggtgacagtg atgacacaga cctgcagatc 1681 ttctgtgttt cctgtgggca ccccatcaac ccacgtgttg ccttgcgcca catggagcgc 1741 tgctacgcca agtatgagag ccagacgtcc tttgggtcca tgtaccccac acgcattgaa 1801 ggggccacac gactcttctg tgatgtgtat aatcctcaga gcaaaacata ctgtaagcgg 1861 ctccaggtgc tgtgccccga gcactcacgg gaccccaaag tgccagctga cgaggtatgc 1921 gggtgccccc ttgtacgtga tgtctttgag ctcacgggtg acttctgccg cctgcccaag 1981 cgccagtgca atcgccatta ctgctgggag aagctgcggc gtgcggaagt ggacttggag 2041 cgcgtgcgtg tgtggtacaa gctggacgag ctgtttgagc aggagcgcaa tgtgcgcaca 2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat 2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt 2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat 2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag 2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt 2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa 2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa //
To parse this file into annotation sections (LOCUS to ORIGIN) and sequence data (ORIGIN to end of record), Tisdall shows a few different approaches, the last being the most compact after a discussion of multi-line regex and record separators:
# Solution by Tisdall # Set input separator to "//\n" and read in a record to a scalar $/ = "//\n"; $record = ; # reset input separator $/ = $save_input_separator; # Now separate the annotation from the sequence data ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
Instead, I used a longer approach with a state machine, parsing through lines in the file. Since the genbank data files are large, this should may or may not be faster, depending on the differences in memory and drive speed between reading in entire records into memory and accessing sequential lines from disk. Tisdall also doesn’t mention accession numbers until the next exercise, though I chose to save the data into a hash based on accession number and data type. He also doesn’t mention state machines — which is probably good, because that topic tends to bog down introductory texts — instead he later mentions using flags to track the portion of the file being processed.
# Solution by JonathanCline #!/usr/bin/perl $Genbank_filename = "record.gb.txt"; &Main(@ARGV); exit; # Print dna sequence string, limit max line length to N chars sub dna_print { my $str = $_[0]; do { $str =~ s/^([\w]{0,70})//; print "$1\n"; } until (!length($str)); } # Read genbank file into hash sub genbank_read { my %data; open(FILEIN, "<".$_[0]) || die "cant read file"; $state = 0; while () { next if (/^\s*$/); # skip blank lines if (m/^LOCUS\s*(\w+)\s*/i) { # start of annotation $state = "a"; $num = $1; # accession number print "Parsing $num\n"; } elsif (/^FEATURES/i) { # start of feature data $state = "f"; print "Feature data\n"; next; } elsif (/^ORIGIN/i) { # start of sequence data $state = "s"; print "Seq data\n"; next; } elsif (m!^//!) { # end of record $state = 0; $num = 0; next; } next if (!$state); if ($state eq "a") { # append data to annotation $data{$num . "-an"} .= $_; } if ($state eq "f") { # append data to features $data{$num . "-f"} .= $_; } elsif ($state eq "s") { # append data to sequence /^\s*(\d+)\s+(.*)$/; $data{$num . "-seq"} .= $2; $data{$num . "-seq"} =~ s/\s*//g; } } close FILEIN; return %data; } sub Main { %genbank_raw = genbank_read($Genbank_filename); print %genbank_raw; my $acc = "AB031069"; # append data to sequence /^\s*(\d+)\s+(.*)$/; $data{$num . "-seq"} .= $2; $data{$num . "-seq"} =~ s/\s*//g; } } close FILEIN; return %data; } sub Main { %genbank_raw = genbank_read($Genbank_filename); print %genbank_raw; my $acc = "AB031069"; my $dna = $genbank_raw{$acc."-seq"} || die "Accession not found.\n"; dna_print($dna); } __END__
Later examples show further parsing based on section headings. In which case, the following question is very important and a frequent mistake of beginning-level programmers:
How can this software be made future-proof, so that changes in the database will automatically be parsable by the program?
This is a very important concept which separates the low level programming techs from the higher level R&D types — thinking ahead to future issues with the underlying programming interfaces. One method to approach this particular issue for genbank data, assuming the future database upgrades would add major sections differentiated by /^FIELDNAME/ , is to parse the data into the hash directly based on this name. Then, higher-level software can make simple changes to access the new field data. Some quick modifications change my previous fixed-text state machine into an arbitrary-text state machine, where each state is derived from the database data automatically:
# Solution by JonathanCline # Read genbank file into hash sub genbank_read { my %data; open(FILEIN, "<".$_[0]) || die "cant read file"; $state = 0; while () { next if (/^\s*$/); # skip blank lines if (m/^(LOCUS)\s*(\w+)\s*/i) { # start of annotation $state = $1; $num = $2; # accession number } elsif (/^(\w+)\s+/i) { # start of other data $state = $1; next; } elsif (m!^//!) { # end of record $state = 0; $num = 0; next; } next if (!$state); if ($state ne "ORIGIN") { # append data to section $data{$num . "-" . $state} .= $_; } else { # append data to sequence /^\s*(\d+)\s+(.*)$/; next if !$2; my $strand = $2; $strand =~ s/\s*//g; $data{$num . "-" . $state} .= $strand; } } close FILEIN; return %data; }
The result of the above is that each section is parsed directly into a hash with a key such as AB031069-ORIGIN for sequence data, or AB031069-REFERENCE for reference section data, and so on. Of course, I don’t know if this is an appropriate choice.. it depends on the users of the program (the biologists) — so normally the perl hacker would ask them first.
This blog entry has become pretty long. So it looks like it’s time to break into Part 4.
I am currently a student using the Tisdall text for my bioinformatics Perl programming class and have found your blog very useful. I was looking for parts 4 and 5 of your review can you point me to them? I would greatly appreciate it!
Thanks,
Leslie
Leslie –
They’re in my archives, so I won’t be reposting them for at least a month or so. You can find Tisdall’s solutions online though, I think.