Category: Technical

In-Depth Review, Part 2 of 5: “Beginning Perl for Bioinformatics” by James Tisdall

Posted by – September 8, 2008

My Part 1 of 5 review of the book, Beginning Perl for Bioinformatics, by James Tisdall, left off at Chapter 8, just before Tisdall explains associative arrays, gene expression, FASTA files, genomic databases, and restriction sites.

Tisdall: “For simplicity, let’s say you have the names for all the genes in the organism and a number for the expressed genes indicating the level of the expression in your experiment; the unexpressed genes have the number 0. Now let’s suppose you want to know if the genes were expressed, but not the expression levels, and you want to solve this programming problem using arrays. After all, you are somewhat familiar with arrays by this point. How do you proceed?”

Perl’s associative arrays are one of the most powerful aspects of the language.  This is a good problem to examine using hashes.  Solutions to this kind of problem in other languages (C or matlab) might create an N-dimensional array (or even NxM) as a matrix representation of the problem.  In C, it might be solved using a lookup table possibly using a linked list, and the code to drive that needs to be written from scratch or borrowed from an external library.  Perl has a built-in method to solve these kinds of problems.

The solution is to use a hash:

$gene_name = "triA";
$level = 10;
$expression_levels{$gene_name} = $level;  # save 'level' on per-gene basis

This leads Tisdall to review biological transcription and translation, including code for DNA->RNA and RNA->protein data conversion.  The code is given in long form and then optimized in further examples for speed using associative arrays.  Recall the central dogma of biology:

More

In-Depth Review, Part 1 of 5: “Beginning Perl for Bioinformatics” by James Tisdall

Posted by – September 6, 2008

As a specialized field, Bioinformatics is rather young.  It can be difficult to find universities which teach bioinformatics.  Bioinformatics can refer to many different types of tasks — from using programs and data without any computer science knowledge, to implementing database or web software, to writing data conversion programs which modify file formats between database storage methods, to writing algorithms for modeling and visualizing research problems.  Most of the work is described best as “computational biology”.

In the context of Perl (the famous computer language which runs underneath most web pages), Bioinformatics means computing text data retreived from biological databases.

The book, Beginning Perl for Bioinformatics, by James Tisdall, is for learning introductory software techniques in Perl, with a very brief biology review.  For biologists who have rarely programmed and need a starting language or need to learn Perl, this is a good place to start.  For technologists, note the copyright date on the book, to see how dated the information may be; since bioinformatics is still a young field, standards and technology are evolving rapidly.

Tisdall: “A large part of what you, the Perl bioinformatics programmer, will spend your time doing amounts to variations on the same theme as Examples 4-1 and 4-2. You’ll get some data, be it DNA, proteins, GenBank entries, or what have you; you’ll manipulate the data; and you’ll print out some results.”   (Chapter 4)

For software engineers or computer programmers, the biology field is also a completely new realm which is tough to get a handle on, and has it’s own language: Biology as a field (at least to me) has not yet differentiated itself between “soft, life science” and an engineering science.  For example, as a software engineer, the most basic software question is, “I need to write a look-up table for these elements, what are the all the possible strings for the field values?”  Yet this simple question can be very difficult to answer by consulting a biology textbook.  It is important to keep in mind that data manipulation for biology can involve massive amounts of information: also known as, very, very large strings; the strings represent DNA sequences which may range in practical usage from 10k to 100k.

Perl Bioinformatics Introductory Examples

The author states,

Tisdall: How do you start from scratch and come up with a program that counts the regulatory elements in some DNA? Read on.”

In chapter 4, there are the first simple Perl examples:  convert the DNA sequence to the corresponding RNA sequence.  In biology, the DNA uses A, T, G, C (representing the chemical names, of course); whereas RNA uses U instead of T.  Simple string manipulation provides the answer:  s/T/U/g;

More

The World’s Smallest Organisms & Their Importance

Posted by – March 30, 2008

Biology is faced with the problem:  the field in general does not understand much of what it studies.  One method of solving this is to study the smallest organisms known, in an effort to “understand 100% of something small.”  This is a typical engineering approach, to understand the smallest system first, then work up to attempting to understand larger systems.  After understanding the organisms, of course, we have a better chance at modifying them (their DNA) into something useful (or, rather, something that actually lives at all).

More