Part 1 of my five-part review of James Tisdall's book, Beginning Perl for Bioinformatics, left off at Chapter 8, just before Tisdall explains associative arrays, gene expression, FASTA files, genomic databases, and restriction sites.
Tisdall: “For simplicity, let’s say you have the names for all the genes in the organism and a number for the expressed genes indicating the level of the expression in your experiment; the unexpressed genes have the number 0. Now let’s suppose you want to know if the genes were expressed, but not the expression levels, and you want to solve this programming problem using arrays. After all, you are somewhat familiar with arrays by this point. How do you proceed?”
Perl’s associative arrays, better known as hashes, are one of the most powerful features of the language, and this is a good problem to examine with them. A solution in another language (C or MATLAB) might store the data in an array of N entries, or even in an NxM matrix. In C, it might instead use a lookup table, possibly built on a linked list, and the code to drive it would need to be written from scratch or borrowed from an external library. Perl has a built-in data type for exactly this kind of problem.
The solution is to use a hash:
$gene_name = "triA";
$level = 10;
$expression_levels{$gene_name} = $level;   # save 'level' on a per-gene basis
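To make the point concrete, here is a minimal sketch of how the hash answers Tisdall's original question (was a gene expressed or not?). This is my own illustration rather than code from the book: the gene names other than triA, the levels, and the helper is_expressed are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical data: gene name => expression level (0 means unexpressed).
my %expression_levels = (
    triA => 10,
    xyzB => 0,
    abcC => 3,
);

# Hypothetical helper: true if the gene is in the hash and its level is above zero.
# The 'exists' check keeps unknown gene names from being treated as expressed.
sub is_expressed {
    my ($levels, $gene) = @_;
    return exists $levels->{$gene} && $levels->{$gene} > 0;
}

# Collect the expressed genes in sorted order so the output is stable.
my @expressed = grep { is_expressed(\%expression_levels, $_) }
                sort keys %expression_levels;

print "$_ is expressed\n" for @expressed;
```

Each lookup goes straight to the gene name, so there is no need to scan an array of names to find the matching level, which is the step the array-based approach would have to write by hand.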
This leads Tisdall to review biological transcription and translation, including code for DNA->RNA and RNA->protein data conversion. The code is first given in long form, and later examples optimize it for speed using associative arrays. Recall the central dogma of biology: