This article is the second part of my previous introduction to the PERL language. Here, I am going to show the use of the Perl language in simple bioinformatics applications. I will introduce by examples other aspects of this powerful language.
PROJECT 1: GENOME SEQUENCE ANALYSIS
The following program calculates the DNA statistics by performing the following analysis:
- A,G,C,T bases
- C+G bases
- (G-C)/(G+C) and (T-A)/(T+A) skews
- Cumulative DNA skews (or DNA walks)
- Codon statistics
The DNA skew plot is used to identify the origin and terminus of the replication fork (see for example this Wikipedia page for more details). The plot is calculated using running window method with windows of size $nw, and steps of size $step, see Figure 1.

The program in PERL used for the analysis is reported in the Appendix of this article.
Examples
1. Analysis of a bacterial genome of Lyme borreliosis disease vector
The vectors of Lyme-disease are the ticks. In Europe, Lyme disease is caused by infection with one or more pathogenic European genospecies of the spirochaete Borrelia burgdorferi and garinii, mainly transmitted by the tick Ixodes ricinus. In many cases, the bite is manifested with the formation of an erythema migans but in other cases, it can be asymptomatic. In case of a tick bite occurring in an area with high risk of Lyme disease infection, It is important to consult a doctor for a screening since Lyme disease can be cured if it is taken in time.
The program is used to perform a statistical analysis of the Borrelia burgdorferi chromosome. The chromosome sequence (AE000783) was obtained from the GenBank.
In Figure 2, the output of the program gives the nucleotide composition and the percentage of the first 6 most abundant codons.

The complete histogram of the codon usage is also calculated and reported in Figure 3.

The GC skew plots in picture.png show that for this genome the origin and the terminus the DNA replication forks coincide. The plots have been calculated using $nw=10k and $step=5k.

Finally, the cumulative skew curves are reported in Figure 4.

See Fraser et al. NATURE, Vol: 390, 580, (1997) for the details about the analysis of the genome of Borrelia burgdorferi.
2. Analysis of the genome of tracheobronchitis causing bacterium
Mycoplasma pneumoniae is a bacterium that commonly causes infections of the respiratory system. Especially in children, that cause the development of tracheobronchitis, commonly called a chest cold. Sometimes, M. pneumoniae infection can degenerate into pneumonia, a more serious infection of the lungs, which can even require treatment or care in a hospital. However, not everyone who is exposed to M. pneumoniae develops pneumonia.
The program is used to perform a statistical analysis of the genome of one of the strains of the M. pneumoniae. The sequence (NZ_CP014267.1) in Fasta format was obtained from the GenBank.
The output of the program gives the following nucleotide composition and the statistics of the sixe most used codons.

The complete histogram of the codon usage is also calculated and reported in Figure 7.

In Figure 8 and 9, the DNA skew, calculated using $nw=10k and $step=5k, and the cumulative DNA skew plots are also reported.


See Himmelreich, R., et al. Nucleic Acids Res. 1996, Vol. 24, 4420–4449 for the details about the analysis of the genome of Mycoplasma pneumoniae, and also A. Grigoriev. Analyzing genomes with cumulative skew diagrams Nucleic Acids Research, 1998, Vol. 26(10), 2286–2290.
Other internet resources
Here some internet resources for the GD library:
APPENDIX
</pre> #! /usr/bin/env perl ############################################################## # # GENOME ANALYSIS PROGRAM # # - CALCULATE THE NUCLEOTIDE COMPOSITION WITH THE HISTOGRAM # - CALCULATE ANS SAVE THE GC and TA SKEWNESS CURVES # # AUTHOR: Danilo ROCCATANO # DATE: 2009 # VERSION: 1.0 ##################################################################### use GD::Graph::lines; use GD::Graph::bars; use Data::Dumper; ############################################## # Define the variables ############################################## # input file name (fasta format) my $ifile; # Define the window and step size for the DNA skew calculation my $wl = 10000; # Window size my $st = 5000; # Step size # DNA walk print frequency my $freq = 100; ############################################## # nucleotide counter variables my $ta=0; my $tc=0; my $tg=0; my $tt=0; # check the arguments of the command line $num_args = $#ARGV + 1; if ($num_args != 1) { print "\nUsage: GenPars.pl file.fasta\n"; exit; } # Assign the file name from the command line argument $ifile=$ARGV[0]; # Load the input file $seq=loadfile($ifile); # Calculate the nuclelotide statistics my ($ll,$x,$cgsk,$ctsk)=countBases($seq,$freq); # Calculate the DNA skew my ($nd,$xx,$y1,$y2)=skewCalc($wl,$st,$seq); # # Print the averaged data # print "\033[2J"; # Clear the screen. print "\033[0;0H"; # Move to 0,0 print " NUCLEOTIDE COMPOSITION SUMMARY\n"; print "---------------------------------------------------------------\n"; print "Total Number of bases = $ll\n"; $per=$ta/$ll*100; printf "N. of ADENINES = %8d (%5.1f%):",$ta,$per; printHisto($per); $per=$tt/$ll*100; printf "N. of TYMINES = %8d (%5.1f%):",$tt,$per; printHisto($per); $per=$tc/$ll*100; printf "N. of CYTOSINES = %8d (%5.1f%):",$tc,$per; printHisto($per); $per=$tg/$ll*100; printf "N. of GUANINES = %8d (%5.1f%):",$tg,$tg/$ll*100; printHisto($per); $per=($tg+$tc)/$ll*100; printf "N. of C+G = %8d (%5.1f%):",$tc+$tg,($tc+$tg)/$ll*100; printHisto($per); # # Codon Usage # my ($ncod)=CodonsStat ($seq); # # Save the DNA skewness curves # print "Saving the skewness data in the file: skewdata.dat\n"; saveSkewData($nd,$xx,$y1,$y2); # # Plot the skew curves on a png file # push(@data,$xx); push(@data,$y1); push(@data,$y2); my $ofile="Skew.png"; print "then the skewness graphs in the file: $ofile\n"; GDgraph(\@data,$ofile); # # Plot the cumulative skew curves on a png file # #print Dumper $cgsk; my $ofile="cumulskew.png"; push(@dat,$x); push(@dat,$cgsk); push(@dat,$ctsk); print "and the cumulative skewness graphs in the file: $ofile\n"; GDgraph(\@dat,$ofile); exit; #================================================================ # # SUBROUTINES # #================================================================ # # Print the curve on a PNG file using the GD library # sub GDgraph { my ($dat,$ofile)= @_; my $mygraph = GD::Graph::lines->new(600, 300); $mygraph->set( x_label => 'Nucleotide N.', y_label => 'Skew', title => 'DNA Skew Curve', # Draw datasets in 'solid', 'dashed' and 'dotted-dashed' lines line_types => 1, # Set the thickness of line line_width => 2, # Set colors for datasets dclrs => ['black','yellow'], # set the number of the ticks displayed on the x axis x_tick_number => 10, ) or warn $mygraph->error; $mygraph->set_legend_font(GD::gdMediumBoldFont); $mygraph->set_legend('GC skew','TA skew'); my $myimage = $mygraph->plot(\@$dat) or die $mygraph->error; # Open a file for writing # Image printed in PNG file format open(PICTURE, '>',$ofile) or die("Cannot open file for writing"); # Make sure we are writing to a binary stream binmode PICTURE; # Convert the image to PNG and print it to the file PICTURE print PICTURE $myimage->png; close PICTURE; } ################################################################ # Calculate the sequence statistics and skewness ################################################################ sub skewCalc { my ($wl,$st,$seq)=@_; my $ll = length($seq); my $nw = $ll / ($wl-$st); my $pos=0; for ($j=1;$j<$nw;$j += 1) { $task=0; $gcsk=0; $a=0; $c=0; $g=0; $t=0; $pos+=$st; $id=$pos; for ($i = 0; $i < $wl; $i += 1) { $cc = substr($seq, $id, 1); if ($cc eq 'A') {$a+=1}; if ($cc eq 'C') {$c+=1}; if ($cc eq 'G') {$g+=1}; if ($cc eq 'T') {$t+=1}; # Calculate the GC DNA skewness as (G-C)/(G+C) if ($g+$c ne 0) {$gcsk+=(($g-$c)/($g+$c));} # Calculate the TA DNA skewness as (A-T)/(A+T) if ($t+$a ne 0) {$task+=(($a-$t)/($a+$t));} $id++; } push(\@xx,$pos); push(\@y1,$gcsk/$wl); push(\@y2,$task/$wl); } return ($nw,\@xx,\@y1,\@y2); } ################################################################# # Count the nucleotide in the input sequence # Calculate the cumulative skew curves ################################################################# sub countBases() { my $cg=0,$ct=0; my($dna)=@_; $l=length($dna); # # Count the nucleotides # for ($i=1;$i < $l;$i++) { $d = substr($dna,$i,1); if ($d eq "A") {$ta++;$ct--} elsif ($d eq "G") { $tg++;$cg++; } elsif ($d eq "C") { $tc++;$cg-- } elsif ($d eq "T") { $tt++;$ct++ } # # save the values of the DNA walk every $freq # if ($i%$freq eq 0) { push(\@x,$i); push(\@cgsk,$cg/$l); push(\@ctsk,$ct/$l); } } return ($l,\@x,\@cgsk,\@ctsk); } ################################################################# # Print the histogram of the % of occurrence of the nucleotides ################################################################# sub printHisto { my ($pe) = @_; for ($k=1;$k<=int($pe/10+0.5);$k++) { printf "#"; } printf "\n"; } ############################################################## # Read the sequence in FASTA format ############################################################## sub loadfile { my ($ifile)= @_; # open the file containing the sequence in FASTA format unless (open(FILE,$ifile)) { print "Could not open the file. \n"; exit; } # # Read the file using an array storing each line of the FASTA file in # each element of the array # $dna=; chomp($dna); if (substr($dna,0,1) eq ">") { #print "This is a FASTA file. Check the output.seq file \n"; @dna=; } $l1=@dna; for ($i=0;$iskewdata.dat") || die "cant open file..."; my ($nd,$x, $y1, $y2) = @_; for ($i=0;$i<$nd-1;$i += 1) { print SKFILE "@$x[$i] @$y1[$i] @$y2[$i]\n"; } close SKFILE; } sub CodonsStat () { # # Indexing the codons # $ri{'TCA'} = 1; $ri{'TCC'} = 2; $ri{'TCG'} = 3; $ri{'TCT'} = 4; $ri{'TTC'} = 5; $ri{'TTT'} = 6; $ri{'TTA'} = 7; $ri{'TTG'} = 8; $ri{'TAC'} = 9; $ri{'TAT'} = 10; $ri{'TAA'} = 11; $ri{'TAG'} = 12; $ri{'TGC'} = 13; $ri{'TGT'} = 14; $ri{'TGA'} = 15; $ri{'TGG'} = 16; $ri{'CTA'} = 17; $ri{'CTC'} = 18; $ri{'CTG'} = 19; $ri{'CTT'} = 20; $ri{'CCA'} = 21; $ri{'CCC'} = 22; $ri{'CCG'} = 23; $ri{'CCT'} = 24; $ri{'CAC'} = 25; $ri{'CAT'} = 26; $ri{'CAA'} = 27; $ri{'CAG'} = 28; $ri{'CGA'} = 29; $ri{'CGC'} = 30; $ri{'CGG'} = 31; $ri{'CGT'} = 32; $ri{'ATA'} = 33; $ri{'ATC'} = 34; $ri{'ATT'} = 35; $ri{'ATG'} = 36; $ri{'ACA'} = 37; $ri{'ACC'} = 38; $ri{'ACG'} = 39; $ri{'ACT'} = 40; $ri{'AAC'} = 41; $ri{'AAT'} = 42; $ri{'AAA'} = 43; $ri{'AAG'} = 44; $ri{'AGC'} = 45; $ri{'AGT'} = 46; $ri{'AGA'} = 47; $ri{'AGG'} = 48; $ri{'GTA'} = 49; $ri{'GTC'} = 50; $ri{'GTG'} = 51; $ri{'GTT'} = 52; $ri{'GCA'} = 53; $ri{'GCC'} = 54; $ri{'GCG'} = 55; $ri{'GCT'} = 56; $ri{'GAC'} = 57; $ri{'GAT'} = 58; $ri{'GAA'} = 59; $ri{'GAG'} = 60; $ri{'GGA'} = 61; $ri{'GGC'} = 62; $ri{'GGG'} = 63; $ri{'GGT'} = 64; $si[1] = "TCA"; $si[2] = "TCC"; $si[3] = "TCG"; $si[4] = "TCT"; $si[5] = "TTC"; $si[6] = "TTT"; $si[7] = "TTA"; $si[8] = "TTG"; $si[9] = "TAC"; $si[10] = "TAT"; $si[11] = "TAA"; $si[12] = "TAG"; $si[13] = "TGC"; $si[14] = "TGT"; $si[15] = "TGA"; $si[16] = "TGG"; $si[17] = "CTA"; $si[18] = "CTC"; $si[19] = "CTG"; $si[20] = "CTT"; $si[21] = "CCA"; $si[22] = "CCC"; $si[23] = "CCG"; $si[24] = "CCT"; $si[25] = "CAC"; $si[26] = "CAT"; $si[27] = "CAA"; $si[28] = "CAG"; $si[29] = "CGA"; $si[30] = "CGC"; $si[31] = "CGG"; $si[32] = "CGT"; $si[33] = "ATA"; $si[34] = "ATC"; $si[35] = "ATT"; $si[36] = "ATG"; $si[37] = "ACA"; $si[38] = "ACC"; $si[39] = "ACG"; $si[40] = "ACT"; $si[41] = "AAC"; $si[42] = "AAT"; $si[43] = "AAA"; $si[44] = "AAG"; $si[45] = "AGC"; $si[46] = "AGT"; $si[47] = "AGA"; $si[48] = "AGG"; $si[49] = "GTA"; $si[50] = "GTC"; $si[51] = "GTG"; $si[52] = "GTT"; $si[53] = "GCA"; $si[54] = "GCC"; $si[55] = "GCG"; $si[56] = "GCT"; $si[57] = "GAC"; $si[58] = "GAT"; $si[59] = "GAA"; $si[60] = "GAG"; $si[61] = "GGA"; $si[62] = "GGC"; $si[63] = "GGG"; $si[64] = "GGT"; my ($seq)=@_; my $ncod=0; $len= length $seq; @stats=(); #Calculates the statistics of specified codons for ($k=0;$k<=$len ;$k+=3) { $cod= substr($seq, $k, 3); #TAG ("amber") #TAA ("ochre") #TGA ("opal" or "umber") if ($cod eq 'TAG' or $cod eq 'TAA' or $cod eq 'TGA') {$sco++} $stat[$ri{$cod}]++; $ncod++; } for ($k=1;$k<=64;$k++) { $per=100.*$stat[$k]/$ncod; push(@tstat,$per); push(@sym,$si[$k]); push(@test,$si[$k],$per); } print "---------------------------------------------------------------\n"; print " CODON COMPOSITION SUMMARY\n"; printf "TOTAL NUMBER OF CODONS = %8d \n",$ncod; print "CODON ABOUNDANCE \n"; # # Make the codon usage histogram # @data1 =(); push(@data1,\@sym); # push x labels into plot data push(@data1,\@tstat); # # Sort the hash list # $cc=0; %codo=@test; foreach my $name ((sort { $codo{$b} $codo{$a} } keys %codo)) { printf "%-8s: %5.1f\n", $name, 0+$codo{$name} if ($cc new(800, 300); $my_graph2->set( x_label => 'Codons', y_label => 'Percentage', title => '', cumulate => 2, bar_spacing => 1, # shadow_depth => 4, dclrs => [ qw(red black green blue) ], transparent => 0, ); $my_graph2->set('x_labels_vertical'=> 1); my $myimage = $my_graph2->plot(\@data1) or die $my_graph2->error; open(PICTURE, '>codon.png') or die("Cannot open file for writing"); # Make sure we are writing to a binary stream binmode PICTURE; # Convert the image to PNG and print it to the file PICTURE print PICTURE $myimage->png; close PICTURE; return $ncod; }