PERL Programming II: Applications to Bioinformatics

This article is the second part of my previous introduction to the PERL language. Here, I am going to show the use of the Perl language in simple bioinformatics applications. I will introduce by examples other aspects of this powerful language.

PROJECT 1: GENOME SEQUENCE ANALYSIS

The following program calculates the DNA statistics by performing the following analysis:

  • A,G,C,T bases
  • C+G bases
  • (G-C)/(G+C) and (T-A)/(T+A) skews
  • Cumulative DNA skews (or DNA walks)
  • Codon statistics

The DNA skew plot is used to identify the origin and terminus of the replication fork (see for example this Wikipedia page for more details). The plot is calculated using running window method with windows of size $nw, and steps of size $step, see Figure 1.

SkewSlidingW

Figure 1: The sliding window method for the calculation of the DNA skew curves. The analysis is performed by calculating the DNA skew values along the sequence using a window of size $nw in step of size $step.

The program in PERL used for the analysis is reported in the Appendix of this article.

Examples

1. Analysis of a bacterial genome of Lyme borreliosis disease vector

The vectors of Lyme-disease are the ticks. In Europe, Lyme disease is caused by infection with one or more pathogenic European genospecies of the spirochaete Borrelia burgdorferi and garinii, mainly transmitted by the tick Ixodes ricinus.  In many cases, the bite is manifested with the formation of an erythema migans but in other cases, it can be asymptomatic. In case of a tick bite occurring in an area with high risk of Lyme disease infection, It is important to consult a doctor for a screening since Lyme disease can be cured if it is taken in time.

The program is used to perform a statistical analysis of the Borrelia burgdorferi chromosome. The chromosome sequence (AE000783) was obtained from the GenBank.

In Figure 2, the output of the program gives the nucleotide composition and the percentage of the first 6 most abundant codons.

 

BorreliaStat

Figure 2: Output of the analysis of the Borrelia genome.

 

The complete histogram of the codon usage is also calculated and reported in Figure 3.

 

codon

Figure 3: Codon usage in Borrelia chromosome.

 

The GC skew plots in picture.png show that for this genome the origin and the terminus the DNA replication forks coincide. The plots have been calculated using $nw=10k and $step=5k.

 

BOrreliaSkewplots

Figure 4: GA and TA Skew curves of Borrelia chromosome.

 

Finally, the cumulative skew curves are reported in Figure 4.

 

cumulskew

Figure 5: Cumulative GA and TA Skew curves of Borrelia chromosome.

 

See  Fraser et al. NATURE, Vol: 390, 580,  (1997) for the details about the analysis of the genome of Borrelia burgdorferi.

2. Analysis of the genome of tracheobronchitis causing bacterium

Mycoplasma pneumoniae is a bacterium that commonly causes infections of the respiratory system. Especially in children, that cause the development of tracheobronchitis, commonly called a chest cold. Sometimes, M. pneumoniae infection can degenerate into pneumonia, a more serious infection of the lungs, which can even require treatment or care in a hospital. However, not everyone who is exposed to M. pneumoniae develops pneumonia.

The program is used to perform a statistical analysis of the genome of one of the strains of the M. pneumoniae. The sequence (NZ_CP014267.1) in Fasta format was obtained from the GenBank.

The output of the program gives the following nucleotide composition and the statistics of the sixe most used codons.

 

MycoplasmaStat

Figure 6: Output of the analysis of the Mycoplasma genome.

 

The complete histogram of the codon usage is also calculated and reported in Figure 7.

 

codon

Figure 7: Codon usage in Mycoplasma.

 

 

In Figure 8 and 9, the DNA skew, calculated using $nw=10k and $step=5k, and the cumulative DNA skew plots are also reported.

Skew

Figure 8: GA and TA skew curves of Mycoplasma chromosome.

 

 

 

cumulskew

Figure 9: Cumulative GA and TA Skew curves of Mycoplasma genome.

 

See Himmelreich, R., et al. Nucleic Acids Res. 1996, Vol. 24, 4420–4449 for the details about the analysis of the genome of  Mycoplasma pneumoniae, and also A. Grigoriev. Analyzing genomes with cumulative skew diagrams Nucleic Acids Research, 1998, Vol. 26(10), 2286–2290.

Other internet resources

Here some internet resources for the GD library:

APPENDIX

 

</pre>
#! /usr/bin/env perl

##############################################################
#
# GENOME ANALYSIS PROGRAM
#
# - CALCULATE THE NUCLEOTIDE COMPOSITION WITH THE HISTOGRAM
# - CALCULATE ANS SAVE THE GC and TA SKEWNESS CURVES
#
# AUTHOR: Danilo ROCCATANO
# DATE: 2009
# VERSION: 1.0
#####################################################################

use GD::Graph::lines;
use GD::Graph::bars;
use Data::Dumper;

##############################################
# Define the variables
##############################################

&nbsp;

# input file name (fasta format)
my $ifile;
# Define the window and step size for the DNA skew calculation
my $wl = 10000; # Window size
my $st = 5000; # Step size
# DNA walk print frequency
my $freq = 100;
##############################################

# nucleotide counter variables
my $ta=0; my $tc=0; my $tg=0; my $tt=0;

# check the arguments of the command line
$num_args = $#ARGV + 1;
if ($num_args != 1) {
print "\nUsage: GenPars.pl file.fasta\n";
exit;
}
# Assign the file name from the command line argument
$ifile=$ARGV[0];

# Load the input file
$seq=loadfile($ifile);
# Calculate the nuclelotide statistics
my ($ll,$x,$cgsk,$ctsk)=countBases($seq,$freq);
# Calculate the DNA skew
my ($nd,$xx,$y1,$y2)=skewCalc($wl,$st,$seq);

#
# Print the averaged data
#

print "\033[2J"; # Clear the screen.
print "\033[0;0H"; # Move to 0,0
print " NUCLEOTIDE COMPOSITION SUMMARY\n";
print "---------------------------------------------------------------\n";
print "Total Number of bases = $ll\n";
$per=$ta/$ll*100;
printf "N. of ADENINES = %8d (%5.1f%):",$ta,$per;
printHisto($per);
$per=$tt/$ll*100;
printf "N. of TYMINES = %8d (%5.1f%):",$tt,$per;
printHisto($per);
$per=$tc/$ll*100;
printf "N. of CYTOSINES = %8d (%5.1f%):",$tc,$per;
printHisto($per);
$per=$tg/$ll*100;
printf "N. of GUANINES = %8d (%5.1f%):",$tg,$tg/$ll*100;
printHisto($per);
$per=($tg+$tc)/$ll*100;
printf "N. of C+G = %8d (%5.1f%):",$tc+$tg,($tc+$tg)/$ll*100;
printHisto($per);

#
# Codon Usage
#

my ($ncod)=CodonsStat ($seq);
#
# Save the DNA skewness curves
#

print "Saving the skewness data in the file: skewdata.dat\n";
saveSkewData($nd,$xx,$y1,$y2);

&nbsp;

#
# Plot the skew curves on a png file
#
push(@data,$xx);
push(@data,$y1);
push(@data,$y2);
my $ofile="Skew.png";
print "then the skewness graphs in the file: $ofile\n";
GDgraph(\@data,$ofile);

#
# Plot the cumulative skew curves on a png file
#
#print Dumper $cgsk;
my $ofile="cumulskew.png";
push(@dat,$x);
push(@dat,$cgsk);
push(@dat,$ctsk);
print "and the cumulative skewness graphs in the file: $ofile\n";

GDgraph(\@dat,$ofile);

&nbsp;

exit;
#================================================================
#
# SUBROUTINES
#
#================================================================

#
# Print the curve on a PNG file using the GD library
#

sub GDgraph {

my ($dat,$ofile)= @_;
my $mygraph = GD::Graph::lines-&gt;new(600, 300);

$mygraph-&gt;set(
x_label =&gt; 'Nucleotide N.',
y_label =&gt; 'Skew',
title =&gt; 'DNA Skew Curve',

# Draw datasets in 'solid', 'dashed' and 'dotted-dashed' lines

line_types =&gt; 1,

# Set the thickness of line

line_width =&gt; 2,

# Set colors for datasets

dclrs =&gt; ['black','yellow'],

# set the number of the ticks displayed on the x axis
x_tick_number =&gt; 10,
) or warn $mygraph-&gt;error;

$mygraph-&gt;set_legend_font(GD::gdMediumBoldFont);
$mygraph-&gt;set_legend('GC skew','TA skew');

my $myimage = $mygraph-&gt;plot(\@$dat) or die $mygraph-&gt;error;

# Open a file for writing
# Image printed in PNG file format
open(PICTURE, '&gt;',$ofile) or die("Cannot open file for writing");

# Make sure we are writing to a binary stream
binmode PICTURE;

# Convert the image to PNG and print it to the file PICTURE
print PICTURE $myimage-&gt;png;
close PICTURE;

}

################################################################
# Calculate the sequence statistics and skewness
################################################################

sub skewCalc {

my ($wl,$st,$seq)=@_;
my $ll = length($seq);
my $nw = $ll / ($wl-$st);
my $pos=0;

for ($j=1;$j&lt;$nw;$j += 1) {
$task=0;
$gcsk=0;
$a=0;
$c=0;
$g=0;
$t=0;
$pos+=$st;
$id=$pos;
for ($i = 0; $i &lt; $wl; $i += 1) {
$cc = substr($seq, $id, 1);
if ($cc eq &#039;A&#039;) {$a+=1};
if ($cc eq &#039;C&#039;) {$c+=1};
if ($cc eq &#039;G&#039;) {$g+=1};
if ($cc eq &#039;T&#039;) {$t+=1};

# Calculate the GC DNA skewness as (G-C)/(G+C)
if ($g+$c ne 0) {$gcsk+=(($g-$c)/($g+$c));}
# Calculate the TA DNA skewness as (A-T)/(A+T)
if ($t+$a ne 0) {$task+=(($a-$t)/($a+$t));}
$id++;
}
push(\@xx,$pos);
push(\@y1,$gcsk/$wl);
push(\@y2,$task/$wl);
}
return ($nw,\@xx,\@y1,\@y2);
}

#################################################################
# Count the nucleotide in the input sequence
# Calculate the cumulative skew curves
#################################################################
sub countBases() {
my $cg=0,$ct=0;

my($dna)=@_;
$l=length($dna);
#
# Count the nucleotides
#
for ($i=1;$i &lt; $l;$i++)
{
$d = substr($dna,$i,1);

if ($d eq &quot;A&quot;) {$ta++;$ct--}
elsif ($d eq &quot;G&quot;) { $tg++;$cg++; }
elsif ($d eq &quot;C&quot;) { $tc++;$cg-- }
elsif ($d eq &quot;T&quot;) { $tt++;$ct++ }
#
# save the values of the DNA walk every $freq
#
if ($i%$freq eq 0) {
push(\@x,$i);
push(\@cgsk,$cg/$l);
push(\@ctsk,$ct/$l);
}
}
return ($l,\@x,\@cgsk,\@ctsk);
}

#################################################################
# Print the histogram of the % of occurrence of the nucleotides
#################################################################

sub printHisto {
my ($pe) = @_;
for ($k=1;$k&lt;=int($pe/10+0.5);$k++) {
printf &quot;#&quot;;
}
printf &quot;\n&quot;;
}

##############################################################
# Read the sequence in FASTA format
##############################################################

sub loadfile {

my ($ifile)= @_;

# open the file containing the sequence in FASTA format

unless (open(FILE,$ifile)) {
print &quot;Could not open the file. \n&quot;;
exit;
}

#
# Read the file using an array storing each line of the FASTA file in
# each element of the array
#
$dna=;
chomp($dna);
if (substr($dna,0,1) eq "&gt;") {
#print "This is a FASTA file. Check the output.seq file \n";
@dna=;
}

$l1=@dna;
for ($i=0;$iskewdata.dat") || die "cant open file...";
my ($nd,$x, $y1, $y2) = @_;

for ($i=0;$i&lt;$nd-1;$i += 1) {
print SKFILE &quot;@$x[$i] @$y1[$i] @$y2[$i]\n&quot;;
}
close SKFILE;
}

sub CodonsStat () {

#
# Indexing the codons
#

$ri{&#039;TCA&#039;} = 1;
$ri{&#039;TCC&#039;} = 2;
$ri{&#039;TCG&#039;} = 3;
$ri{&#039;TCT&#039;} = 4;
$ri{&#039;TTC&#039;} = 5;
$ri{&#039;TTT&#039;} = 6;
$ri{&#039;TTA&#039;} = 7;
$ri{&#039;TTG&#039;} = 8;
$ri{&#039;TAC&#039;} = 9;
$ri{&#039;TAT&#039;} = 10;
$ri{&#039;TAA&#039;} = 11;
$ri{&#039;TAG&#039;} = 12;
$ri{&#039;TGC&#039;} = 13;
$ri{&#039;TGT&#039;} = 14;
$ri{&#039;TGA&#039;} = 15;
$ri{&#039;TGG&#039;} = 16;
$ri{&#039;CTA&#039;} = 17;
$ri{&#039;CTC&#039;} = 18;
$ri{&#039;CTG&#039;} = 19;
$ri{&#039;CTT&#039;} = 20;
$ri{&#039;CCA&#039;} = 21;
$ri{&#039;CCC&#039;} = 22;
$ri{&#039;CCG&#039;} = 23;
$ri{&#039;CCT&#039;} = 24;
$ri{&#039;CAC&#039;} = 25;
$ri{&#039;CAT&#039;} = 26;
$ri{&#039;CAA&#039;} = 27;
$ri{&#039;CAG&#039;} = 28;
$ri{&#039;CGA&#039;} = 29;
$ri{&#039;CGC&#039;} = 30;
$ri{&#039;CGG&#039;} = 31;
$ri{&#039;CGT&#039;} = 32;
$ri{&#039;ATA&#039;} = 33;
$ri{&#039;ATC&#039;} = 34;
$ri{&#039;ATT&#039;} = 35;
$ri{&#039;ATG&#039;} = 36;
$ri{&#039;ACA&#039;} = 37;
$ri{&#039;ACC&#039;} = 38;
$ri{&#039;ACG&#039;} = 39;
$ri{&#039;ACT&#039;} = 40;
$ri{&#039;AAC&#039;} = 41;
$ri{&#039;AAT&#039;} = 42;
$ri{&#039;AAA&#039;} = 43;
$ri{&#039;AAG&#039;} = 44;
$ri{&#039;AGC&#039;} = 45;
$ri{&#039;AGT&#039;} = 46;
$ri{&#039;AGA&#039;} = 47;
$ri{&#039;AGG&#039;} = 48;
$ri{&#039;GTA&#039;} = 49;
$ri{&#039;GTC&#039;} = 50;
$ri{&#039;GTG&#039;} = 51;
$ri{&#039;GTT&#039;} = 52;
$ri{&#039;GCA&#039;} = 53;
$ri{&#039;GCC&#039;} = 54;
$ri{&#039;GCG&#039;} = 55;
$ri{&#039;GCT&#039;} = 56;
$ri{&#039;GAC&#039;} = 57;
$ri{&#039;GAT&#039;} = 58;
$ri{&#039;GAA&#039;} = 59;
$ri{&#039;GAG&#039;} = 60;
$ri{&#039;GGA&#039;} = 61;
$ri{&#039;GGC&#039;} = 62;
$ri{&#039;GGG&#039;} = 63;
$ri{&#039;GGT&#039;} = 64;
$si[1] = &quot;TCA&quot;;
$si[2] = &quot;TCC&quot;;
$si[3] = &quot;TCG&quot;;
$si[4] = &quot;TCT&quot;;
$si[5] = &quot;TTC&quot;;
$si[6] = &quot;TTT&quot;;
$si[7] = &quot;TTA&quot;;
$si[8] = &quot;TTG&quot;;
$si[9] = &quot;TAC&quot;;
$si[10] = &quot;TAT&quot;;
$si[11] = &quot;TAA&quot;;
$si[12] = &quot;TAG&quot;;
$si[13] = &quot;TGC&quot;;
$si[14] = &quot;TGT&quot;;
$si[15] = &quot;TGA&quot;;
$si[16] = &quot;TGG&quot;;
$si[17] = &quot;CTA&quot;;
$si[18] = &quot;CTC&quot;;
$si[19] = &quot;CTG&quot;;
$si[20] = &quot;CTT&quot;;
$si[21] = &quot;CCA&quot;;
$si[22] = &quot;CCC&quot;;
$si[23] = &quot;CCG&quot;;
$si[24] = &quot;CCT&quot;;
$si[25] = &quot;CAC&quot;;
$si[26] = &quot;CAT&quot;;
$si[27] = &quot;CAA&quot;;
$si[28] = &quot;CAG&quot;;
$si[29] = &quot;CGA&quot;;
$si[30] = &quot;CGC&quot;;
$si[31] = &quot;CGG&quot;;
$si[32] = &quot;CGT&quot;;
$si[33] = &quot;ATA&quot;;
$si[34] = &quot;ATC&quot;;
$si[35] = &quot;ATT&quot;;
$si[36] = &quot;ATG&quot;;
$si[37] = &quot;ACA&quot;;
$si[38] = &quot;ACC&quot;;
$si[39] = &quot;ACG&quot;;
$si[40] = &quot;ACT&quot;;
$si[41] = &quot;AAC&quot;;
$si[42] = &quot;AAT&quot;;
$si[43] = &quot;AAA&quot;;
$si[44] = &quot;AAG&quot;;
$si[45] = &quot;AGC&quot;;
$si[46] = &quot;AGT&quot;;
$si[47] = &quot;AGA&quot;;
$si[48] = &quot;AGG&quot;;
$si[49] = &quot;GTA&quot;;
$si[50] = &quot;GTC&quot;;
$si[51] = &quot;GTG&quot;;
$si[52] = &quot;GTT&quot;;
$si[53] = &quot;GCA&quot;;
$si[54] = &quot;GCC&quot;;
$si[55] = &quot;GCG&quot;;
$si[56] = &quot;GCT&quot;;
$si[57] = &quot;GAC&quot;;
$si[58] = &quot;GAT&quot;;
$si[59] = &quot;GAA&quot;;
$si[60] = &quot;GAG&quot;;
$si[61] = &quot;GGA&quot;;
$si[62] = &quot;GGC&quot;;
$si[63] = &quot;GGG&quot;;
$si[64] = &quot;GGT&quot;;

my ($seq)=@_;

my $ncod=0;

$len= length $seq;
@stats=();

#Calculates the statistics of specified codons

for ($k=0;$k&lt;=$len ;$k+=3) {
$cod= substr($seq, $k, 3);
#TAG (&quot;amber&quot;)
#TAA (&quot;ochre&quot;)
#TGA (&quot;opal&quot; or &quot;umber&quot;)
if ($cod eq &#039;TAG&#039; or $cod eq &#039;TAA&#039; or $cod eq &#039;TGA&#039;) {$sco++}
$stat[$ri{$cod}]++;
$ncod++;
}
for ($k=1;$k&lt;=64;$k++) {
$per=100.*$stat[$k]/$ncod;
push(@tstat,$per);
push(@sym,$si[$k]);
push(@test,$si[$k],$per);
}

print &quot;---------------------------------------------------------------\n&quot;;
print &quot; CODON COMPOSITION SUMMARY\n&quot;;
printf &quot;TOTAL NUMBER OF CODONS = %8d \n&quot;,$ncod;
print &quot;CODON ABOUNDANCE \n&quot;;

#
# Make the codon usage histogram
#

@data1 =();
push(@data1,\@sym); # push x labels into plot data
push(@data1,\@tstat);

#
# Sort the hash list
#
$cc=0;
%codo=@test;
foreach my $name ((sort { $codo{$b}  $codo{$a} } keys %codo)) {
printf "%-8s: %5.1f\n", $name, 0+$codo{$name} if ($cc new(800, 300);

$my_graph2-&gt;set(
x_label =&gt; 'Codons',
y_label =&gt; 'Percentage',
title =&gt; '',
cumulate =&gt; 2,
bar_spacing =&gt; 1,
# shadow_depth =&gt; 4,
dclrs =&gt; [ qw(red black green blue) ],
transparent =&gt; 0,
);
$my_graph2-&gt;set('x_labels_vertical'=&gt; 1);
my $myimage = $my_graph2-&gt;plot(\@data1) or die $my_graph2-&gt;error;

open(PICTURE, '&gt;codon.png') or die("Cannot open file for writing");

# Make sure we are writing to a binary stream
binmode PICTURE;

# Convert the image to PNG and print it to the file PICTURE
print PICTURE $myimage-&gt;png;
close PICTURE;

return $ncod;

}

NOTE: I disclose from any responsibility for possible errors in the reported program but I encourage to notify the problem so that I can correct it!

 

About Danilo Roccatano

I have a Doctorate in chemistry at the University of Roma “La Sapienza”. I led educational and research activities at different universities in Italy, The Netherlands, Germany and now in the UK. I am fascinated by the study of nature with theoretical models and computational. For years, my scientific research is focused on the study of molecular systems of biological interest using the technique of Molecular Dynamics simulation. I have developed a server (the link is in one of my post) for statistical analysis at the amino acid level of the effect of random mutations induced by random mutagenesis methods. I am also very active in the didactic activity in physical chemistry, computational chemistry, and molecular modeling. I have several other interests and hobbies as video/photography, robotics, computer vision, electronics, programming, microscopy, entomology, recreational mathematics and computational linguistics.
This entry was posted in Programming. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s