PERL Programming II: Applications to Bioinformatics |

This article is the second part of my previous introduction to the PERL language. Here, I am going to show the use of the Perl language in simple bioinformatics applications. I will introduce by examples other aspects of this powerful language.

PROJECT 1: GENOME SEQUENCE ANALYSIS

The following program calculates the DNA statistics by performing the following analysis:

A,G,C,T bases
C+G bases
(G-C)/(G+C) and (T-A)/(T+A) skews
Cumulative DNA skews (or DNA walks)
Codon statistics

The DNA skew plot is used to identify the origin and terminus of the replication fork (see for example this Wikipedia page for more details). The plot is calculated using running window method with windows of size $nw, and steps of size $step, see Figure 1.

SkewSlidingW — Figure 1: The sliding window method for the calculation of the DNA skew curves. The analysis is performed by calculating the DNA skew values along the sequence using a window of size $nw in step of size $step.

The program in PERL used for the analysis is reported in the Appendix of this article.

Examples

1. Analysis of a bacterial genome of Lyme borreliosis disease vector

The vectors of Lyme-disease are the ticks. In Europe, Lyme disease is caused by infection with one or more pathogenic European genospecies of the spirochaete Borrelia burgdorferi and garinii, mainly transmitted by the tick Ixodes ricinus. In many cases, the bite is manifested with the formation of an erythema migans but in other cases, it can be asymptomatic. In case of a tick bite occurring in an area with high risk of Lyme disease infection, It is important to consult a doctor for a screening since Lyme disease can be cured if it is taken in time.

The program is used to perform a statistical analysis of the Borrelia burgdorferi chromosome. The chromosome sequence (AE000783) was obtained from the GenBank.

In Figure 2, the output of the program gives the nucleotide composition and the percentage of the first 6 most abundant codons.

BorreliaStat — Figure 2: Output of the analysis of the Borrelia genome.

The complete histogram of the codon usage is also calculated and reported in Figure 3.

The GC skew plots in picture.png show that for this genome the origin and the terminus the DNA replication forks coincide. The plots have been calculated using $nw=10k and $step=5k.

BOrreliaSkewplots — Figure 4: GA and TA Skew curves of Borrelia chromosome.

Finally, the cumulative skew curves are reported in Figure 4.

cumulskew — Figure 5: Cumulative GA and TA Skew curves of Borrelia chromosome.

See Fraser et al. NATURE, Vol: 390, 580, (1997) for the details about the analysis of the genome of Borrelia burgdorferi.

2. Analysis of the genome of tracheobronchitis causing bacterium

Mycoplasma pneumoniae is a bacterium that commonly causes infections of the respiratory system. Especially in children, that cause the development of tracheobronchitis, commonly called a chest cold. Sometimes, M. pneumoniae infection can degenerate into pneumonia, a more serious infection of the lungs, which can even require treatment or care in a hospital. However, not everyone who is exposed to M. pneumoniae develops pneumonia.

The program is used to perform a statistical analysis of the genome of one of the strains of the M. pneumoniae. The sequence (NZ_CP014267.1) in Fasta format was obtained from the GenBank.

The output of the program gives the following nucleotide composition and the statistics of the sixe most used codons.

MycoplasmaStat — Figure 6: Output of the analysis of the Mycoplasma genome.

The complete histogram of the codon usage is also calculated and reported in Figure 7.

In Figure 8 and 9, the DNA skew, calculated using $nw=10k and $step=5k, and the cumulative DNA skew plots are also reported.

See Himmelreich, R., et al. Nucleic Acids Res. 1996, Vol. 24, 4420–4449 for the details about the analysis of the genome of Mycoplasma pneumoniae, and also A. Grigoriev. Analyzing genomes with cumulative skew diagrams Nucleic Acids Research, 1998, Vol. 26(10), 2286–2290.

Other internet resources

Here some internet resources for the GD library:

APPENDIX

</pre>
#! /usr/bin/env perl

##############################################################
#
# GENOME ANALYSIS PROGRAM
#
# - CALCULATE THE NUCLEOTIDE COMPOSITION WITH THE HISTOGRAM
# - CALCULATE ANS SAVE THE GC and TA SKEWNESS CURVES
#
# AUTHOR: Danilo ROCCATANO
# DATE: 2009
# VERSION: 1.0
#####################################################################

use GD::Graph::lines;
use GD::Graph::bars;
use Data::Dumper;

##############################################
# Define the variables
##############################################

&nbsp;

# input file name (fasta format)
my $ifile;
# Define the window and step size for the DNA skew calculation
my $wl = 10000; # Window size
my $st = 5000; # Step size
# DNA walk print frequency
my $freq = 100;
##############################################

# nucleotide counter variables
my $ta=0; my $tc=0; my $tg=0; my $tt=0;

# check the arguments of the command line
$num_args = $#ARGV + 1;
if ($num_args != 1) {
print "\nUsage: GenPars.pl file.fasta\n";
exit;
}
# Assign the file name from the command line argument
$ifile=$ARGV[0];

# Load the input file
$seq=loadfile($ifile);
# Calculate the nuclelotide statistics
my ($ll,$x,$cgsk,$ctsk)=countBases($seq,$freq);
# Calculate the DNA skew
my ($nd,$xx,$y1,$y2)=skewCalc($wl,$st,$seq);

#
# Print the averaged data
#

print "\033[2J"; # Clear the screen.
print "\033[0;0H"; # Move to 0,0
print " NUCLEOTIDE COMPOSITION SUMMARY\n";
print "---------------------------------------------------------------\n";
print "Total Number of bases = $ll\n";
$per=$ta/$ll*100;
printf "N. of ADENINES = %8d (%5.1f%):",$ta,$per;
printHisto($per);
$per=$tt/$ll*100;
printf "N. of TYMINES = %8d (%5.1f%):",$tt,$per;
printHisto($per);
$per=$tc/$ll*100;
printf "N. of CYTOSINES = %8d (%5.1f%):",$tc,$per;
printHisto($per);
$per=$tg/$ll*100;
printf "N. of GUANINES = %8d (%5.1f%):",$tg,$tg/$ll*100;
printHisto($per);
$per=($tg+$tc)/$ll*100;
printf "N. of C+G = %8d (%5.1f%):",$tc+$tg,($tc+$tg)/$ll*100;
printHisto($per);

#
# Codon Usage
#

my ($ncod)=CodonsStat ($seq);
#
# Save the DNA skewness curves
#

print "Saving the skewness data in the file: skewdata.dat\n";
saveSkewData($nd,$xx,$y1,$y2);

&nbsp;

#
# Plot the skew curves on a png file
#
push(@data,$xx);
push(@data,$y1);
push(@data,$y2);
my $ofile="Skew.png";
print "then the skewness graphs in the file: $ofile\n";
GDgraph(\@data,$ofile);

#
# Plot the cumulative skew curves on a png file
#
#print Dumper $cgsk;
my $ofile="cumulskew.png";
push(@dat,$x);
push(@dat,$cgsk);
push(@dat,$ctsk);
print "and the cumulative skewness graphs in the file: $ofile\n";

GDgraph(\@dat,$ofile);

&nbsp;

exit;
#================================================================
#
# SUBROUTINES
#
#================================================================

#
# Print the curve on a PNG file using the GD library
#

sub GDgraph {

my ($dat,$ofile)= @_;
my $mygraph = GD::Graph::lines-&gt;new(600, 300);

$mygraph-&gt;set(
x_label =&gt; 'Nucleotide N.',
y_label =&gt; 'Skew',
title =&gt; 'DNA Skew Curve',

# Draw datasets in 'solid', 'dashed' and 'dotted-dashed' lines

line_types =&gt; 1,

# Set the thickness of line

line_width =&gt; 2,

# Set colors for datasets

dclrs =&gt; ['black','yellow'],

# set the number of the ticks displayed on the x axis
x_tick_number =&gt; 10,
) or warn $mygraph-&gt;error;

$mygraph-&gt;set_legend_font(GD::gdMediumBoldFont);
$mygraph-&gt;set_legend('GC skew','TA skew');

my $myimage = $mygraph-&gt;plot(\@$dat) or die $mygraph-&gt;error;

# Open a file for writing
# Image printed in PNG file format
open(PICTURE, '&gt;',$ofile) or die("Cannot open file for writing");

# Make sure we are writing to a binary stream
binmode PICTURE;

# Convert the image to PNG and print it to the file PICTURE
print PICTURE $myimage-&gt;png;
close PICTURE;

}

################################################################
# Calculate the sequence statistics and skewness
################################################################

sub skewCalc {

my ($wl,$st,$seq)=@_;
my $ll = length($seq);
my $nw = $ll / ($wl-$st);
my $pos=0;

for ($j=1;$j&lt;$nw;$j += 1) {
$task=0;
$gcsk=0;
$a=0;
$c=0;
$g=0;
$t=0;
$pos+=$st;
$id=$pos;
for ($i = 0; $i &lt; $wl; $i += 1) {
$cc = substr($seq, $id, 1);
if ($cc eq &#039;A&#039;) {$a+=1};
if ($cc eq &#039;C&#039;) {$c+=1};
if ($cc eq &#039;G&#039;) {$g+=1};
if ($cc eq &#039;T&#039;) {$t+=1};

# Calculate the GC DNA skewness as (G-C)/(G+C)
if ($g+$c ne 0) {$gcsk+=(($g-$c)/($g+$c));}
# Calculate the TA DNA skewness as (A-T)/(A+T)
if ($t+$a ne 0) {$task+=(($a-$t)/($a+$t));}
$id++;
}
push(\@xx,$pos);
push(\@y1,$gcsk/$wl);
push(\@y2,$task/$wl);
}
return ($nw,\@xx,\@y1,\@y2);
}

#################################################################
# Count the nucleotide in the input sequence
# Calculate the cumulative skew curves
#################################################################
sub countBases() {
my $cg=0,$ct=0;

my($dna)=@_;
$l=length($dna);
#
# Count the nucleotides
#
for ($i=1;$i &lt; $l;$i++)
{
$d = substr($dna,$i,1);

if ($d eq &quot;A&quot;) {$ta++;$ct--}
elsif ($d eq &quot;G&quot;) { $tg++;$cg++; }
elsif ($d eq &quot;C&quot;) { $tc++;$cg-- }
elsif ($d eq &quot;T&quot;) { $tt++;$ct++ }
#
# save the values of the DNA walk every $freq
#
if ($i%$freq eq 0) {
push(\@x,$i);
push(\@cgsk,$cg/$l);
push(\@ctsk,$ct/$l);
}
}
return ($l,\@x,\@cgsk,\@ctsk);
}

#################################################################
# Print the histogram of the % of occurrence of the nucleotides
#################################################################

sub printHisto {
my ($pe) = @_;
for ($k=1;$k&lt;=int($pe/10+0.5);$k++) {
printf &quot;#&quot;;
}
printf &quot;\n&quot;;
}

##############################################################
# Read the sequence in FASTA format
##############################################################

sub loadfile {

my ($ifile)= @_;

# open the file containing the sequence in FASTA format

unless (open(FILE,$ifile)) {
print &quot;Could not open the file. \n&quot;;
exit;
}

#
# Read the file using an array storing each line of the FASTA file in
# each element of the array
#
$dna=;
chomp($dna);
if (substr($dna,0,1) eq "&gt;") {
#print "This is a FASTA file. Check the output.seq file \n";
@dna=;
}

$l1=@dna;
for ($i=0;$iskewdata.dat") || die "cant open file...";
my ($nd,$x, $y1, $y2) = @_;

for ($i=0;$i&lt;$nd-1;$i += 1) {
print SKFILE &quot;@$x[$i] @$y1[$i] @$y2[$i]\n&quot;;
}
close SKFILE;
}

sub CodonsStat () {

#
# Indexing the codons
#

$ri{&#039;TCA&#039;} = 1;
$ri{&#039;TCC&#039;} = 2;
$ri{&#039;TCG&#039;} = 3;
$ri{&#039;TCT&#039;} = 4;
$ri{&#039;TTC&#039;} = 5;
$ri{&#039;TTT&#039;} = 6;
$ri{&#039;TTA&#039;} = 7;
$ri{&#039;TTG&#039;} = 8;
$ri{&#039;TAC&#039;} = 9;
$ri{&#039;TAT&#039;} = 10;
$ri{&#039;TAA&#039;} = 11;
$ri{&#039;TAG&#039;} = 12;
$ri{&#039;TGC&#039;} = 13;
$ri{&#039;TGT&#039;} = 14;
$ri{&#039;TGA&#039;} = 15;
$ri{&#039;TGG&#039;} = 16;
$ri{&#039;CTA&#039;} = 17;
$ri{&#039;CTC&#039;} = 18;
$ri{&#039;CTG&#039;} = 19;
$ri{&#039;CTT&#039;} = 20;
$ri{&#039;CCA&#039;} = 21;
$ri{&#039;CCC&#039;} = 22;
$ri{&#039;CCG&#039;} = 23;
$ri{&#039;CCT&#039;} = 24;
$ri{&#039;CAC&#039;} = 25;
$ri{&#039;CAT&#039;} = 26;
$ri{&#039;CAA&#039;} = 27;
$ri{&#039;CAG&#039;} = 28;
$ri{&#039;CGA&#039;} = 29;
$ri{&#039;CGC&#039;} = 30;
$ri{&#039;CGG&#039;} = 31;
$ri{&#039;CGT&#039;} = 32;
$ri{&#039;ATA&#039;} = 33;
$ri{&#039;ATC&#039;} = 34;
$ri{&#039;ATT&#039;} = 35;
$ri{&#039;ATG&#039;} = 36;
$ri{&#039;ACA&#039;} = 37;
$ri{&#039;ACC&#039;} = 38;
$ri{&#039;ACG&#039;} = 39;
$ri{&#039;ACT&#039;} = 40;
$ri{&#039;AAC&#039;} = 41;
$ri{&#039;AAT&#039;} = 42;
$ri{&#039;AAA&#039;} = 43;
$ri{&#039;AAG&#039;} = 44;
$ri{&#039;AGC&#039;} = 45;
$ri{&#039;AGT&#039;} = 46;
$ri{&#039;AGA&#039;} = 47;
$ri{&#039;AGG&#039;} = 48;
$ri{&#039;GTA&#039;} = 49;
$ri{&#039;GTC&#039;} = 50;
$ri{&#039;GTG&#039;} = 51;
$ri{&#039;GTT&#039;} = 52;
$ri{&#039;GCA&#039;} = 53;
$ri{&#039;GCC&#039;} = 54;
$ri{&#039;GCG&#039;} = 55;
$ri{&#039;GCT&#039;} = 56;
$ri{&#039;GAC&#039;} = 57;
$ri{&#039;GAT&#039;} = 58;
$ri{&#039;GAA&#039;} = 59;
$ri{&#039;GAG&#039;} = 60;
$ri{&#039;GGA&#039;} = 61;
$ri{&#039;GGC&#039;} = 62;
$ri{&#039;GGG&#039;} = 63;
$ri{&#039;GGT&#039;} = 64;
$si[1] = &quot;TCA&quot;;
$si[2] = &quot;TCC&quot;;
$si[3] = &quot;TCG&quot;;
$si[4] = &quot;TCT&quot;;
$si[5] = &quot;TTC&quot;;
$si[6] = &quot;TTT&quot;;
$si[7] = &quot;TTA&quot;;
$si[8] = &quot;TTG&quot;;
$si[9] = &quot;TAC&quot;;
$si[10] = &quot;TAT&quot;;
$si[11] = &quot;TAA&quot;;
$si[12] = &quot;TAG&quot;;
$si[13] = &quot;TGC&quot;;
$si[14] = &quot;TGT&quot;;
$si[15] = &quot;TGA&quot;;
$si[16] = &quot;TGG&quot;;
$si[17] = &quot;CTA&quot;;
$si[18] = &quot;CTC&quot;;
$si[19] = &quot;CTG&quot;;
$si[20] = &quot;CTT&quot;;
$si[21] = &quot;CCA&quot;;
$si[22] = &quot;CCC&quot;;
$si[23] = &quot;CCG&quot;;
$si[24] = &quot;CCT&quot;;
$si[25] = &quot;CAC&quot;;
$si[26] = &quot;CAT&quot;;
$si[27] = &quot;CAA&quot;;
$si[28] = &quot;CAG&quot;;
$si[29] = &quot;CGA&quot;;
$si[30] = &quot;CGC&quot;;
$si[31] = &quot;CGG&quot;;
$si[32] = &quot;CGT&quot;;
$si[33] = &quot;ATA&quot;;
$si[34] = &quot;ATC&quot;;
$si[35] = &quot;ATT&quot;;
$si[36] = &quot;ATG&quot;;
$si[37] = &quot;ACA&quot;;
$si[38] = &quot;ACC&quot;;
$si[39] = &quot;ACG&quot;;
$si[40] = &quot;ACT&quot;;
$si[41] = &quot;AAC&quot;;
$si[42] = &quot;AAT&quot;;
$si[43] = &quot;AAA&quot;;
$si[44] = &quot;AAG&quot;;
$si[45] = &quot;AGC&quot;;
$si[46] = &quot;AGT&quot;;
$si[47] = &quot;AGA&quot;;
$si[48] = &quot;AGG&quot;;
$si[49] = &quot;GTA&quot;;
$si[50] = &quot;GTC&quot;;
$si[51] = &quot;GTG&quot;;
$si[52] = &quot;GTT&quot;;
$si[53] = &quot;GCA&quot;;
$si[54] = &quot;GCC&quot;;
$si[55] = &quot;GCG&quot;;
$si[56] = &quot;GCT&quot;;
$si[57] = &quot;GAC&quot;;
$si[58] = &quot;GAT&quot;;
$si[59] = &quot;GAA&quot;;
$si[60] = &quot;GAG&quot;;
$si[61] = &quot;GGA&quot;;
$si[62] = &quot;GGC&quot;;
$si[63] = &quot;GGG&quot;;
$si[64] = &quot;GGT&quot;;

my ($seq)=@_;

my $ncod=0;

$len= length $seq;
@stats=();

#Calculates the statistics of specified codons

for ($k=0;$k&lt;=$len ;$k+=3) {
$cod= substr($seq, $k, 3);
#TAG (&quot;amber&quot;)
#TAA (&quot;ochre&quot;)
#TGA (&quot;opal&quot; or &quot;umber&quot;)
if ($cod eq &#039;TAG&#039; or $cod eq &#039;TAA&#039; or $cod eq &#039;TGA&#039;) {$sco++}
$stat[$ri{$cod}]++;
$ncod++;
}
for ($k=1;$k&lt;=64;$k++) {
$per=100.*$stat[$k]/$ncod;
push(@tstat,$per);
push(@sym,$si[$k]);
push(@test,$si[$k],$per);
}

print &quot;---------------------------------------------------------------\n&quot;;
print &quot; CODON COMPOSITION SUMMARY\n&quot;;
printf &quot;TOTAL NUMBER OF CODONS = %8d \n&quot;,$ncod;
print &quot;CODON ABOUNDANCE \n&quot;;

#
# Make the codon usage histogram
#

@data1 =();
push(@data1,\@sym); # push x labels into plot data
push(@data1,\@tstat);

#
# Sort the hash list
#
$cc=0;
%codo=@test;
foreach my $name ((sort { $codo{$b}  $codo{$a} } keys %codo)) {
printf "%-8s: %5.1f\n", $name, 0+$codo{$name} if ($cc new(800, 300);

$my_graph2-&gt;set(
x_label =&gt; 'Codons',
y_label =&gt; 'Percentage',
title =&gt; '',
cumulate =&gt; 2,
bar_spacing =&gt; 1,
# shadow_depth =&gt; 4,
dclrs =&gt; [ qw(red black green blue) ],
transparent =&gt; 0,
);
$my_graph2-&gt;set('x_labels_vertical'=&gt; 1);
my $myimage = $my_graph2-&gt;plot(\@data1) or die $my_graph2-&gt;error;

open(PICTURE, '&gt;codon.png') or die("Cannot open file for writing");

# Make sure we are writing to a binary stream
binmode PICTURE;

# Convert the image to PNG and print it to the file PICTURE
print PICTURE $myimage-&gt;png;
close PICTURE;

return $ncod;

}

"… I seem […] only like a boy playing on the sea-shore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me". – Isaac Newton.

PERL Programming II: Applications to Bioinformatics

PROJECT 1: GENOME SEQUENCE ANALYSIS

Examples

1. Analysis of a bacterial genome of Lyme borreliosis disease vector

2. Analysis of the genome of tracheobronchitis causing bacterium

Other internet resources

APPENDIX

NOTE: I disclose from any responsibility for possible errors in the reported program but I encourage to notify the problem so that I can correct it!

Leave a comment Cancel reply

PROJECT 1: GENOME SEQUENCE ANALYSIS

Examples

1. Analysis of a bacterial genome of Lyme borreliosis disease vector

2. Analysis of the genome of tracheobronchitis causing bacterium

Other internet resources

APPENDIX

NOTE: I disclose from any responsibility for possible errors in the reported program but I encourage to notify the problem so that I can correct it!

Share this:

Related

Leave a comment Cancel reply