Introduction to the PERL Language

PERL is often expanded as Practical Extraction and Report Language. This scripting language was initially developed by Larry Wall to extend the capabilities of the awk and sed programs for text manipulation and as a tool for Unix system administration. It takes the best features of many other languages, such as C, sed, and awk. In addition, Perl supports both procedural and object-oriented programming. Perl became one of the most popular web programming languages thanks to its text manipulation capabilities and rapid development cycle. The same capabilities have made it a precious support for bioinformaticians who mine the large amount of genetic information rapidly accumulating from molecular biology research.

These are notes from a course on Bioinformatics that I taught at the Jacobs University Bremen (Germany) in 2009. This introduction is far from comprehensive, but it gives you some guidelines to start using this sophisticated scripting language.

What can you do with Perl?

Perl is a complete programming language that combines the capabilities of traditional compiled languages with ad hoc tools for parsing text. It is supported by a huge number of libraries that extend its capabilities for specific applications. For the purpose of this book, some examples of what it can do are reported in the following list.

  • Searching sequence databases with regular expression patterns.
  • Parsing entries in a database (e.g. reading a GenBank file and extracting its sequences).
  • Converting entries from one format to another (e.g. GenBank to FASTA format).
  • Parsing the output of sequence analysis tools.
  • Pre-processing and post-processing data for computational biology applications (e.g. molecular dynamics simulations).

Getting started with Perl programming

A Perl program can be written using a text editor and saved as a text file. As with the awk language, a very short program can be executed directly on the command line. Consider, for example, this first short Perl program:

perl -e "print 'Hello Molecular world.';"

The language interpreter is called using the command perl followed by flags that specify the mode in which the language is executed.

Among the available flags, these three are the most commonly used:

-e : direct execution of the Perl commands given on the command line (as in the example)

-w : prints warnings that help with error checking

-d : runs the code under the debugger

A Perl program saved in a file can be executed by typing the command:

perl namefile.pl

The file can be made self-executing by adding the line

#! /usr/bin/perl -w

at the beginning of the program and making the file executable using the command chmod (see my article on Unix). In Figure 1 a simple example of a self-executable Perl script is reported.

Figure 1: Example of a self-executable Perl script.

PERL PROGRAM STRUCTURE

The general structure of a PERL program is given in the following flowchart

Figure 2: Flow Chart describing the general structure of a PERL program.

The program starts with the definition of the variables that store the information processed by the program. Perl does not require variables to be declared: a variable can be created at the moment of its first use, and its type is determined automatically by its contents. However, it is good programming practice to declare the variables at the beginning. This habit keeps the program well structured and makes it much easier to debug. The variable assignment block is usually followed by an input block that gathers data from external devices such as the keyboard or hard disks. The program block is the part that performs the data processing. This part of the program can be structured in subroutines or functions that perform specific tasks. The last block is the output, where the results are printed, stored, or represented graphically on the screen. The program can then return to the program block or terminate.

Perl scalar variables and operators

Scalar variables are defined using the symbol $ in front of them. Perl does not need a declaration of the variables, but it is always good practice to declare variables at the beginning of the script. This helps to keep track of the variables used in the program.

For example, the simple scalar variable “aminoacids” is written as

$aminoacids

and to assign the value of 20 to $aminoacids you can write:

$aminoacids = 20;

Arrays are defined using the symbol @ at the beginning of the array name. For example, an array containing the names of the amino acids is defined as @aminoacids.

Since the indices start from 0, the 5th element of this array is given by

$aminoacids[4];

the ‘Ala’ value from hash %aminoacids:

$aminoacids{'Ala'};

the last index of array @aminoacids:

$#aminoacids;
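
The access rules above can be tried in a short sketch. The contents of the array and of the hash are illustrative assumptions, not data from the course; the three small subroutines are hypothetical helpers that only wrap each kind of access.

```perl
use strict;
use warnings;

# Illustrative data (an assumption made for this sketch).
my $aminoacids = 20;                                       # scalar
my @aminoacids = qw(Ala Gly Trp Ser Cys Met);              # array
my %aminoacids = (Ala => 'Alanine', Gly => 'Glycine');     # hash

# Scalars, arrays and hashes live in separate namespaces,
# so the same name can be used for all three.
sub fifth_element { return $aminoacids[4]; }   # indices start at 0
sub last_index    { return $#aminoacids; }     # number of elements minus 1
sub full_name     { my ($code) = @_; return $aminoacids{$code}; }

print "number of amino acids: ", $aminoacids, "\n";
print "fifth element: ", fifth_element(), "\n";
print "last index: ", last_index(), "\n";
print "Ala stands for: ", full_name('Ala'), "\n";
```

Note how the symbol in front of the name follows what is returned: a single element of an array or of a hash is a scalar, so it is written with $.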

Global Scalar Variables

PERL has many built-in global scalar variables, and they are very important. One of the most important is $_, the default scalar variable. The following functions and operators work with the $_ variable if you do not explicitly specify the scalar variable on which they are to operate.

  • Pattern-matching operator
  • Translation operator
  • chop
  • Substitution operator
  • <>
  • print
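
A minimal sketch of how the default variable is used implicitly. The input lines and the helper count_atg are hypothetical and were chosen only for illustration; chomp, like chop, also works on $_ when no argument is given.

```perl
use strict;
use warnings;

# Count the elements of a list that contain the start codon ATG.
# Inside foreach, each element is aliased to $_, and the match
# operator /ATG/ tests $_ because no variable is given.
sub count_atg {
    my @lines = @_;
    my $count = 0;
    foreach (@lines) {
        $count++ if /ATG/;
    }
    return $count;
}

my @lines = ("ccATGcc\n", "ggttaa\n", "ATGATG\n");   # hypothetical input
foreach (@lines) {
    chomp;        # same as chomp($_)
    print;        # same as print $_
    print "\n";
}
print "lines with ATG: ", count_atg(@lines), "\n";
```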

CAVEAT: Perl has a very flexible syntax, and you can play with it, writing code that is impossible to read for other people, or even for yourself after a few months. Figure 3 shows an extreme example of obfuscated code that still works. It is one of the many winning entries of the Perl obfuscated code contest.

Figure 3: The Camel code: an example of a PERL obfuscated program. The source code of this example (and many others) can be found here.

Therefore, do not obfuscate your Perl scripts, and get used to commenting them!

OPERATORS

In the following table, a list of Perl operators is reported together with some examples of usage.

Table 1: Some examples of mathematical operators.

Type of Operator   Operator                       Examples

Arithmetic         +, -, /, *                     10+25, 9/3, (34+5)-6*8
                   **: exponent                   3**2
                   ++, --: increment, decrement   $x++, $t--
                   %: modulus                     8%3
Comparison         <, <=, ==, >=, >, !=           $x<$y, $x==$y, $Tr != $Fa
Logical            &&, ||, !                      ($t<$y)&&($y<$z)||(!$k)
Assignment         =, +=                          $a=5, $b+=$a

EXAMPLE 1

In this example, a simple program for calculating the roots of a second-order algebraic equation is shown. The program also uses the built-in mathematical square root function: sqrt().

# Coefficients of the equation a*x^2 + b*x + c = 0
$a = 5;
$b = 3;
$c = 0;
# Discriminant (it must not be negative, or sqrt will fail)
$det = $b*$b - 4*$a*$c;
$x1 = (-$b + (sqrt $det))/(2*$a);
$x2 = (-$b - (sqrt $det))/(2*$a);
print "discriminant is: $det\n";
print "solution n1 is: $x1\n";
print "solution n2 is: $x2\n";
exit;

String Variables

A string is a scalar value in Perl, unlike other languages like C. This means you can store a string in a scalar variable. Perl has built-in operators on strings. There are also many built-in functions on strings in Perl.

Table 2: Some examples of string operators.

Operator Type    Operator                  Example

Concatenation    .                         "Romeo"." and Juliet"
Comparison       lt, gt, le, ge, eq, ne    "many" gt "few"
Repetition       x                         "Ciao"x4
Assignment       =, .=                     $st="Romeo"; $st.=" and Juliet"
Interpolation    "$string"                 $s="dna"; $t="type=$s";

EXAMPLE 2

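In this example, two string sequences are assigned and printed using the concatenation, repetition and interpolation operators.

$aa1 = "abc";
$aa2 = "def";
$pept = $aa1.$aa2;          # concatenation: abcdef
$pentapept = "ok"x5;        # repetition: okokokokok
print "the sequence is: $pept\n";
print "the repeated sequence is: $pentapept\n";
exit;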

EXAMPLE 3

This example uses the command reverse to reverse a given string, which is the first step in checking whether it is a palindrome. It can be used on a word or a sentence, but also on nucleic acid or amino acid sequences.


$string = "Madam I'm Adam";
$rev = reverse $string;

# $rev now contains a copy of the string in reverse order.

print "$rev\n";

If you want to input the  word directly from the keyboard then you can use the commands:

$in = <STDIN>;

chomp ($in);

to read the string in $in and eliminate the newline  (\n) character at the end of it.


print "please enter the phrase\n";
$verse = <STDIN>;
chomp ($verse);
$reverse = reverse $verse;
print "the reverse phrase is: $reverse\n";
exit;
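
The palindrome check itself can be sketched as follows. The subroutine name is_palindrome and the choice to ignore case, spaces and punctuation are assumptions made for this example, not part of the original notes.

```perl
use strict;
use warnings;

# Return 1 if the text reads the same in both directions,
# ignoring case and any character that is not a letter.
sub is_palindrome {
    my ($text) = @_;
    my $clean = lc $text;        # lower case
    $clean =~ s/[^a-z]//g;       # keep the letters only
    my $rev = reverse $clean;    # scalar assignment: the string is reversed
    return ($clean eq $rev) ? 1 : 0;
}

foreach my $phrase ("Madam I'm Adam", "ACGTTGCA", "GAATTC") {
    my $answer = is_palindrome($phrase) ? "is" : "is not";
    print "'$phrase' $answer a palindrome\n";
}
```

Note that this is a palindrome in the textual sense only: a restriction site such as GAATTC equals its reverse complement, not its plain reverse, so the sketch reports it as not palindromic.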

 

Strings Manipulation

Perl contains many functions for manipulating character strings. Two useful ones are index() and substr(). The first finds the first occurrence of a substring in a given string and returns the position (counting from 0) of the first character of the substring, or -1 if the substring is not found; an optional third argument gives the position from which the search starts. The second extracts from a string the substring that starts at a given position and has a given length (or that runs to the end of the string if the length is omitted). The following examples show how to use the two functions on this sequence of characters:

$seq = "sdftdfgkdjxznfdfggjdd";

COMMAND                            RESULT (the affected part of $seq is in brackets)

$first=index($seq,"xzn");          # 10:      sdftdfgkdj[xzn]fdfggjdd

$first=index($seq,"dfg",7);        # 14:      sdftdfgkdjxznf[dfg]gjdd

$first=substr($seq,5,6);           # fgkdjx:  sdftd[fgkdjx]znfdfggjdd

$first=substr($seq,14);            # dfggjdd: sdftdfgkdjxznf[dfggjdd]

EXAMPLE 4


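print "please enter the phrase\n";
$verse = <STDIN>;
chomp ($verse);

$first1 = index($verse,"abc");      # position of the first "abc" (-1 if absent)
$first2 = index($verse,"abc",3);    # same search, starting from position 3
$first3 = substr($verse,3,4);       # 4 characters starting at position 3
$first4 = substr($verse,4);         # everything from position 4 to the end

print "results: $first1 $first2 $first3 $first4\n";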

ARRAY VARIABLES 

Let us consider the array @aminoacids. We can assign the elements of the array in these different ways:

  • @aminoacids = ("Ala", "Gly", "Trp", "Ser");
  • @aminoacids = qw(Ala Gly Trp Ser);

A single element of the array can be accessed as $aminoacids[0], $aminoacids[1],… $aminoacids[n]. Note that the indices start from 0, not 1. In this case, $aminoacids[0] contains “Ala”. It is possible to copy a subset of the array to another array using the so-called slicing operation.

@aminoacids[3,4,5] is a slice of three elements, and it can be assigned to a new array as @aa3=@aminoacids[3,4,5] (the name of a variable cannot start with a digit). Since the elements have contiguous indices, the slicing can also be written with the range operator: @aa3=@aminoacids[3..5]. In the same way, the command @amino = @aminoacids[2..3] copies the third and fourth elements into the array @amino. The array length is obtained by assigning the array to a scalar variable, as in this example: $len=@aminoacids.
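
The slicing operations can be checked with a short sketch. The array content is an illustrative assumption (six elements, so that the indices 3 to 5 exist).

```perl
use strict;
use warnings;

my @aminoacids = qw(Ala Gly Trp Ser Cys Met);   # illustrative content

my @aa3   = @aminoacids[3,4,5];   # explicit list of indices
my @same  = @aminoacids[3..5];    # range operator, same slice
my @amino = @aminoacids[2..3];    # third and fourth elements
my $len   = @aminoacids;          # an array in scalar context gives its length

print "slice: @aa3\n";
print "range slice: @same\n";
print "third and fourth: @amino\n";
print "length: $len\n";
```

A slice is written with @ because it returns a list, while a single element is written with $.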

Splitting and joining strings

As for strings, there are many built-in functions that can be used to manipulate arrays. The command @flds=split is equivalent to @flds=split(' ',$_), which behaves like @flds=split(/\s+/,$_) except that leading white space is ignored. It is used to separate the fields of the default variable $_ at the white spaces and to store the resulting fields in the array named @flds.

These are examples of splitting at different separator characters:

@flds=split(/,/);                      # Split $_ by a single comma

@flds=split(/:/,$s);                   # Split $s by a single colon

@flds=split(/,+/);                     # Split $_ by one or more commas

The command join is used to put together the separate fields in the array @frags:

$seq=join("",@frags);            # Join @frags into a string $seq without spaces.

$gs = "--";

$seq=join($gs,@frags);           # Join @frags into $seq separated by the string $gs

The following built-in functions perform other common operations on arrays. The function shift can be used to extract the left-most element from the array:

$leftmost=shift(@arr);

The function pop can be used to extract the right-most element from the array:

$rightmost=pop(@arr);

The function unshift inserts an element on the left:

unshift(@arr,$insertleft);

The function push inserts an element on the right:

push(@arr,$insertright);

Finally, the function reverse reverses the order of the array.

@rarr=reverse(@arr);
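
The functions of this section can be followed step by step in a small sketch. The comma-separated record and the inserted codons are hypothetical values chosen only to make each step visible.

```perl
use strict;
use warnings;

my $line = "ATG,GCC,,TGA";            # hypothetical comma-separated record
my @flds = split(/,+/, $line);        # one or more commas as separator
my $seq  = join("", @flds);           # ATGGCCTGA

my @arr = @flds;                      # (ATG, GCC, TGA)
my $leftmost  = shift(@arr);          # removes ATG, leaving (GCC, TGA)
my $rightmost = pop(@arr);            # removes TGA, leaving (GCC)
unshift(@arr, "TTT");                 # (TTT, GCC)
push(@arr, "AAA");                    # (TTT, GCC, AAA)
my @rarr = reverse(@arr);             # (AAA, GCC, TTT)

print "fields: @flds\n";
print "joined: $seq\n";
print "leftmost: $leftmost, rightmost: $rightmost\n";
print "reversed: @rarr\n";
```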

FLOW CONTROL

Flow control is the order in which the statements of a program are executed. The program executes the statements in order, from the first statement at the top to the last statement at the bottom, unless told to do otherwise. There are two ways to tell a program to do otherwise: conditional statements and loops.

If-elsif statement

This statement can be used in different forms:

  • if ( c ) {s1; s2; s3;}
  • if ( c )
    {s1; s2; s3;}
    else
    {s4; s5; s6;}
  • if (c1)
    {s1;}
    elsif (c2)    # note: it is written elsif and not elseif
    {s2;}
    elsif (c3)
    {s3;}
    else
    {s4;}

EXAMPLE

if ($i > 0) { print $i; }
if ($bb =~ /A/) {
    print "Adenine", "\n"; }
elsif ($bb =~ /T/) {
    print "Thymine", "\n"; }

The statement unless is the opposite of if: its block is executed only when the condition is false.

CONDITIONAL TESTS

If you compare two strings, you should use eq, ne, lt, gt, ge or le. If you compare two numbers, you should use ==, !=, >, >=, <, <= instead.

EXAMPLES OF COMPARISONS

  • $num == 3 (Notice: == NOT =)
  • $str eq "Sunday"
  • $str ne "Saturday"
  • $var =~ /abc/
  • $var !~ /abc/
  • if (1 == 1) {…}                              # always true
  • if (1) {…}                                      # always true
  • if (1 == 0) {…}                              # always false
  • if (0) {…}                                      # always false
  • unless (1 == 0) {…}                     # always executed (the condition is false)
  • while ($n < 5) {$n = $n + 1;}    # $n is incremented until its value is equal to 5
  • while (1) {…}                              # infinite while loop

ITERATION IN PERL: Loops and loop controls

Iteration in Perl can be performed using a plethora of loop statements:

  • for ()
  • foreach
  • while (), do-while
  • until, do-until

and the loop can be controlled using the statements: next, last, continue, redo

The statement for()

The syntax of the command is

for (init_exp; test_exp; iterate_exp)

            {s1; s2; s3;}

 EXAMPLE

In this example, the content of each element of the array @arr is incremented by one.

for ($i = 0; $i < @arr; $i++)
   {
     $arr[$i]++;
   }

The statement foreach 

The syntax of the command is

foreach $i (@some_list)

            {s1; s2; s3;}

EXAMPLE 

This loop performs the same action as in the previous example. 

foreach $e (@arr)
    {$e++;}

The statements while and do-while 

The syntax of the command is

while (condition)

            {s1; s2; s3;}

EXAMPLE

while ($i < 10)
   {
     print "i is still less than 10\n";
     $i++;
   }

The syntax of the command do-while is:

do {s1; s2; s3;}

            while (condition);

EXAMPLE

do {
      print "i is less than 10\n";
      $i++;
   } while ($i < 10);

The statements until and do-until 

The syntax of the command until is:

until (condition)

            {s1; s2; s3;}

EXAMPLES

until ($i >= 10)
   {
     print "i is still less than 10\n";
     $i++;
   }

The syntax of the command do-until is:

do {s1; s2; s3;}

            until (condition);

EXAMPLES

do {
   print "i is still less than 10\n";
   $i++;
} until ($i >= 10);

BREAKING AND CONTINUING LOOPS: the last and next commands 

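The command last exits the loop immediately:

while (condition) {
    s1;
    if (condition2) {last;}   # leave the loop: s2 and s3 are skipped
    s2;
    s3;
}

The command next skips the rest of the current iteration and goes back to test the condition:

while (condition) {
    s1;
    if (condition2) {next;}   # start the next iteration: s2 and s3 are skipped
    s2;
    s3;
}

EXAMPLE

foreach $amino_acid (@aminos) {
  print "$amino_acid\n";
  if ($amino_acid eq "H")
    {
      print "Found it.\n";
      last;
    }
}
#
# And with the command next
#
foreach $amino_acid (@aminos) {
   print "$amino_acid\n";
   if ($amino_acid eq "H") {
      print "Found it.\n";
      next;
   }
}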
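
The different effect of last and next is easier to see when some statements follow the test. In this sketch, the two subroutines and the list of residues are hypothetical and serve only to compare the two commands on the same input.

```perl
use strict;
use warnings;

# Collect the residues seen before the first histidine (H):
# last leaves the loop as soon as H is found.
sub residues_before_H {
    my @seen;
    foreach my $amino_acid (@_) {
        last if $amino_acid eq "H";
        push(@seen, $amino_acid);
    }
    return @seen;
}

# Collect every residue except histidine:
# next skips the push for H and moves on to the following element.
sub residues_without_H {
    my @kept;
    foreach my $amino_acid (@_) {
        next if $amino_acid eq "H";
        push(@kept, $amino_acid);
    }
    return @kept;
}

my @aminos = qw(A G H K H L);
print "before the first H: ", join("", residues_before_H(@aminos)), "\n";
print "without H: ", join("", residues_without_H(@aminos)), "\n";
```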

SUMMARY EXAMPLE 

This program asks for an amino acid sequence. Then it performs a statistical analysis of the occurrence of the single amino acids and writes the output as a distribution histogram.


print "Amino acids sequence analyzer.\n";

print "Please enter amino acid sequence:\n";
$seq = <STDIN>;
chomp($seq);

$len = length($seq);
print "Length of sequence: $len\n";

$aa ='ACDRSTWYKNQHILFGPMVE';
@aminoacids = qw(A C D R S T W Y K N Q H I L F G P M V E);

print "Calculating amino acids frequency statistics...\n";

for($i=0;$i <= ($len-1);$i++) {
   $val = substr($seq,$i,1);
   $k = index($aa,$val);
   print "$val $k \n";
   if ($k >= 0) {
     $acount[$k]++;
   }
   else {
     print "\n";
     print "Unidentified amino acid: ",$val, " \n";
     print "at sequence position: ",$i+1,"\n";
     exit;
   }
}
print "The sequence statistics:\n";
# Make a histogram plot

for ($k=0;$k <= 19;$k++) {
  if ($acount[$k]) {
    print "$aminoacids[$k] ($acount[$k]) :";
    for ($i=0;$i <= ($acount[$k]-1);$i++) {
      print "#";
    }
    print "\n";
  }
}
exit;

 

How to replace characters in a given sequence

1) Substitution operator

Syntax: s/SEARCHLIST/REPLACEMENTLIST/SCO

with the Substitution Command Options (SCO)

e          Evaluate the right side as an expression.
g          Replace globally, i.e., all occurrences.
i           Do case-insensitive pattern matching.
m        Treat string as multiple lines.
o          Compile pattern only once.
s          Treat string as single line.
x          Use extended regular expressions.

EXAMPLE 1


# Convert all the nucleotides from lower case to upper case
# in the sequence $seq.
$seq =~ s/a/A/g;
$seq =~ s/c/C/g;
$seq =~ s/g/G/g;
$seq =~ s/t/T/g;

# or, for any lower-case letter (\U converts the match to upper case)
$seq =~ s/([a-z])/\U$1/g;

# Converts a DNA sequence to an RNA sequence
# by changing all thymines (T) to uracils (U)

$seq =~ s/T/U/g;

2) Translate operator

Syntax: tr/SEARCHLIST/REPLACEMENTLIST/SCO

with the Translation Command Options (SCO):

c:   complement (invert) the search list;
d:   delete found but unreplaced characters;
s:   squash duplicate replaced characters.

EXAMPLES

# Upper case to lower case
$var =~ tr/A-Z/a-z/;
# Count the stars in $_
$count = tr/*/*/;
# count the stars in $sky
$count = $sky =~ tr/*/*/;
# count the digits in $_
$count = tr/0-9//;
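
The counting form of the translate operator is handy for sequence statistics. The following sketch computes the GC content of a DNA sequence; the subroutine name gc_content is a hypothetical choice made for this example.

```perl
use strict;
use warnings;

# Return the percentage of G and C in a DNA sequence.
# tr/// with an empty replacement list leaves the string unchanged
# and returns the number of characters found in the search list.
sub gc_content {
    my ($seq) = @_;
    my $len = length($seq);
    return 0 if $len == 0;              # avoid a division by zero
    my $gc = ($seq =~ tr/GCgc//);
    return 100 * $gc / $len;
}

printf "GC content: %.1f%%\n", gc_content("ATGCGC");
```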

EXAMPLE

The following program asks for a DNA sequence (or stops when the sequence “quit” is entered), and on this sequence it

  • calculates the occurrence of the bases and draws a simple histogram;
  • finds the reverse of the sequence;
  • finds the complement of the sequence;
  • finds the reverse complement;
  • transcribes the DNA sequence into the corresponding RNA.

These tasks can be simply accomplished using the command  (tr)

$tras =~ tr/ACGT/ACGU/;

with the help of the translation Table 3.

Table 3: Nucleotide symbols and their complements.

Symbol  Name                Bases      Compl. base

A       Adenine             A          T
T       Thymine             T          A
U       Uracil (RNA only)   U          A
G       Guanine             G          C
C       Cytosine            C          G
Y       pYrimidine          C T        R
R       puRine              A G        Y
K       Keto                T/U G      M
M       aMino               A C        K
B       not A               C G T      V
D       not C               A G T      H
H       not G               A C T      D
V       not T/U             A C G      B
N       Unknown             A C G T    N

#!/usr/bin/perl -w
#
# Initialize variables
#
$dna="";

while ($dna ne "quit") {
   print "Give sequence to be analysed : \n";
   $dna = <STDIN>;
   chomp($dna);

   if ($dna ne "quit") {
#
#  Capitalize the nucleotide symbols
#
      $dna=~s/a/A/g;
      $dna=~s/c/C/g;
      $dna=~s/g/G/g;
      $dna=~s/t/T/g;
      $l=length($dna);
#
# Count the nucleotides (the counters are reset for each sequence)
#
      $a=0;$g=0;$c=0;$t=0;
      for ($i=0;$i<$l;$i++) {
         $d = substr($dna,$i,1);
         if ($d eq "A") {$a++;}
         elsif ($d eq "G") {$g++;}
         elsif ($d eq "C") {$c++;}
         elsif ($d eq "T") {$t++;}
      }
#
# Print the histograms
# number of occurrences in parentheses
#
      print "A ($a) :";
      for ($i=0;$i<$a;$i++) {
         print "#";
      }
      print "\n";

      print "T ($t) :";
      for ($i=0;$i<$t;$i++) {
         print "#";
      }
      print "\n";

      print "G ($g) :";
      for ($i=0;$i<$g;$i++) {
         print "#";
      }
      print "\n";

      print "C ($c) :";
      for ($i=0;$i<$c;$i++) {
         print "#";
      }
      print "\n";
#
# Find the reversed, the complementary and
# the reverse complementary DNA sequences
#
      $rev=reverse $dna;
      $comp=$dna;
      $comp=~tr/ACGT/TGCA/;
      $revcomp=reverse $comp;
      print " Reversed: $rev \n";
      print " Complementary: $comp \n";
      print " Complementary reversed: $revcomp \n\n";
#
# Convert the DNA sequence to the corresponding RNA sequence.
#
      $rna=$dna;
      $rna=~s/T/U/g;
      print " DNA sequence: $dna \n";
      print " RNA sequence: $rna \n\n";
   }

}
exit;

PATTERN MATCHING 

A pattern is a sequence of characters to be searched for in a string. A pattern is defined using a regular expression. In Perl, patterns are normally enclosed in slash characters: /def/. Pattern matching is performed using the special operators =~ and !~ in these ways:

$var =~ /PATTERN/cgimosx

Ex.: $var =~ /abc/;

OR

$var !~ /PATTERN/cgimosx

Ex.: $var !~ /abc/;

EXAMPLE


# Prints all lines containing ATG.
# Again, $_ is the special default
# variable.
while(<>) {
   if (/ATG/) {
      print;
   }
}
#
# alternatively
#
while(<>) {
   if ($_ =~ /ATG/) {
      print $_;
   }
}

REGULAR EXPRESSIONS

A regular expression is a string of characters that may match many different strings, thanks to the use of meta-characters. Regular expressions are essentially a tiny, highly specialized programming language of their own. They are built into Unix shell tools and several languages, such as awk and Python. Perl uses regular expressions for pattern matching and substitution.

Regular Expression Operators

Operator     Example           Explanation

.            /A...G/           Matches an A, followed by any three characters, and then a G
[]           /[ACT]/           Matches a single A, C, or T
^            /[^AC]/           Matches a single character other than A or C
+            /AC+G/            Matches an A, one or more Cs, and then a G
?            /AC?G/            Matches an A, zero or one C, and then a G
{}           /AC{5,10}T/       Matches an A followed by 5 to 10 Cs, and then a T
-            /[a-z]/           Matches any lower-case letter
|            /GT|AG/           Matches GT or AG
()           /(CGG)*/          Matches 0 or more repeats of CGG
^            /^>/              Matches ">" at the beginning of the string
$            /GT$/             Matches GT at the end of the string
*            /A*/              Matches 0 or more As
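
A few of these operators can be combined in small tests. The three subroutines below are hypothetical helpers written for this sketch; each one wraps a single pattern so that its behaviour can be checked separately.

```perl
use strict;
use warnings;

# ^ anchors the match: the line must start with ">".
sub is_fasta_header {
    my ($line) = @_;
    return ($line =~ /^>/) ? 1 : 0;
}

# ^...$ with a character class and +: only A, C, G or T, at least one.
sub is_dna {
    my ($seq) = @_;
    return ($seq =~ /^[ACGT]+$/) ? 1 : 0;
}

# () groups CGG and {3,} asks for three or more consecutive repeats.
sub has_cgg_repeat {
    my ($seq) = @_;
    return ($seq =~ /(CGG){3,}/) ? 1 : 0;
}

foreach my $line (">sample dna", "AACGGCGGCGGTT", "ACGU") {
    print "$line: header=", is_fasta_header($line),
          " dna=", is_dna($line),
          " cgg_repeat=", has_cgg_repeat($line), "\n";
}
```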

Reading and writing files

We have already used the standard input (keyboard), which is indicated by the special filehandle STDIN.

EXAMPLE

$line = <STDIN>;

If the input is read from the standard input or from the file(s) specified on the command line, the following syntax is used.

$line = <>;

For the standard output, the filehandle STDOUT is used. STDOUT is the default, therefore you do not have to specify it.

EXAMPLE

print ">", $head, "\n", "$seq";

This prints a sequence in FASTA format to the standard output (usually the screen).

Read command line files

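When a script is launched with file names on the command line, for example

myperl.pl dna.dat protein.dat

the special array @ARGV contains the list of the command line arguments:

$ARGV[0] = dna.dat
$ARGV[1] = protein.dat

A file from this list can then be opened with

open(FILE,$ARGV[1]);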

File Handles

Use a filehandle if you want to read from a file not given on the command line, or to write to a file instead of the standard output (>).

EXAMPLE

open(IN, "r.txt");           # open r.txt for reading
$line=<IN>;
close(IN);

open(OUT, ">w.txt");         # open w.txt for writing
print OUT $line;
close(OUT);

open(AP, ">>a.txt");         # open a.txt for appending
print AP $line;
close(AP);

 EXAMPLE

#!/usr/bin/perl -w
#Reading protein sequence data from a file
…
open(PROTEINFILE, "/home/pippo/file.dat")
or die "Can't open the input file";
…
# First line
$protein = <PROTEINFILE>;
#
# Chomp off the new line (\n) character if present at the
# end of $protein
#
chomp($protein);
…
# Second line
$protein = <PROTEINFILE>;
…
close (PROTEINFILE);

EXAMPLE

Read the following FASTA file

>sample dna | (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgacaactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggccatccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctatcggcacaagaagtcacgggagcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggagggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtcagggacaggggttggggccatgcttgctcggggctctgcttcgccccacaaatcctctccgcagcccttggtggccacacccagccagcatcaccagcagcagcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggcatgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaagaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagtgccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctcaccagtgacgccctcagagtccctgccaaggccccgccggccactgcccacacacctgagccactctcagatgaggaccta

Note: $/ and $\ are built-in global variables that define the input and output record separators. The default value of $/ is “\n”.

EXAMPLE

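#!/usr/bin/perl -w
#Reading protein sequence data from a file
…
open(PROTEINFILE,"/home/pippo/file.dat");
…
# An empty separator switches to paragraph mode:
# each read returns the text up to the next blank line
$/ = "";
…
$protein = <PROTEINFILE>;
…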
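
Changing the input record separator is also a convenient way to read FASTA records one at a time. The sketch below is only one possible approach: the subroutine read_fasta, the choice of ">" as separator and the in-memory test data are assumptions made for this example (a filehandle opened on a reference to a scalar behaves like a file, which keeps the sketch self-contained).

```perl
use strict;
use warnings;

# Read FASTA records from an open filehandle and return a hash
# that maps each header line to its sequence.
sub read_fasta {
    my ($fh) = @_;
    my %records;
    local $/ = ">";                 # one record per read instead of one line
    while (my $record = <$fh>) {
        chomp($record);             # removes the trailing ">" (the value of $/)
        next if $record eq "";      # nothing precedes the first ">"
        my ($head, @lines) = split(/\n/, $record);
        $records{$head} = join("", @lines);
    }
    return %records;
}

# A small in-memory "file" with hypothetical data.
my $text = ">seq1 test\nagat\nggcg\n>seq2\nttga\n";
open(my $fh, "<", \$text) or die "Can't open the in-memory file";
my %records = read_fasta($fh);
close($fh);

foreach my $head (sort keys %records) {
    print ">", $head, "\n", $records{$head}, "\n";
}
```

The statement local restores the previous value of $/ when the subroutine returns, so the rest of the program keeps reading line by line.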

Checking for file existence

#!/usr/bin/perl -w
# Reading protein sequence data from a file,
# First we have to "open" the file, and in case the open fails,
# print an error message and exit the program
# Notice: try to change the file name (not the directory
# name) and see what happens.
unless (open(PROTEINFILE, "/home/pippo/file.dat"))
{
print "Could not open the file.\n"; exit;
}
# Read the protein sequence data in a "while" loop
# print each line as it is read
while ($protein = <PROTEINFILE>) {
  print "#### Here is the next line of the file:\n";
  print $protein;
}
# Close the file
close(PROTEINFILE);
exit;

Reading formatted input

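Files with fixed-width columns can be read by extracting each field with substr(), as in the examples seen before (the extracted part is in brackets):

$seq = "sdftdfgkdjxznfdfggjdd";

$first=substr($seq,5,6);            # fgkdjx:  sdftd[fgkdjx]znfdfggjdd

$first=substr($seq,14);             # dfggjdd: sdftdfgkdjxznf[dfggjdd]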

EXAMPLE

Read a coordinate file in Protein Data Bank (PDB) format

PDB coordinate format:

COLUMNS     DATA TYPE               DEFINITION

1-6         Record Name             "ATOM" or "HETATM"
7-11        Integer                 Atom Number
12          space
13-16       Atom                    Atom Name
17          Character               Alternate location indicator
18-20       Characters              Residue Name
21          space
22          Character               Chain identifier
23-26       Integer                 Residue sequence number
27          Char                    Code for the residue insertion
31-38       Real(8.3)               x
39-46       Real(8.3)               y
47-54       Real(8.3)               z
55-60       Real(6.2)               Occupancy factor
61-66       Real(6.2)               Temperature factor


#! /usr/bin/perl -w
# Read and write a PDB file
###################################################################

my @number;
my @element;
my @residue;
my @chain;
my @resnum;
my @x;
my @y;
my @z;
my @bf;
my @bf1;
my @bf2;
my $rc = 0;
my $recordtype;
my $nrecordtype;
my $resname;
my $j;
my $chainn;
#
# Open PDB file
#
unless (open(PDBF,"<","test.pdb")) {
    print "Couldn't open the pdb file.\n";
    exit;
}

# Read all the lines of the file, then close it
my(@atomrecord) = <PDBF>;
close(PDBF);

foreach my $record (@atomrecord) {
    $recordtype = substr($record, 0,6); # columns 1-6
    if ($recordtype =~ /^ATOM/ || $recordtype =~ /^HETATM/ ) {
        $number[$rc] = 1*substr($record, 6, 5); # columns 7-11
        $element[$rc] = substr($record, 11, 5); # columns 12-16
        $residue[$rc] = substr($record, 16, 4); # columns 17-20
        $chain[$rc] = substr($record, 20, 2); # columns 21-22
        $resnum[$rc] = 1*substr($record, 22, 4); # columns 23-26
        $x[$rc] = 1.0*substr($record, 30, 8); # columns 31-38
        $y[$rc] = 1.0*substr($record, 38, 8); # columns 39-46
        $z[$rc] = 1.0*substr($record, 46, 8); # columns 47-54
        $rc++;
    }
}

Writing Formatted Output 

The printf command

Syntax: printf FILEHANDLE FORMAT, LIST

EXAMPLE

my $first = '3.14159265';
my $second = 76;
my $third = "Ciao!";
printf STDOUT "A float: %6.4f An integer: %-5d and a string: %s\n", $first, $second, $third;

This code snippet prints the following:

A float: 3.1416 An integer: 76    and a string: Ciao!
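
The effect of the width and precision fields is easier to see when the formatted text is stored in a variable. This sketch uses sprintf, the built-in function that takes the same format as printf but returns the string instead of printing it; the helper format_coordinate and the vertical bars that mark the field borders are choices made only for this example.

```perl
use strict;
use warnings;

# Format one coordinate with the Real(8.3) layout:
# 8 characters wide, 3 digits after the decimal point.
sub format_coordinate {
    my ($value) = @_;
    return sprintf("%8.3f", $value);
}

# %-6s: left-justified string, %5d: right-justified integer
my $line = sprintf("%-6s|%5d|%8.3f|%s", "ATOM", 42, 3.14159265, "Ciao!");
print "$line\n";
print "[", format_coordinate(1.5), "]\n";
```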

EXAMPLE

This example shows how to use the printf command to write a formatted PDB file.


$nrecordtype= 'HETATM';
$resname = "LIG";
$chainn = 'Z';

unless (open (PDB1, ">", 'out1.pdb')){
    print "Couldn't open the output file.\n";
    exit;
}

for ($j = 0; $j < $rc; $j++) {

#
# Print PDB file: record name (columns 1-6), atom number (7-11),
# atom name (12-16), residue name (17-20), chain (21-22),
# residue number (23-26), four blanks (27-30), then x, y, z
#

printf PDB1 "%-6s%5u%-5.5s%4.4s%2s%4d    %8.3f%8.3f%8.3f\n",$nrecordtype,$number[$j],$element[$j],$resname, $chainn,$resnum[$j],$x[$j],$y[$j],$z[$j];
}
close (PDB1);

 

Associative Array (Hash) Variables

A hash is essentially an array, except that it is indexed by user-defined keys rather than by nonnegative integers. A hash should be used, instead of an array, when one wishes to set up a lookup table to store key-value pairs. Entire hashes are denoted by ‘%’.

EXAMPLES

  • %aacids = (key1, val1, key2, val2 …);
  • %aacids = (key1=>val1,key2=>val2);
  • %aacids = ("Ala", "Alanine", "Gly", "Glycine" …);
  • $aacids{'Ala'} = "Alanine";
  • %aacids = ("Ala" => "Alanine", "Gly" => "Glycine", …);

Example of the use of Hash Arrays

#! /usr/bin/perl -w
# define hash with base name
%DNAbases = ('A', 'Adenine', 'G', 'Guanine', 'C', 'Cytosine', 'T', 'Thymine');
# Input Sequence
@DNA = ('T','A','G','C');
# Write the composition
$i=0;
while ($i < @DNA) {
  printf "%s %s\n",$DNAbases{$DNA[$i]}, $DNA[$i];
  $i++;
}
exit;
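
A slightly more defensive variant of the same lookup is sketched below. The subroutine base_names and the "unknown" label are assumptions made for this example; exists tests whether a key is present in the hash, and keys returns the list of all the keys.

```perl
use strict;
use warnings;

my %DNAbases = (A => "Adenine", G => "Guanine", C => "Cytosine", T => "Thymine");

# Translate a sequence into a list of base names;
# symbols that are not keys of the hash are reported as "unknown".
sub base_names {
    my ($seq) = @_;
    my @names;
    foreach my $base (split(//, uc $seq)) {
        push(@names, exists $DNAbases{$base} ? $DNAbases{$base} : "unknown");
    }
    return @names;
}

print join(", ", base_names("TAGC")), "\n";

# keys returns the hash keys in no particular order, so they are sorted here.
foreach my $base (sort keys %DNAbases) {
    print "$base => $DNAbases{$base}\n";
}
```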

SUBROUTINES 

A subroutine is a section of a program that performs a particular task. The term subroutine is synonymous with procedure, function, and routine. Programs organized in subroutines have several advantages:

  • Shorter programs, since you’re reusing the code.
  • Easier to test, since you can test the subroutine separately.
  • Easier to understand, since it reduces clutter and better organizes programs.
  • More reliable, since you have less code when you reuse subroutines, so there are fewer opportunities for something to go wrong.
  • Faster to write, since you may, for example, have already written some subroutines that handle basic statistics and can just call the one that calculates the mean without having to write it again. Or better yet, you found a good statistics library someone else wrote, and you never had to write it at all.

To declare subroutines

sub NAME {
my($var,$var1,…) = @_;
return ($result);
}

# alternatively
sub NAME {
my($var) = $_[0];
my($var1) = $_[1];
return ($result);
}

Local variables in the subroutines are defined using the statement:

my($var,@var1);

EXAMPLE


$dna = "AAAA";
$result = A_to_T($dna);
print "I changed all the A's in $dna to T's and got $result\n\n";
exit;

###################################################
# Subroutine
###################################################
sub A_to_T {
my ($input) = @_;
$dna = $input;
$dna =~ s/A/T/g;
return $dna;
}

Since the subroutine modifies the global variable $dna instead of a local copy declared with my, the program outputs the following message:

I changed all the A's in TTTT to T's and got TTTT
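
A sketch of the same example with the working copy declared as a local variable. The only change is the use of my (together with strict and warnings, which are added here as a precaution), so the variable of the main program is no longer touched.

```perl
use strict;
use warnings;

sub A_to_T {
    my ($input) = @_;
    my $dna = $input;      # local copy: the caller's variable is untouched
    $dna =~ s/A/T/g;
    return $dna;
}

my $dna = "AAAA";
my $result = A_to_T($dna);
print "I changed all the A's in $dna to T's and got $result\n\n";
```

With this version the message reports AAAA as the original sequence and TTTT as the result.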

Calling subroutines

 NAME (LIST);

The & is optional with parentheses.

 NAME LIST;

Parentheses optional if predeclared/imported.

&NAME (LIST);

Circumvents prototypes.

&NAME;

Makes current @_ visible to the called subroutine.

EXAMPLE

 The argument being passed in is $DNA; the result is saved in $cDNA

$cDNA = ComplementDNA($DNA);

EXAMPLE

# main program

$seq1="aacctgaatg";
$seq2="atcgtgagtg";
print percent_identity($seq1, $seq2), "\n";
exit;

# Subroutine
# (the two sequences are assumed to have the same length)

sub percent_identity {
   my $seq1 = $_[0];
   my $len1 = length $seq1;
   my $seq2 = $_[1];
   my $len2 = length $seq2;
   my $num_mismatches = 0;
   for my $i (0..$len1-1) {
     if (substr($seq1, $i,1) ne substr($seq2, $i, 1)) {
       $num_mismatches++;
     }
   }
   # identity = percentage of matching positions
   return (($len1-$num_mismatches)*100/$len1);
}

Pass-by-value with arrays


#!/usr/bin/perl -w
# Example of problem of pass-by-value with two arrays

use strict;
my @i = ('1', '2', '3');
my @j = ('a', 'b', 'c');

print "In main program before calling subroutine: i = " . "@i\n";
print "In main program before calling subroutine: j = " . "@j\n";

my @res=reference_sub(@i, @j);

print "In main program after calling subroutine: i = " . "@i\n";
print "In main program after calling subroutine: j = " . "@j\n";
exit;

#################################
############ Subroutine #########
#################################

sub reference_sub {
# @_ is a single flat list: @i takes all six elements and @j stays empty
my (@i, @j) = @_;
print "In subroutine : i = " . "@i\n";
print "In subroutine : j = " . "@j\n";
}

Pass-by-reference

To pass a parameter by reference, you have to preface the name of the parameter with a backslash. \@i is a reference to the array @i. In the subroutine, $i gets the value of \@i, so it is also a reference to the array @i. When argument variables are passed in this way, changes to their values in the subroutine also affect the values of the arguments in the main program.

 EXAMPLE

#!/usr/bin/perl

# Example of pass-by-reference (a.k.a. call-by-reference)

use strict;
use warnings;

my @i = ('1', '2', '3');
my @j = ('a', 'b', 'c');
print "In main program before calling subroutine: i = " . "@i\n";
print "In main program before calling subroutine: j = " . "@j\n";
reference_sub(\@i, \@j);
print "In main program after calling subroutine: i = " . "@i\n";
print "In main program after calling subroutine: j = " . "@j\n";
exit;

#####################################
##########  Subroutine     ##########
######################################

sub reference_sub {
my ($i, $j) = @_;
print "In subroutine : i = " . "@$i\n";
print "In subroutine : j = " . "@$j\n";
# push and shift are built-in functions on arrays
push(@$i, '4');
shift(@$j);
}
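
The difference between the two calling styles can be condensed in one sketch. The two subroutines are hypothetical helpers written for this comparison: the first shows how two arrays passed by value are flattened into a single list, the second shows that references keep the arrays separate and allow the caller's data to be modified.

```perl
use strict;
use warnings;

# Pass-by-value: both arrays arrive flattened into the single list @_.
sub count_flattened {
    my (@first, @second) = @_;     # @first takes everything, @second stays empty
    return (scalar(@first), scalar(@second));
}

# Pass-by-reference: each array keeps its identity and can be modified.
sub append_to_first {
    my ($first, $second) = @_;
    push(@$first, @$second);
    return scalar(@$first);
}

my @i = ('1', '2', '3');
my @j = ('a', 'b', 'c');

my ($n_first, $n_second) = count_flattened(@i, @j);
print "by value: first = $n_first, second = $n_second\n";

append_to_first(\@i, \@j);
print "by reference: i = @i\n";
```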

 

 

PERL Resources

  • perl.com (O’Reilly site, the source for Perl)
  • perl.org (Official distribution site)
  • cpan.org (Comprehensive Perl Archive Network)
  • bioperl.org (Bioperl project web site)
  • activestate.com/Products/ActivePerl (“The industry-standard Perl distribution for Linux, Solaris, and Windows. ActivePerl contains the Perl language, the Perl Package Manager, (for installing CPAN packages), and complete online help.”)

Tutorial

http://learn.perl.org

Online-documentation/handbook

http://www.perl.com/pub/v/documentation

Perl help from Linux shell

perldoc, man perl, perl-html pages

Books on Perl

  1.  Larry Wall and Randal L. Schwartz. Programming Perl. O’Reilly Media.
  2.  Sriram Srinivasan. Advanced Perl Programming. O’Reilly Media.
  3.  James Tisdall. Beginning Perl for Bioinformatics. O’Reilly Media.
  4.  James Tisdall. Mastering Perl for bioinformatics. O’Reilly Media.
  5.  Michael Moorhouse, Paul Barry. Bioinformatics Biocomputing and Perl. Wiley.
  6.  Nathan Torkington and Tom Christiansen. Perl cookbook. O’Reilly Media.
  7.  Randal L. Schwartz. Learning Perl. O’Reilly Media.
