The recent technical developments in parallel computing have made available large parallel computing facilities at relatively contained costs that have stimulated an intense developing activity to realize efficient parallel programs for the scientific and technical calculation. One of the scientific fields that most benefit from these developments is computational chemistry and in particular Molecular Dynamics (MD). The reason of that is related to the increasing interest in the study, with this technique, larger molecular systems for long simulation times.

## The parallel computing

The term parallel computing refers to a calculation approach that distributes the task to solve a particular computational problem by dividing and distributing the calculation task to different processors operating at the same time. The computers in which this type of calculation is executed are named in different ways: parallel computers, computer clusters, multiprocessors.

Each of these names suggests a particular type of computer architecture. In this context, we will use the general name parallel computer in order to indicate a computer having more processors. Every processing unit will be called *computing node*. The parallel computers can be classified, based on the way by which the instructions and the data are elaborated, in two fundamental computer architecture:

- single instruction stream multiple data stream, (SIMD),
- multiple instruction streams multiple data stream, (MIMD).

SIMD computers are composed of several nodes that are synchronized by a common clock and executing at the same moment the same instruction but on different data distributed in the memory of the single nodes. On the contrary, MIMD computers are based on a more general and flexible architecture.

On the contrary, MIMD computers are based on a more general and flexible architecture. Normally, they are constituted by more complex synchronized processors that can execute independently the own task on the distributed data and communicate with the other processor when it is necessary. In a general, these two types of architectures can be constituted by hundred to thousands of processors each one having its own local memory. The simplicity of SIMD architecture provides, at parity of number and power of the processors, higher performances than MIMD architecture. However, it involves some technical limitations (for example the impossibility to address on some SIMD computer, to address locally the index of arrays) renders programming more difficult.

Several algorithms for molecular dynamics simulations optimized to run on MIMD architectures having from few tens of processors to hundreds or thousands of processors have been proposed. These algorithms are nowadays implemented in packages for molecular dynamics simulation of large biomolecular systems, as Gromacs, NAMD, AMBER, CHARMM, and GROMOS. There are also several MD packages for the simulation of complex molecular systems that can be run on SIMD architectures.

## Parallel algorithm for MD simulations

In MD programs, the calculation of the non-bonded forces is the most demanding part.

In fact, it can take, depending on the method used, until 90 % of the total calculation time. For this reason, in programming a code for MD simulation running on a parallel calculator, it is necessary to invent efficient algorithms that can speed up the calculation of these interactions.

A bottleneck in developing efficient parallel computer program are the communication between the nodes. A good computer program should reduce as possible the number of communications to get better performances.

Two are the possible algorithm strategies to tackle with the problem of the non-bonded interactions.

### Spatial decomposition of the simulated system.

In order to carry out the spatial decomposition of the molecular system then it is necessary partitioning the atoms in the simulation box between the rectangular sub-boxes (equally divided among the nodes that are being used), containing a similar number of atoms (Figure I).

**Figure I:** *Subdivision of the molecular system in boxes having a similar atom number.*

The atoms contained in each subbox are assign to a corresponding computer processor that will calculate the interactions between them. In the next step, adjacent processors communicate between them the information concerning atoms that are located in the edges of the respective subboxes. The partition of the atoms

in each subbox according to the position with respect the subbox border, substantially reduce the number of data trasfer among the processors improving the performance of the calculation.

### Data Decomposition method

The data decomposition consists in dividing the information of the simulated system between the processors inrespective of the actual spatial position of the single atoms. This approach is far less efficient than the spatial decomposition. However is very simple to implement and on shared memory computer can still perform very well.

The *systolic loop* (SL) method is one of the most efficient methods to calculate non-bonded interactions on both SIMD and MIMD systems using data decomposition approach. In Figure II, the principle of the method is schematically represented. The *P* nodes of the computer are progressively numbered and the communication occur at each cicle of the process between the next atom in the sequence. At each node of the computer, a group of atoms (*stationary* group) is assigned. In each phase of the loop, nodes calculate the interactions between the stationary group of atoms and a group of atoms that has been passed from the previous node (*transient*$ group).

After the calculation is finished, the information (position, velocities, forces) of the transient group are trasfered to the next node in the sequence, and a new calculation phase takes place. Since also the forces are transferred between the nodes than it not necessary to transfer the transient groups of atoms to all nodes of the computer, but for the Newton’s third law, it is sufficient to perform *P/2 *communications if *P* is even. This strategy of parallelization has been adopted also in the Gromacs program.

**Figure II.** *The systolic loop. Each square represents a node. The number in the lower square indicates the stationary group of atoms in that node while the number on the top of the square indicates the transient group of atoms. The number in parenthesis on the right indicates the pahse of the loop. The nodes marked with “X” in the cycle are forced to remain inactive in order to avoid to duplicate the calculation of the interactions already carried out in the first three nodes.*

## Developing of a parallel MD program on the APE computers

I have contributed to the development a cutoff based algorithm for the efficient computation of non-bonding interactions on single instruction memory distributed (SIMD) parallel computers [1]. The algorithm contains several innovative solutions allowing the use of non-bonded pair list method on this type of machines. The algorithm was implemented as hybrid code on the parallel computer Quadrics (a commercial version of APE 100 computer, developed by the Instituto Nazionale di Fisica Nucleare) using the TAO language and on a SUN SPARC in FORTRAN (using the MD package GROMOS87).

#### THE TAO CODE OF THE PROJECT

Two examples of subroutines used in the program developed for the project are reported to givean idea about the TAO language.

!!################################################################# subroutine read_coord() !!################################################################# !! !! Read atomic coordinates from file "simcoor.dat" !! open for reading "simcoor.dat" as 6 read from 6 [0,0,0][7,3,3] multidata att[0].x:nread1 close 6 open for reading "simcoor_cm.dat" as 6 read from 6 [0,0,0][7,3,3] multidata att_cm[0].x:nread1 close 6 do n=0,natl-1 atl[n]=att[n] atl_cm[n]=att_cm[n] enddo end !!################################################################## subroutine run_md_l() !!################################################################## !! !! Perform MD simulation using long cut-off !! register mpoint_nul point1 register point point2 !! !! Make zero the elements of some arraies and records !! itemp=0 while (itemp<natl-20) extract point1 from m_nul.[0] replace point1 into force[itemp] replace point1 into force_t[itemp] itemp=itemp+21 endwhile while (itemp<natl) point2=point_nul force[itemp]=point2 force_t[itemp]=point2 itemp=itemp+1 endwhile !! !!#################### Compute interactions !! !! !! First loop !! ifase=0 natf=natl-2 call calc_forces_noGC() natf=natl-1 do I_loop=0,3 call make_loop_l() call calc_forces_noGC() enddo !! !! Shifts & Loops !! I_loop=12 call make_loop_l() call calc_forces_noGC() do I_loop=4,11 call make_loop_l() call calc_forces_noGC() enddo !! !! Reordering Shifts !! I_loop=13 call make_loop_l() !! !! Make the sum of transient data with the local ones !! call sum_forces() end

### REFERENCE

**Roccatano**, R. Bizzarri, G. Chillemi, N. Sanna, A. Di Nola.*Development of a parallel molecular dynamics code on SIMD computers: An algorithm for the use of the pair list criterion.*J. Comp. Chem.,**19**, 685-694 (1998).