Basic performance parameters of the SCIT-1 and SCIT-2 systems

Parameter                                         SCIT-1              SCIT-2
1. Processors                                     P-IV Xeon 2.67 GHz  Itanium2 1.4 GHz
2. Peak performance of a single processor:
   Integer operations per second, 10^9 IPS        1.34                5.6
   Floating-point operations per second, GFLOPS   5.34                5.6
   Node system bus performance, GB/s              4.2                 6.4
3. Total peak performance of the system:
   Integer operations per second, 10^9 IPS
   Floating-point operations per second, GFLOPS
   Total system bus performance, GB/s             67.2                204.8
4. Linpack performance of the system, GFLOPS      112.5
The performance characteristics of the developed SCIT-1 and SCIT-2 systems are comparable with those of the world's best systems, and the machines rank among the world's best in the construction of mathematical supercomputers.
The creation of the SCIT-1 and SCIT-2 cluster systems, their integration, and their final launch were possible thanks to the fruitful cooperation of the Glushkov Institute of Cybernetics of the NAS of Ukraine, the Kyiv-based USTAR scientific and manufacturing company, and the Intel Corporation (International). The institute's partners provided technical support and consulting for the project, as shown in Fig. 8.17.
The components of the system-level software of the cluster support all stages of user-level parallel software development and provide for the execution of processing-intensive problems. They run on all the nodes of the cluster as well as on the control node. The operating system used is Linux. The Message Passing Interface (MPI), rather than the SCI interconnect's native interface, is used for programming in the message-passing model. The system-level software also includes optimized compilers for the C, C++, and Fortran languages with support for parallel programming, fast math libraries, etc. The powerful hardware and the system-level, service, and cluster-specific software integrated in the system provide a strong foundation for application-level software development.
Choosing Architecture Features for the Supercomputer Project. Based on the above-mentioned specific properties of cluster problems, it is possible to state generalized requirements for a cluster node that is to solve parallel problems effectively:
- the productivity of the node must depend linearly on the power of the processor, and the productivity of the processor on the frequency of the main memory bus used and on the amount of main memory accessible in the node (to some reasonable degree);
- interprocessor data exchange must be faster than exchange over the interconnect, i.e., it is preferable to use multiprocessor nodes (with 2-4 processors) and multicore processors;
- the productivity of a node must depend on the interconnect used; two important features here are latency (the delay arising in the transmission of a minimal packet between nodes) and maximal bandwidth, both of which can be measured with a ping-pong test such as the sketch after this list;
- the productivity of a node must depend on the intensity of input-output operations with storage devices.
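To make the two interconnect figures concrete, the following minimal MPI ping-pong sketch measures them between two nodes. This is an illustrative example rather than SCIT code; the repetition count and the 1 MiB payload are arbitrary choices, and a production benchmark would add warm-up rounds.

/* Minimal MPI ping-pong: one-way latency from a zero-byte round trip,
 * bandwidth from a 1 MiB round trip. Run with at least two processes,
 * e.g. mpirun -np 2 ./pingpong; extra ranks simply idle. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    const int big  = 1 << 20;          /* 1 MiB payload for bandwidth */
    char *buf = malloc(big);

    /* Latency: round trips of an empty message between ranks 0 and 1. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double lat = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way seconds */

    /* Bandwidth: round trips of the 1 MiB message. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, big, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, big, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, big, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, big, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double bw = (2.0 * reps * big) / (MPI_Wtime() - t0) / 1e9;  /* GB/s */

    if (rank == 0)
        printf("one-way latency: %.2f us, bandwidth: %.3f GB/s\n",
               lat * 1e6, bw);
    free(buf);
    MPI_Finalize();
    return 0;
}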
Pipeline and system calls. As a rule, cyclic algorithms are not used for the parallel problems executed by a computing node. Therefore a classic architecture with a short pipeline, used in AMD processors, is much preferable to the Intel P4 processor architecture. Every reference to the data of a neighboring process is accompanied by a few transitions into the kernel mode of the processor. The price of such a transition is a drop of processing speed by a factor of 120-240 in AMD processors and 1100-1300 in Intel P4 processors.
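The magnitude of such a penalty can be estimated with a micro-benchmark like the sketch below, which compares a plain user-mode addition with a minimal real system call on Linux. It only illustrates how the ratio is measured; the actual figure depends heavily on the CPU and kernel and will not reproduce the numbers above.

/* Rough estimate of kernel-transition cost: a user-mode add versus a
 * minimal real system call (syscall(SYS_getpid) bypasses any library
 * caching of getpid()). Linux-specific, illustrative only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    const long reps = 1000000;
    struct timespec t0, t1;
    volatile long acc = 0;

    /* User-mode work: a simple add, no kernel transition. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < reps; i++)
        acc += i;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double user_ns = elapsed_ns(t0, t1) / reps;

    /* Kernel transition: one minimal system call per iteration. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < reps; i++)
        syscall(SYS_getpid);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sys_ns = elapsed_ns(t0, t1) / reps;

    printf("user add: %.1f ns, syscall: %.1f ns, ratio: %.0fx\n",
           user_ns, sys_ns, sys_ns / user_ns);
    return 0;
}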
HyperThreading. Recall that HyperThreading is a technology that emulates two (or more) processors within the same processor core. As a rule, only one user application runs at a time; however, when at least two processes (i.e., fragments of machine code designed to solve a problem) are run, HyperThreading lets each successive process run on a virtual processor realized within the emulating physical processor. Multicore technology, in contrast, provides processors that are not virtual (or applies the emulation additionally within each of several cores). Because one of the pipelines idles on an incorrectly predicted branch, or instructions simply cannot be executed in parallel in the P4 architecture, the idling resources can be used as a virtual processor (HyperThreading); for parallel problems, however, this only reduces productivity. The explanation is simple: data exchange between nodes aligns the productivity of all processes to that of the slowest one, and since a virtual processor offers no more than 40% of a real processor, the overall productivity falls 2-3 times. This feature is therefore practically useless for clusters.
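One common countermeasure is to pin each branch of a parallel program to its own physical core so that no branch lands on a HyperThreading sibling. The sketch below (pin_to_physical_core is a hypothetical helper, Linux-specific) assumes that logical CPUs 0..P-1 sit on distinct physical cores with the HT siblings numbered above them; that mapping is platform-dependent and should be verified against /proc/cpuinfo or a tool such as hwloc before use.

/* Hypothetical helper: pin the calling process to one assumed physical
 * core. ASSUMPTION: logical CPUs 0..physical_cores-1 map to distinct
 * physical cores; verify this on the target machine. */
#define _GNU_SOURCE
#include <sched.h>

int pin_to_physical_core(int worker_id, int physical_cores)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(worker_id % physical_cores, &set);       /* one worker per core */
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = this process */
}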
At present, the productivity of the existing SCIT cluster systems is sufficient only for the simultaneous execution of a few problems. Therefore, to meet the current needs of the institutes of the NAS of Ukraine, it is necessary to raise the productivity of the supercomputer systems ten-fold or more (up to a few teraflops).
Features of a programmable interface for cluster systems. Local networks used as multiprocessor computer systems are called clusters of workstations, or simply clusters, as is any special assembly of computers interconnected to solve a single task. Local networks assembled specifically for use as a multiprocessor computer system and placed compactly in one or a few cabinets are called clusters of dedicated workstations. For cluster systems, the library of the Message Passing Interface (MPI) standard, an interface for message passing, is the basic tool of concurrent programming [46, 50, 53].
A parallel program initially contains the code for all branches, but the loader starts the specified number of copies of the program. Each copy of the program determines its own sequence number and, depending on this number and the overall size of the calculation field, executes one branch of the algorithm or another. As each branch has its own information space, fully isolated from the other branches, the branches exchange information only by passing messages in the operating concurrent programming environment. Note that an MPI application is started by the control workstation; therefore the executable file of the MPI application must be accessible on every computing node at the same absolute path as on the control workstation.
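This scheme is easy to see in a minimal sketch (an illustrative example, not SCIT application code): every started copy of the program obtains its own sequence number and the total number of branches, then branches on them.

/* SPMD branching: all copies run this same program; the rank decides
 * which branch of the algorithm each copy executes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this copy's sequence number */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of copies */

    if (rank == 0) {
        /* The control branch, e.g. distributing work and gathering results. */
        printf("master: %d branches started\n", size);
    } else {
        /* Worker branches: isolated address spaces, cooperating only
         * by passing messages. */
        printf("worker %d of %d\n", rank, size);
    }
    MPI_Finalize();
    return 0;
}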
MPI, as a programming tool providing connections between the branches of parallel applications, offers a single mechanism for the cooperation of branches in a parallel application regardless of the machine architecture (uniprocessor or multiprocessor, with shared or separate memory) and of the user software interface. MPI procedures are subdivided into: general procedures providing the initiation and completion of processes, as well as service functions; procedures for message passing and reception; procedures for collective inter-process communication; procedures for process synchronization; and procedures for working with groups of processes. A process is the execution of a program on one processor on which MPI is installed, irrespective of whether the program internally contains parallel branches or I/O operations or is just sequential program code.
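A short sketch touching each of these procedure families may make the classification concrete (illustrative only; it assumes at least two started processes).

/* One call from each MPI procedure family named above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                          /* general: initiation */
    int rank, n = 0, sum = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);            /* service function */

    if (rank == 0) n = 42;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* collective */

    if (rank == 1)
        MPI_Send(&n, 1, MPI_INT, 0, 7, MPI_COMM_WORLD);   /* point-to-point */
    else if (rank == 0)
        MPI_Recv(&n, 1, MPI_INT, 1, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Barrier(MPI_COMM_WORLD);                     /* synchronization */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks: %d\n", sum);

    MPI_Finalize();                                  /* general: completion */
    return 0;
}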
The basic difference between the MPI standard and its predecessors is the concept of a communicator. The communicator determines the context of message passing: messages sent via different communicators do not influence one another and do not interact. All synchronization and message-passing operations are localized within a communicator. A group of processes is associated with the communicator; the group is a collection of processes, each of which has a unique identifier for cooperating with the other processes of the group by means of the group's communicator. In particular, all collective operations are called simultaneously on all processes of the group. The group's communicator realizes information exchanges between processes and their synchronization; in effect, the communicator serves as the communication environment for the given group of processes, and each group of processes uses a separate communicator. Processes in a group usually have successive numbers from 0 to n-1, where n is the number of processes in the group. However, the MPI logic allows other numerations to be introduced for the branches of a parallel program, so the same branch can have a different number in the different numerations belonging to different communicators. As communications between processes are encapsulated in the communicator, libraries of parallel programs may be created on the basis of MPI. The MPI library also allows each function to have a second name; if a function is already bound under one name, the second name is temporarily ignored. This mechanism is known as weak symbols.
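The following sketch shows the communicator concept in miniature (the parity-based split is an arbitrary illustrative choice): MPI_Comm_split derives a separate communicator for each group, and the same process receives a different rank in each communicator's own numeration.

/* Each process belongs to MPI_COMM_WORLD and to one derived group
 * communicator, with a different rank in each. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split the processes into two groups by parity of their world rank;
     * the third argument orders the ranks inside each new communicator. */
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &group_comm);

    int group_rank;
    MPI_Comm_rank(group_comm, &group_rank);
    printf("world rank %d -> rank %d in group %d\n",
           world_rank, group_rank, world_rank % 2);

    /* Collectives on group_comm involve only that group's processes. */
    int group_sum = 0;
    MPI_Allreduce(&world_rank, &group_sum, 1, MPI_INT, MPI_SUM, group_comm);

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}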