1.4 Run-Time Environment
In the HP XC environment, LSF-HPC, SLURM, and HP-MPI work together to provide a
powerful, flexible, and extensive run-time environment. This section describes LSF-HPC,
SLURM, and HP-MPI, and how these components work together to provide the HP XC
run-time environment.
1.4.1 SLURM
SLURM (Simple Linux Utility for Resource Management) is a resource management system
that is integrated into the HP XC system. SLURM is suitable for use on both large and small
Linux clusters. It was developed by Lawrence Livermore National Laboratory and Linux NetworX.
As a resource manager, SLURM allocates exclusive or non-exclusive access to resources
(application/compute nodes) for users to perform work, and provides a framework to start,
execute, and monitor work (normally a parallel job) on the set of allocated nodes. A SLURM
system consists of two daemons, one configuration file, and a set of commands and APIs. The
central controller daemon, slurmctld, maintains the global state and directs operations. A
slurmd daemon is deployed to each compute node and responds to job-related requests,
such as launching, signaling, and terminating jobs. End users and system software (such
as LSF-HPC) communicate with SLURM through commands or APIs to, for example,
allocate resources, launch parallel jobs on allocated resources, and kill running jobs.
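As a brief sketch of this command-level interface (the node count, job ID, and program name below are hypothetical), a user might run:

```
# Launch hostname on four compute nodes via the slurmd daemons
srun -N 4 hostname

# List queued and running jobs known to slurmctld
squeue

# Terminate a running job (job ID 42 is hypothetical)
scancel 42
```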
SLURM groups compute nodes (the nodes where jobs are run) together into partitions. The
HP XC system can have one or several partitions. When HP XC is installed, a single partition
of compute nodes is created by default for LSF batch jobs. The system administrator has the
option of creating additional partitions. For example, another partition could be created for
interactive jobs.
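For example, the sinfo command reports the configured partitions. The output below is an illustrative sketch, assuming a default lsf partition of four idle compute nodes; actual partition names, node counts, and node names depend on the site configuration:

```
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
lsf*         up   infinite      4   idle  n[1-4]
```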
1.4.2 Load Sharing Facility (LSF-HPC)
The Load Sharing Facility for High Performance Computing (LSF-HPC) from Platform
Computing Corporation is a batch system resource manager that has been integrated with
SLURM for use on the HP XC system. LSF-HPC for SLURM is included with the HP XC
System Software and is an integral part of the HP XC environment. LSF-HPC interacts with
SLURM to obtain and allocate available resources, and to launch and control all the jobs
submitted to LSF-HPC. LSF-HPC accepts, queues, schedules, dispatches, and controls all the
batch jobs that users submit, according to policies and configurations established by the HP
XC site administrator. On an HP XC system, LSF-HPC for SLURM is installed and runs on
one HP XC node, known as the LSF-HPC execution host.
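As an illustrative sketch (the script name, processor count, and output file are hypothetical), a batch job is submitted to LSF-HPC with the bsub command, and LSF-HPC then queues and dispatches it according to site policy:

```
# Submit a batch job requesting 4 processors; stdout is written to out.txt
$ bsub -n 4 -o out.txt ./my_batch_script
```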
A complete description of LSF-HPC is provided in Chapter 7. In addition, for your convenience,
the HP XC documentation CD contains LSF Version 6.0 manuals from Platform Computing.
1.4.3 How LSF-HPC and SLURM Interact
In the HP XC environment, LSF-HPC cooperates with SLURM to combine LSF-HPC’s
powerful scheduling functionality with SLURM’s scalable parallel job-launching capabilities.
LSF-HPC acts primarily as a workload scheduler on top of the SLURM system, providing
policy- and topology-based scheduling for end users. SLURM provides an execution and
monitoring layer for LSF-HPC. LSF-HPC uses SLURM to detect system topology information,
make scheduling decisions, and launch jobs on allocated resources.
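This division of labor can be sketched with a single hypothetical submission: LSF-HPC schedules the job and obtains the node allocation from SLURM, and the srun-based launcher inside the job then starts the processes on the allocated nodes (the process count and application name are hypothetical):

```
# LSF-HPC schedules the job and allocates nodes from the SLURM lsf partition;
# mpirun -srun then launches the MPI processes on the allocated nodes
$ bsub -n 8 mpirun -srun ./my_mpi_app
```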
When a job is submitted to LSF-HPC, LSF-HPC schedules the job based on job resource
requirements and communicates with SLURM to allocate the required HP XC compute nodes
for the job from the SLURM lsf partition. LSF-HPC provides node-level scheduling for
parallel jobs, and CPU-level scheduling for serial jobs. Because of node-level scheduling, a
parallel job may be allocated more CPUs than it requested, depending on its resource request;
the srun or mpirun -srun launch commands within the job still honor the original CPU