Cray XC40 (Conrad)
User Guide
Table of Contents
- 1. Introduction
- 1.1. Document Scope and Assumptions
- 1.2. Policies to Review
- 1.2.1. Login Node Abuse Policy
- 1.2.2. Workspace Purge Policy
- 1.3. Obtaining an Account
- 1.4. Requesting Assistance
- 2. System Configuration
- 2.1. System Summary
- 2.2. Processors
- 2.3. Memory
- 2.4. Operating System
- 2.5. File Systems
- 2.5.1. /p/home/
- 2.5.2. /p/work1/
- 2.5.3. /p/cwfs/
- 2.6. Peak Performance
- 3. Accessing the System
- 3.1. Kerberos
- 3.2. Logging In
- 3.3. File Transfers
- 4. User Environment
- 4.1. User Directories
- 4.1.1. Home Directory
- 4.1.2. Work Directory
- 4.1.3. Center Directory
- 4.2. Shells
- 4.3. Environment Variables
- 4.3.1. Login Environment Variables
- 4.3.2. Batch-Only Environment Variables
- 4.4. Modules
- 4.5. Archive Usage
- 4.5.1. Archival Command Synopsis
- 5. Program Development
- 5.1. Programming Models
- 5.1.1. Message Passing Interface (MPI)
- 5.1.2. Shared Memory (SHMEM)
- 5.1.3. Open Multi-Processing (OpenMP)
- 5.1.4. Hybrid Processing (MPI/OpenMP)
- 5.1.5. Partitioned Global Address Space (PGAS)
- 5.1.6. Accelerated Processing (Offload, Native, Native/Symmetric)
- 5.2. Available Compilers
- 5.2.1. Cray Compiler Environment
- 5.2.2. Portland Group (PGI) Compiler Suite
- 5.2.3. Intel Compiler Environment
- 5.2.4. GNU Compiler Collection
- 5.3. Relevant Modules
- 5.4. Libraries
- 5.4.1. Cray LibSci
- 5.4.2. Intel Math Kernel Library (MKL)
- 5.4.3. Additional Math Libraries
- 5.5. Debuggers
- 5.5.1. TotalView
- 5.5.2. DDT
- 5.5.3. GDB
- 5.6. Code Profiling and Optimization
- 5.6.1. CrayPat
- 5.6.2. Additional Profiling Tools
- 5.6.3. Program Development Reminders
- 5.6.4. Compiler Optimization Options
- 5.6.5. Performance Optimization Methods
- 6. Batch Scheduling
- 6.1. Scheduler
- 6.2. Queue Information
- 6.3. Interactive Logins
- 6.4. Interactive Batch Sessions
- 6.5. Cluster Compatibility Mode (CCM)
- 6.6. Batch Request Submission
- 6.7. Batch Resource Directives
- 6.8. Launch Commands
- 6.9. Sample Scripts
- 6.10. PBS Commands
- 6.11. Advance Reservations
- 7. Software Resources
- 7.1. Application Software
- 7.2. Useful Utilities
- 7.3. Sample Code Repository
- 8. Links to Vendor Documentation
- 8.1. Cray Links
- 8.2. SUSE Links
- 8.3. GNU Links
- 8.4. Portland Group (PGI) Links
- 8.5. Intel Links
- 8.6. Debugger Links
1. Introduction
1.1. Document Scope and Assumptions
This document provides an overview and introduction to the use of the Cray XC40, Conrad, located at the Navy DSRC, along with a description of the specific computing environment on Conrad. The intent of this guide is to provide information that will enable the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:
- Use of the UNIX operating system
- Use of an editor (e.g., vi or emacs)
- Remote usage of computer systems via network or modem access
- A selected programming language and its related tools and libraries
1.2. Policies to Review
Users are expected to be aware of the following policies for working on Conrad.
1.2.1. Login Node Abuse Policy
The login nodes provide login access for Conrad and support such activities as compiling, editing and general interactive use by all users. Consequently, memory- or CPU-intensive programs running on the login nodes can significantly affect all users of the system. Therefore, only small serial applications requiring less than 15 minutes of compute time and less than 8 GBytes of memory are allowed on the login nodes. Any jobs running on the login nodes that exceed these limits will be terminated.
1.2.2. Workspace Purge Policy
Close management of space in the /p/work1 file system is a high priority. Files in the /p/work1 file system that have not been accessed in 21 days are subject to the purge cycle. If available space becomes critically low, a manual purge may be run, and all files in /p/work1 are eligible for deletion. Using the touch command (or similar commands) to prevent files from being purged is prohibited. Users are expected to keep up with file archival and removal within the normal purge cycles.
Note! If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. You are responsible for monitoring your workspace to prevent data loss.
1.3. Obtaining an Account
The process of getting an account on the HPC systems at any of the DSRCs begins with getting an account on the HPCMP Portal to the Information Environment, commonly called a "pIE User Account." If you do not yet have a pIE User Account, please visit HPC Centers: Obtaining An Account and follow the instructions there. If you need assistance with any part of this process, please contact the HPC Help Desk at accounts@helpdesk.hpc.mil.
1.4. Requesting Assistance
The HPC Help Desk is available to help users with unclassified problems, issues, or questions. Analysts are on duty 8:00 a.m. - 8:00 p.m. Eastern, Monday - Friday (excluding Federal holidays).
- Web: https://helpdesk.hpc.mil
- E-mail: help@helpdesk.hpc.mil
- Phone: 1-877-222-2039 or (937) 255-0679
- Fax: (937) 656-9538
You can contact the Navy DSRC in any of the following ways for after-hours support and for support services not provided by the HPC Help Desk:
- E-mail: dsrchelp@navydsrc.hpc.mil
- Phone: 1-800-993-7677 or (228) 688-7677
- Fax: (228) 688-4356
- U.S. Mail:
Navy DoD Supercomputing Resource Center
1002 Balch Boulevard
Stennis Space Center, MS 39522-5001
For more detailed contact information, please see our Contact Page.
2. System Configuration
2.1. System Summary
Conrad is a Cray XC40. The login and compute nodes are populated with Intel Xeon E5-2698v3 (Haswell-EP) processors clocked at 2.3 GHz. Conrad uses a dedicated Cray Aries high-speed network for MPI messages and I/O traffic. Its parallel file system is managed by Lustre and is built on arrays of SAS disk drives.
Conrad has 1,699 compute nodes that share memory only on the node; memory is not shared across the nodes.
Each standard compute node has two 16-core processors that operate under a Cray Linux Environment (CLE), sharing 128 GBytes of DDR3 memory, with no user-accessible swap space.
Each large-memory compute node has two 16-core processors that operate under a Cray Linux Environment (CLE), sharing 256 GBytes of DDR3 memory, with no user-accessible swap space.
Each hybrid node on Conrad, exclusively available via the phi queue, has one 12-core processor running its own Red Hat Enterprise Linux operating system and 64 GBytes of memory, with limited user-accessible swap space. Hybrid nodes also contain one Intel Xeon Phi 5120D coprocessor, which provides 60 cores and 8 GBytes of internal memory.
Conrad is rated at 2.0 peak PFLOPS and has 2.29 PBytes (formatted) of parallel disk storage.
Conrad is intended to be used as a batch-scheduled HPC system. Its login nodes are not to be used for large computational (e.g., memory, I/O, long executions) work. All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission.
 | Login Nodes | Standard Memory Compute Nodes | Large Memory Compute Nodes | Phi Accelerated Compute Nodes |
---|---|---|---|---|
Total Nodes | 6 | 1,523 | 8 | 168 |
Operating System | SLES | Cray Linux Environment | Cray Linux Environment | Cray Linux Environment |
Cores/Node | 32 | 32 | 32 | 12 + 1 Phi (1 x 60 Phi cores) |
Core Type | Intel Xeon E5-2698v3 | Intel Xeon E5-2698v3 | Intel Xeon E5-2698v3 | Intel Xeon E5-2696v2 + Intel Xeon 5120D Phi |
Core Speed | 2.3 GHz | 2.3 GHz | 2.3 GHz | 2.4 GHz + 1.05 GHz |
Memory/Node | 256 GBytes | 128 GBytes | 256 GBytes | 64 GBytes + 8 GBytes |
Accessible Memory/Node | 240 GBytes | 125 GBytes | 246 GBytes | 63 GBytes + 7.5 GBytes |
Memory Model | Shared on node | Shared on node; distributed across cluster | Shared on node; distributed across cluster | Shared on node; distributed across cluster |
Interconnect Type | Ethernet | Cray Aries / Dragonfly | Cray Aries / Dragonfly | Cray Aries / Dragonfly |
Path | Capacity | Type |
---|---|---|
/p/home ($HOME) | 113 TBytes | Lustre |
/p/work1 ($WORKDIR) | 2.0 PBytes | Lustre |
2.2. Processors
Conrad uses the Intel Haswell E5-2698v3 64-bit processors on its login, standard memory, and large memory compute nodes. These processors are clocked at 2.3 GHz and have 16 cores per CPU, 16x32 KBytes of L1 instruction cache, 16x32 KBytes of L1 data cache, 16x256 KBytes (4 MBytes) of L2 cache, and access to a 40-MByte L3 cache that is shared among all 16 cores of the processor.
Conrad uses the Intel Ivy Bridge E5-2696v2 64-bit processors on its Phi accelerator nodes. These processors are clocked at 2.4 GHz and have 12 cores per CPU. The Phi nodes are also paired with an accelerator blade containing one 60-core Intel Xeon 5120D Phi coprocessor.
2.3. Memory
Conrad uses both shared- and distributed-memory models. Memory is shared among all the cores on a node, but is not shared among the nodes across the cluster.
Each login node contains 256 GBytes of main memory. All memory and cores on the node are shared among all users who are logged in. Therefore, users should not use more than 8 GBytes of memory at any one time.
Each standard compute node contains 125 GBytes of user-accessible shared memory.
Each large-memory compute node, available exclusively via the bigmem queue, contains 246 GBytes of user-accessible shared memory.
Each hybrid node, available exclusively via the phi queue, contains 63 GBytes of user-accessible shared memory on the standard compute portion of the node and approximately 7.5 GBytes on the Intel Xeon Phi portion of the node.
2.4. Operating System
The operating system on Conrad's login nodes is SUSE Linux Enterprise Server (SLES). The compute nodes use a reduced-functionality Linux kernel designed for computation. The combination of these two operating systems is known as the Cray Linux Environment (CLE). The compute nodes can provide access to dynamically shared objects and most of the typical Linux commands and basic functionality by including the Cluster Compatibility Mode (CCM) option in your PBS batch submission script or command. See section 6.5 for more information on using CCM.
2.5. File Systems
Conrad has the following file systems available for user storage:
2.5.1. /p/home/
This file system is locally mounted from Conrad's Lustre file system and has a formatted capacity of 113 TBytes. All users have a home directory located on this file system which can be referenced by the environment variable $HOME.
2.5.2. /p/work1/
This file system is locally mounted from Conrad's Lustre file system and is tuned for parallel I/O. It has a formatted capacity of 2.0 PBytes. All users have a work directory located on this file system which can be referenced by the environment variable $WORKDIR. This file system is not backed up. Users are responsible for making backups of their files to the archive server, Newton, or to some other local system.
Raid/Striping Concerns for Large Files
It is important to note that the /p/work1 file system is a parallel, striped file system. This means that as files are written, they are automatically divided into chunks and written across multiple disk sets, or "OSTs," simultaneously. This process, called "striping," plays a vital role in running very large jobs because it significantly improves file I/O speed, thereby reducing the time required to read or write a file. Without parallel striping, large jobs, many of which require hundreds of GBytes of disk space, would spend much of their time just reading from and writing to disk.
The default stripe size for files in the /p/work1 file system is 1 MByte, and the default stripe count is 4 stripes. Increasing the stripe count is advisable when creating files that are larger than 40 GBytes.
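If you need to adjust striping for a directory that will hold very large files, the Lustre lfs utility can be used. The commands below are a minimal sketch; the directory name is illustrative, and exact option syntax can vary between Lustre versions.
lfs getstripe ${WORKDIR}/big_output        ## show current stripe settings for a directory or file
lfs setstripe -c 8 ${WORKDIR}/big_output   ## set a stripe count of 8; new files created in the directory inherit it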
2.5.3. /p/cwfs/
This path is directed to the Center-Wide File System (CWFS) which is meant for short-term storage (no longer than 120 days). All users have a directory defined in this file system which can be referenced by the environment variable $CENTER. This is accessible from the HPC system login nodes and from the HPC Portal. The CWFS has a formatted capacity of 3300 TBytes and is managed by IBM's Spectrum Scale (formerly GPFS).
2.6. Peak Performance
Conrad is rated at 2.0 peak PFLOPS.
3. Accessing the System
3.1. Kerberos
A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Kerberos is a network authentication tool that provides secure communication by using secret cryptographic keys. Only users with a valid HPCMP Kerberos authentication can gain access to Conrad. More information about installing Kerberos clients on your desktop can be found at HPC Centers: Kerberos & Authentication.
3.2. Logging In
The system host name for the Conrad cluster is conrad.navydsrc.hpc.mil, which will redirect the user to one of six login nodes. Hostnames and IP addresses of these nodes are available upon request from the HPC Help Desk.
- Kerberized SSH
The recommended method is to use dynamic assignment, as follows:
% ssh -l username conrad.navydsrc.hpc.mil
Alternatively, you can manually specify a particular login node, as follows:
% ssh -l username conrad#.navydsrc.hpc.mil   (# = 1 - 6)
- Kerberized rlogin is also allowed.
% krlogin -l username conrad.navydsrc.hpc.mil
3.3. File Transfers
File transfers to DSRC systems (except those to the local archive system) must be performed using Kerberized versions of the following tools: scp, mpscp, sftp, and kftp. Before using any Kerberized tool, you must use a Kerberos client to obtain a Kerberos ticket. Information about installing and using a Kerberos client can be found at HPC Centers: Kerberos & Authentication.
The command below uses secure copy (scp) to copy a single local file into a destination directory on a Conrad login node. The mpscp command is similar to the scp command but has a different underlying means of data transfer and may enable greater transfer rates. The mpscp command has the same syntax as scp.
% scp local_file user@conrad.navydsrc.hpc.mil:/target_dir
Both scp and mpscp can be used to send multiple files. This command transfers all files with the .txt extension to the same destination directory.
% scp *.txt user@conrad.navydsrc.hpc.mil:/target_dir
The example below uses the secure file transfer protocol (sftp) to connect to Conrad, then uses the sftp cd and put commands to change to the destination directory and copy a local file there. The sftp quit command ends the sftp session. Use the sftp help command to see a list of all sftp commands.
% sftp user@conrad.navydsrc.hpc.mil
sftp> cd target_dir
sftp> put local_file
sftp> quit
The Kerberized file transfer protocol (kftp) command differs from sftp in that your username is not specified on the command line, but given later when prompted. The kftp command may not be available in all environments.
% kftp conrad.navydsrc.hpc.mil
username> user
kftp> cd target_dir
kftp> put local_file
kftp> quit
Windows users may use a graphical file transfer protocol (ftp) client such as Filezilla.
4. User Environment
4.1. User Directories
The following user directories are provided for all users on Conrad.
4.1.1. Home Directory
When you log on to Conrad, you will be placed in your home directory, /p/home/username. The environment variable $HOME is automatically set for you and refers to this directory. $HOME is visible to both the login and compute nodes, and may be used to store small user files. It has an initial quota of 100 GBytes. $HOME is not intended as permanent storage, but files stored in $HOME are not subject to being purged.
4.1.2. Work Directory
Conrad has one large file system, /p/work1, for the temporary storage of data files needed for executing programs. You may access your personal working directory by using the $WORKDIR environment variable, which is set for you upon login. Your $WORKDIR directory has an initial quota of 10 TBytes. Your $WORKDIR and the /p/work1 file system will fill up as jobs run. Please review the Purge Policy and be mindful of your disk usage.
REMEMBER: /p/work1 is a "scratch" file system and is not backed up. You are responsible for managing files in your $WORKDIR by backing up files to the archive server and deleting unneeded files when your jobs end. See the section below on Archive Usage for details.
All of your jobs should execute from your $WORKDIR directory, not $HOME. While not technically forbidden, jobs that are run from $HOME are subject to smaller disk space quotas and have a much greater chance of failing if problems occur with that resource.
To avoid unusual errors that can arise from two jobs using the same scratch directory, a common technique is to create a unique subdirectory for each batch job by including the following lines in your batch script:
TMPD=${WORKDIR}/${PBS_JOBID}
mkdir -p ${TMPD}
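A fuller sketch of how such a job-specific directory might be used inside a batch script follows; the executable, input file, and process count are illustrative:
TMPD=${WORKDIR}/${PBS_JOBID}
mkdir -p ${TMPD}
cd ${TMPD}                                         ## run the job from its own scratch directory
cp ${HOME}/proj1/a.out ${HOME}/proj1/input.dat .   ## stage the executable and input data
aprun -n 64 ./a.out > output.txt                   ## launch the parallel job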
4.1.3. Center Directory
The Center-Wide File System (CWFS) provides file storage that is accessible from Conrad's login nodes and from the HPC Portal. The CWFS allows for file transfers and other file and directory operations from Conrad using simple Linux commands. Each user has their own directory in the CWFS. The name of your CWFS directory may vary between machines and between centers, but the environment variable $CENTER will always refer to this directory.
The example below shows how to copy a file from your work directory on Conrad to the CWFS.
While logged into Conrad, copy your file from your Conrad work directory to the CWFS.
% cp $WORKDIR/filename $CENTER
4.2. Shells
The following shells are available on Conrad: csh, bash, ksh, tcsh, zsh, and sh. To change your default shell, please email a request to require@hpc.mil. Your preferred shell will become your default shell on the Conrad cluster within 1-2 working days.
4.3. Environment Variables
A number of environment variables are provided by default on all HPCMP HPC systems. We encourage you to use these variables in your scripts where possible. Doing so will help to simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems.
4.3.1. Login Environment Variables
The following environment variables are common to both the login and batch environments:
Variable | Description |
---|---|
$ARCHIVE_HOME | Your directory on the archive server. |
$ARCHIVE_HOST | The host name of the archive server. |
$BC_HOST | The generic (not node specific) name of the system. |
$CC | The currently selected C compiler. This variable is automatically updated when a new compiler environment is loaded. |
$CENTER | Your directory on the Center-Wide File System (CWFS). |
$COST_HOME | This variable contains the path to the base directory of the default installation of the Common Open Source Tools (COST) installed on a particular compute platform. (See BC policy FY13-01 for COST details.) |
$CSI_HOME | The directory containing the following list of heavily used application packages: ABAQUS, Accelrys, ANSYS, CFD++, Cobalt, EnSight, Fluent, GASP, Gaussian, LS-DYNA, MATLAB, and TotalView, formerly known as the Consolidated Software Initiative (CSI) list. Other application software may also be installed here by our staff. |
$CXX | The currently selected C++ compiler. This variable is automatically updated when a new compiler environment is loaded. |
$DAAC_HOME | The directory containing DAAC supported visualization tools ParaView, VisIt, and EnSight. |
$F77 | The currently selected Fortran 77 compiler. This variable is automatically updated when a new compiler environment is loaded. |
$F90 | The currently selected Fortran 90 compiler. This variable is automatically updated when a new compiler environment is loaded. |
$HOME | Your home directory on the system. |
$JAVA_HOME | The directory containing the default installation of Java. |
$KRB5_HOME | The directory containing the Kerberos utilities. |
$PET_HOME | The directory containing the tools formerly installed and maintained by the PET staff. This variable is deprecated and will be removed from the system in the future. Certain tools will be migrated to $COST_HOME, as appropriate. |
$PROJECTS_HOME | A common directory where group-owned and supported applications and codes may be maintained for use by members of a group. Any project may request a group directory under $PROJECTS_HOME. |
$SAMPLES_HOME | The Sample Code Repository. This is a collection of sample scripts and codes provided and maintained by our staff to help users learn to write their own scripts. There are a number of ready-to-use scripts for a variety of applications. |
$WORKDIR | Your work directory on the local temporary file system (i.e., local high-speed disk). |
4.3.2. Batch-Only Environment Variables
In addition to the variables listed above, the following variables are automatically set only in your batch environment. That is, your batch scripts will be able to see them when they run. These variables are supplied for your convenience and are intended for use inside your batch scripts.
Variable | Description |
---|---|
$BC_CORES_PER_NODE | The number of cores per node for the compute node on which a job is running. |
$BC_MEM_PER_NODE | The approximate maximum user-accessible memory per node (in integer MBytes) for the compute node on which a job is running. |
$BC_MPI_TASKS_ALLOC | The number of MPI tasks allocated for a job. |
$BC_NODE_ALLOC | The number of nodes allocated for a job. |
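For example, a batch script can record the resources PBS assigned to it by echoing these variables into the job's output (a simple illustration; the values depend on the job's select statement):
echo "Cores per node:       $BC_CORES_PER_NODE"
echo "Memory per node (MB): $BC_MEM_PER_NODE"
echo "MPI tasks allocated:  $BC_MPI_TASKS_ALLOC"
echo "Nodes allocated:      $BC_NODE_ALLOC"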
4.4. Modules
Software modules are a convenient way to set needed environment variables and include necessary directories in your path so that commands for particular applications can be found. Conrad uses "modules" to initialize your environment with COTS application software, system commands and libraries, compiler suites, environment variables, and PBS batch system commands.
A number of modules are loaded automatically as soon as you log in. To see the modules which are currently loaded, use the "module list" command. To see the entire list of available modules, use "module avail". You can modify the configuration of your environment by loading and unloading modules. For complete information on how to do this, see the Modules User Guide.
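For example, the following commands illustrate common module operations; the modules named here (cray-shmem and the PrgEnv-* suites) are ones discussed elsewhere in this guide:
module list                            ## show currently loaded modules
module avail                           ## show all available modules
module load cray-shmem                 ## load a module
module unload cray-shmem               ## unload a module
module swap PrgEnv-cray PrgEnv-intel   ## replace one loaded module with another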
4.5. Archive Usage
All of our HPC systems have access to an online archival mass storage system that provides long-term storage for users' files on a petascale robotic tape library system. A 60-TByte disk cache sits in front of the tape file system and temporarily holds files while they are being transferred to or from tape.
Tape file systems have very slow access times. The tapes must be robotically pulled from the tape library, mounted in one of the limited number of tape drives, and wound into position for file archival or retrieval. For this reason, users should always tar up their small files in a large tarball when archiving a significant number of files. A good maximum target size for tarballs is about 200 GBytes or less. At that size, the time required for file transfer and tape I/O is reasonable. Files larger than 1 TByte may span more than one tape, which will greatly increase the time required for both archival and retrieval.
The environment variables $ARCHIVE_HOST and $ARCHIVE_HOME are automatically set for you. $ARCHIVE_HOST can be used to reference the archive server, and $ARCHIVE_HOME can be used to reference your archive directory on the server. These variables can be used when transferring files to/from archive.
4.5.1. Archival Command Synopsis
A synopsis of the main archival utilities is listed below. For information on additional capabilities, see the Archive User Guide or read the online man pages that are available on each system. These commands are non-Kerberized and can be used in batch submission scripts if desired.
- Copy one or more files from the archive server:
rcp ${ARCHIVE_HOST}:${ARCHIVE_HOME}/file_name ${WORKDIR}/proj1
- List files and directory contents on the archive server:
rsh ${ARCHIVE_HOST} ls [lsopts] [file/dir ...]
- Create directories on the archive server:
rsh ${ARCHIVE_HOST} mkdir [-p] [-s] dir1 [dir2 ...]
- Copy one or more files to the archive server:
rcp ${WORKDIR}/proj1/file_name ${ARCHIVE_HOST}:${ARCHIVE_HOME}/proj1
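For example, to archive the results of a completed job, you might bundle them into a single tarball and copy it to the archive server using the commands above. This is a sketch; the proj1 directory and file names are illustrative:
cd ${WORKDIR}/proj1
tar -cf results.tar output_files/                       ## bundle many small files into one tarball
rsh ${ARCHIVE_HOST} mkdir -p ${ARCHIVE_HOME}/proj1      ## create a matching directory on the archive server
rcp results.tar ${ARCHIVE_HOST}:${ARCHIVE_HOME}/proj1   ## copy the tarball to the archive server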
5. Program Development
5.1. Programming Models
Conrad supports five parallel programming models: Message Passing Interface (MPI), Shared Memory (SHMEM), Open Multi-Processing (OpenMP), and the Partitioned Global Address Space (PGAS) languages Co-Array Fortran and Unified Parallel C (UPC). A hybrid MPI/OpenMP programming model is also supported. MPI and SHMEM are message- or data-passing models, while OpenMP uses only shared memory on a node by spawning threads. PGAS programming with Co-Array Fortran and UPC uses a partitioned shared address space in which variables and arrays can be directly addressed by any processing element.
5.1.1. Message Passing Interface (MPI)
Conrad's MPI-2 implementation derives from Argonne National Laboratory's MPICH-2 and implements the MPI-2.2 standard, except for spawn support, as documented by the MPI Forum in "MPI: A Message Passing Interface Standard, Version 2.2."
The Message Passing Interface (MPI) is part of the software support for parallel programming across a network of computer systems through a technique known as message passing. MPI establishes a practical, portable, efficient, and flexible standard for message passing that makes use of the most attractive features of a number of existing message-passing systems, rather than selecting one of them and adopting it as the standard. See "man intro_mpi" for additional information.
When creating an MPI program on Conrad, ensure the following:
- That the default MPI module (cray-mpich) has been loaded. To check this, run the "module list" command. If cray-mpich is not listed and a different MPI module is listed, use the following command to swap the MPI modules:
module swap other_mpi_module cray-mpich
If no MPI module is loaded, load the cray-mpich module:
module load cray-mpich
- That the source code includes one of the following lines:
INCLUDE "mpif.h"     ## for Fortran, or
#include <mpi.h>     ## for C/C++
To compile an MPI program, use the following examples:
ftn -o mpi_program mpi_program.f   ## for Fortran, or
cc -o mpi_program mpi_program.c    ## for C/C++
The program can then be launched using the aprun command, as follows:
aprun -n mpi_procs mpi_program [user_arguments]
where mpi_procs is the number of MPI processes being started. For example:
#### starts 64 mpi processes; 32 on each node, one per core
## request 2 nodes, each with 32 cores and 32 processes per node
#PBS -l select=2:ncpus=32:mpiprocs=32
aprun -n 64 ./a.out
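The compile and launch examples above assume an existing MPI source file. A minimal MPI program in C, shown here only for illustration, could be built and run with those commands:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of MPI processes */

    printf("Hello from MPI process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}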
Accessing More Memory Per MPI Process
By default, one MPI process is started on each core of a node. This means that on Conrad, the available memory on the node is split 32 ways. A common concern for MPI users is the need for more memory for each process. To allow an individual process to use more of the node's memory, you need to allow some cores to remain idle, using the "-N" option, as follows:
aprun -n mpi_procs -N mpi_procs_per_node mpi_program [user_args]
where mpi_procs_per_node is the number of MPI processes to be started on each node. For example:
#### starts 32 mpi processes; only 16 on each node
## request 2 nodes, each with 32 cores and 16 processes per node
#PBS -l select=2:ncpus=32:mpiprocs=16
aprun -n 32 -N 16 ./a.out   ## (assigns only 16 processes per node)
For more information about aprun, see the aprun man page.
5.1.2. Shared Memory (SHMEM)
The logically shared, distributed-memory access (SHMEM) routines provide high-performance, high-bandwidth communication for use in highly parallelized scalable programs. The SHMEM data-passing library routines are similar to the MPI library routines: they pass data between cooperating parallel processes. The SHMEM data-passing routines can be used in programs that perform computations in separate address spaces and that explicitly pass data to and from different processes in the program.
The SHMEM routines minimize the overhead associated with data-passing requests, maximize bandwidth, and minimize data latency. Data latency is the length of time between a process initiating a transfer of data and that data becoming available for use at its destination.
SHMEM routines support remote data transfer through "put" operations that transfer data to a different process and "get" operations that transfer data from a different process. Other supported operations are work-shared broadcast and reduction, barrier synchronization, and atomic memory updates. An atomic memory operation is an atomic read and update operation, such as a fetch and increment, on a remote or local data object. The value read is guaranteed to be the value of the data object just prior to the update. See "man intro_shmem" for details on the SHMEM library after swapping to the cray-shmem module (covered below).
When creating a pure SHMEM program on Conrad, ensure the following:
- That the MPI module is not loaded. To check this, run the "module list" command. If cray-mpich is listed, use the following command:
module unload cray-mpich
- That the logically shared distributed memory access routines (module cray-shmem) are loaded. To check this, run the "module list" command. If cray-shmem is not listed, use the following command:
module load cray-shmem
- That the source code includes one of the following lines:
INCLUDE 'mpp/shmem.fh'     ## for Fortran, or
#include <mpp/shmem.h>     ## for C/C++
To compile a SHMEM program, use the following examples:
ftn -o shmem_program shmem_program.f90   ## for Fortran, or
cc -o shmem_program shmem_program.c      ## for C/C++
The ftn and cc wrappers resolve all SHMEM routine calls automatically. Specific mention of the SHMEM library is not required on the compilation line.
The program can then be launched using the aprun command, as follows:
aprun -n N shmem_program [user_arguments]
where N is the number of processes being started, with each process utilizing one core. The aprun command launches executables across a set of compute nodes. When each member of the parallel application has exited, aprun exits. For more information about aprun, type "man aprun".
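For illustration only, a minimal SHMEM program in C might look like the following; depending on the SHMEM library version installed, initialization may instead use start_pes(0) without a corresponding finalize call:
#include <mpp/shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();                    /* initialize the SHMEM library */
    int me   = shmem_my_pe();        /* this processing element (PE) number */
    int npes = shmem_n_pes();        /* total number of PEs */

    printf("Hello from PE %d of %d\n", me, npes);

    shmem_finalize();
    return 0;
}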
5.1.3. Open Multi-Processing (OpenMP)
OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications. It supports shared-memory multiprocessing programming in C, C++ and Fortran, and consists of a set of compiler directives, library routines, and environment variables that influence compilation and run-time behavior.
When creating an OpenMP program on Conrad, ensure the following:
- That the default MPI module (cray-mpich) has been loaded. To check this, run the "module list" command. If cray-mpich is not listed and a different MPI module is listed, use the following command:
module swap other_mpi_module cray-mpich
If no MPI module is loaded, load the cray-mpich module:
module load cray-mpich
- That if using OpenMP functions (for example, omp_get_wtime), the source code includes one of the following lines:
INCLUDE 'omp.h'      ## for Fortran, or
#include <omp.h>     ## for C/C++
Or, if the code is written in Fortran 90 or later, the following line may be used instead:
USE omp_lib
- That the compile command includes an option to reference the OpenMP library. The PGI, Cray, Intel, and GNU compilers support OpenMP, and each one uses a different option.
To compile an OpenMP program, use the following examples:
For C/C++ codes:
cc -o OpenMP_program -h omp OpenMP_program.c       ## Cray
cc -o OpenMP_program -mp=nonuma OpenMP_program.c   ## PGI
cc -o OpenMP_program -openmp OpenMP_program.c      ## Intel
cc -o OpenMP_program -fopenmp OpenMP_program.c     ## GNU
For Fortran codes:
ftn -o OpenMP_program -h omp OpenMP_program.f       ## Cray
ftn -o OpenMP_program -mp=nonuma OpenMP_program.f   ## PGI
ftn -o OpenMP_program -openmp OpenMP_program.f      ## Intel
ftn -o OpenMP_program -fopenmp OpenMP_program.f     ## GNU
See section 5.2 for additional information on available compilers.
When running OpenMP applications, the $OMP_NUM_THREADS environment variable must be used to specify the number of threads. For example:
setenv OMP_NUM_THREADS 32
aprun -d 32 ./OpenMP_program [user_arguments]
In the example above, the application starts the OpenMP_program on one node and spawns a total of 32 threads. Since Conrad has 32 cores per compute node, this yields 1 thread per core.
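A minimal OpenMP program in C, shown here only for illustration, could be built and run with the commands above:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();   /* this thread's id within the team */
        int nthreads = omp_get_num_threads();  /* number of threads in the parallel region */
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}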
5.1.4. Hybrid Processing (MPI/OpenMP)
An application built with the hybrid model of parallel programming can run on Conrad using both OpenMP and Message Passing Interface (MPI). In hybrid applications, OpenMP threads can be spawned by MPI processes, but MPI calls should not be issued from OpenMP parallel regions or by an OpenMP thread.
When creating a hybrid (MPI/OpenMP) program on Conrad, follow the instructions in the MPI and OpenMP sections above for creating your program. Then use the compilation instructions for OpenMP.
Use the aprun command and the $OMP_NUM_THREADS environment variable to run a hybrid program. You may need aprun options "-n", "-N", and "-d" to get the desired combination of MPI processes, nodes, and cores.
aprun -n mpi_procs -N mpi_procs_per_node -d threads_per_mpi_proc mpi_program
Note that the product of mpi_procs_per_node and threads_per_mpi_proc (-N * -d) should not exceed 32, the number of cores on a Conrad node.
In the following example, we want to run 8 MPI processes, and each MPI process needs about half the memory available on a node. We therefore request 4 nodes (128 cores). We also want each MPI process to launch 6 OpenMP threads, so we set the ompthreads select option accordingly and assign 6 threads per MPI process in the aprun command.
#### MPI/OpenMP on 4 nodes, 8 MPI processes total with 6 threads each
## request 4 nodes, each with 32 cores and 2 processes per node
#PBS -l select=4:ncpus=32:mpiprocs=2:ompthreads=6
## assign 8 MPI processes with 2 MPI processes per node
aprun -n 8 -N 2 -d 6 ./xthi.x
In this example, each node gets two MPI processes, and each MPI process spawns six OpenMP threads. See the aprun man page for more detail on how MPI processes and threads are allocated on the nodes.
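The xthi.x executable in the example is simply a placeholder for a hybrid binary. For illustration, a minimal hybrid MPI/OpenMP program in C might look like this:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP threads are spawned by each MPI process; no MPI calls are made
       inside the parallel region, as recommended above */
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}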
5.1.5. Partitioned Global Address Space (PGAS)
The Cray Fortran compiler supports Co-Array Fortran (CAF), and the Cray C compiler supports Unified Parallel C (UPC). These are PGAS extensions that enable the user to reference memory locations on any node, without the need for message-passing protocols. This can greatly simplify writing and debugging a parallel code. These compilers also allow the user to combine PGAS programming constructs with the flexibility of message-passing protocols. The PGAS extensions are not available for the PGI, Intel, or GNU compilers.
Cray Fortran and C reference manuals currently refer the reader to external sources for details on the CAF and UPC concepts and syntax.
Compilation of UPC and CAF codes is straightforward. Make sure to swap the programming environment module:
module swap PrgEnv-pgi PrgEnv-cray
Then, simply use the standard Cray compilers with the following flags:
ftn -o myprog -h caf myprog.f   ## for Fortran
cc -o myprog -h upc myprog.c    ## for C
Use the aprun command to execute the program as described above for MPI programs:
#PBS -l select=2:ncpus=32:mpiprocs=32
aprun -n 64 ./myprog
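For illustration, a minimal UPC program that could be compiled with the "cc -h upc" command above might look like the following:
#include <upc.h>
#include <stdio.h>

int main(void)
{
    /* MYTHREAD and THREADS are built-in UPC identifiers */
    printf("Hello from UPC thread %d of %d\n", MYTHREAD, THREADS);

    upc_barrier;    /* synchronize all UPC threads */
    return 0;
}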
5.1.6. Accelerated Processing (Offload, Native, Native/Symmetric)
Accelerated processing builds on the hybrid processing model by offloading work to, or running natively on, the Intel Xeon Phi coprocessors. Codes that take advantage of the many integrated core (MIC) architecture are those that vectorize well and those that leverage OpenMP threading. For more information on accelerated processing and the different modes available, please see the Xeon Phi User Guide.
5.2. Available Compilers
Conrad has four programming environment suites.
- Cray Fortran and C/C++
- Portland Group (PGI)
- Intel
- GNU
On Conrad, different compiler commands are used depending on whether a code is built to run on the compute nodes or on the login nodes.
Compiling for the Compute Nodes
Codes compiled to run on the compute nodes may be serial or parallel. The x86-64 instruction set for Intel Haswell Xeon E5-2698v3 processors has extensions for the Floating Point Unit (FPU) that require the module craype-haswell to be loaded. This module is loaded for you by default. To compile codes for execution on the compute nodes, the same compile commands are available in all programming environment suites as shown in the following table:
Language | Cray | PGI | Intel | GNU | Serial/Parallel |
---|---|---|---|---|---|
C | cc | cc | cc | cc | Serial/Parallel |
C++ | CC | CC | CC | CC | Serial/Parallel |
Fortran 77 | f77 | f77 | f77 | f77 | Serial/Parallel |
Fortran 90 | ftn | ftn | ftn | ftn | Serial/Parallel |
Compiling for the Login Nodes
Codes may be compiled to run on the login nodes in one of two ways. Either replace the craype-haswell module with the craype-target-native module and use the compiler commands from the table above, as follows:
module swap craype-haswell craype-target-native
cc myprog.c -o myprog.x
Or, use the serial compiler commands from the table below.
Language | Cray | PGI | Intel | GNU | Serial/Parallel |
---|---|---|---|---|---|
C | craycc | pgcc | icc | gcc | Serial |
C++ | crayCC | pgCC | icpc | g++ | Serial |
Fortran 77 | crayftn | pgf77 | ifort | gfortran | Serial |
Fortran 90 | crayftn | pgf90 | ifort | gfortran | Serial |
The Cray programming environment is loaded for you by default. To use a different programming suite, you will need to swap modules. See Relevant Modules (below) to learn how.
5.2.1. Cray Compiler Environment
The Cray compiler environment has a long tradition of high-performance compilers with excellent vectorization (it vectorizes more loops than other compilers) and cache optimization (automatic blocking and automatic management of what stays in cache).
The Partitioned Global Address Space (PGAS) languages, such as Unified Parallel C (UPC) and Co-Array Fortran, are supported on Conrad via the Cray compiler.
The following table lists some of the more common options that you may use:
Option | Purpose |
---|---|
-c | Generate intermediate object file but do not attempt to link. |
-I directory | Search in directory for include or module files. |
-L directory | Search in directory for libraries. |
-o outfile | Name executable "outfile" rather than the default "a.out". |
-Olevel | Set the optimization level. For more information on optimization, see the section on Profiling and Optimization. |
-f free | Process Fortran codes using free form. |
-h byteswapio | Big-endian files; the default is for little-endian. |
-g | Generate symbolic debug information. |
-s integer64, -s real64 | Treat integer and real variables as 64-bit. |
-s default64 | Pass -s integer64, -s real64 to compiler. |
-h omp (set by default) | Recognize OpenMP directives; disable with "-h noomp". |
-h upc (C only) | Recognize UPC. |
-h caf | Recognize Co-Array Fortran. |
-h dynamic | Compiling using shared objects requires CCM mode for execution on compute nodes. |
-Ktrap=* | Trap errors such as floating point, overflow, and divide by zero (see man page). |
-fPIC | Generate position-independent code for shared libraries. |
Detailed information about these and other compiler options is available in the Cray compiler (crayftn, craycc, and crayCC) man pages on Conrad.
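For example, a compile line combining several of the options above might look like the following (the file and executable names are illustrative):
ftn -O2 -h byteswapio -o mysim mysim.f90   ## optimized build that reads/writes big-endian data files
cc -g -O0 -o mytool mytool.c               ## unoptimized build with symbolic debug information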
5.2.2. Portland Group (PGI) Compiler Suite
The PGI Programming Environment provides a large number of options that are the same for all compilers in the suite. The following table lists some of the more common options that you may use:
Option | Purpose |
---|---|
-c | Generate intermediate object file but do not attempt to link. |
-I directory | Search in directory for include or module files. |
-L directory | Search in directory for libraries. |
-o outfile | Name executable "outfile" rather than the default "a.out". |
-Olevel | Set the optimization level. For more information on optimization, see the section on Profiling and Optimization. |
-M free | Process Fortran codes using free form. |
-i8, -r8 | Treat integer and real variables as 64-bit. |
-Mbyteswapio | Big-endian files; the default is for little-endian. |
-g | Generate symbolic debug information. |
-Mbounds | Add array bound checking. |
-Minfo=all | Reports detailed information about code optimizations to stdout as compile proceeds. |
-Mlist | Generate a file containing the compiler flags used and a line numbered listing of the source code. |
-mp=nonuma | Recognize OpenMP directives. |
-Bdynamic | Compiling using shared objects requires CCM mode for execution on compute nodes. |
-Ktrap=* | Trap errors such as floating point, overflow, and divide by zero (see man page). |
-fPIC | Generate position-independent code for shared libraries. |
Detailed information about these and other compiler options is available in the PGI compiler (pgf95, pgcc, and pgCC) man pages on Conrad.
5.2.3. Intel Compiler Environment
The following table lists some of the more common options that you may use:
Option | Purpose |
---|---|
-c | Generate intermediate object file but do not attempt to link. |
-I directory | Search in directory for include or module files. |
-L directory | Search in directory for libraries. |
-o outfile | Name executable "outfile" rather than the default "a.out". |
-Olevel | Set the optimization level. For more information on optimization, see the section on Profiling and Optimization. |
-free | Process Fortran codes using free form. |
-Bstatic | Causes executable to link to all libraries statically. |
-fpic, or -fPIC | Generates position-independent objects. |
-convert big_endian | Big-endian files; the default is for little-endian. |
-g | Generate symbolic debug information. |
-Minfo=all | Reports detailed information about code optimizations to stdout as compile proceeds. |
-openmp | Recognize OpenMP directives. |
-mp=nonuma | Recognize OpenMP directives. |
-Bdynamic | Compiling using shared objects requires CCM mode for execution on compute nodes. |
-fpe-all=0 | Trap floating point, divide by zero, and overflow exceptions. |
-fPIC | Generate position-independent code for shared libraries. |
Detailed information about these and other compiler options is available in the Intel compiler (ifort, icc, and icpc) man pages on Conrad.
5.2.4. GNU Compiler Collection
The GNU Programming Environment provides a large number of options that are the same for all compilers in the suite. The following table lists some of the more common options that you may use:
Option | Purpose |
---|---|
-c | Generate intermediate object file but do not attempt to link. |
-I directory | Search in directory for include or module files. |
-L directory | Search in directory for libraries. |
-o outfile | Name executable "outfile" rather than the default "a.out". |
-Olevel | Set the optimization level. For more information on optimization, see the section on Profiling and Optimization. |
-g | Generate symbolic debug information. |
-Bstatic | Causes executable to link to all libraries statically. |
-fconvert=big-endian | Big-endian files; the default is for little-endian. |
-Wextra, -Wall | Turns on increased error reporting. |
Detailed information about these and other compiler options is available in the GNU compiler (gfortran, gcc, and g++) man pages on Conrad.
5.3. Relevant Modules
By default, Conrad loads the Cray programming environment for you. The PGI, Intel, and GNU environments are also available. To use one of these, the Cray module must be unloaded and replaced with the one you wish to use. To do this, use the "module swap" command as follows:
module swap PrgEnv-cray PrgEnv-pgi     ## To switch to PGI
module swap PrgEnv-cray PrgEnv-intel   ## To switch to Intel
module swap PrgEnv-cray PrgEnv-gnu     ## To switch to GNU
In addition to the compiler suites, all of these modules also load the MPICH2 and LibSci modules. The MPICH2 module initializes MPI. The LibSci module includes solvers and single-processor and parallel routines that have been tuned for optimal performance on Cray XC systems (BLAS, LAPACK, ScaLAPACK, etc.). For additional information on the MPICH2 and LibSci modules, see the intro_mpi and intro_libsci man pages on Conrad.
The table below shows the naming convention for the various programming environment modules:
Module | Module Name |
---|---|
Cray CCE | PrgEnv-cray |
PGI | PrgEnv-pgi |
Intel | PrgEnv-intel |
GNU | PrgEnv-gnu |
Under each programming environment, the compiler version can be changed. With the default Cray programming environment, for example, the compiler version can be changed from the default to version 8.3.14 with this command:
module swap cce cce/8.3.14
Use the "module avail" command to see all the available compiler versions for Cray CCE, PGI, Intel, and GNU.
A number of Cray-optimized libraries (e.g., FFTW, HDF5, NetCDF, and PETSc) are available on Conrad with associated module files to set up the necessary environment. Because the environment depends on the active PrgEnv-* module, users should load library-related module files after changing the PrgEnv-* module.
When using SHMEM, load the cray-shmem module, as follows:
module load cray-shmem
For more information on using modules, see the Modules User Guide.
5.4. Libraries
Cray's LibSci and Intel's Math Kernel Library (Intel MKL) are both available on Conrad. In addition, an extensive suite of math and science libraries is available in the $PET_HOME directory.
5.4.1. Cray LibSci
Conrad provides Cray's LibSci library as part of the modules that are loaded by default. This library is a collection of single-processor and parallel numerical routines that have been tuned for optimal performance on Cray XC systems. The LibSci library contains optimized versions of many of the BLAS math routines as well as Cray versions of most of the ACML routines. Users can utilize the LibSci routines, instead of the public domain or user written versions, to optimize application performance on Conrad.
The routines in LibSci are automatically included when using the ftn, cc, or CC commands. You do not need to use the "-l sci" flag in your compile command line.
Cray LibSci includes the following:
- Basic Linear Algebra Subroutines (BLAS) - Levels 1, 2, and 3
- Linear Algebra Package (LAPACK)
- Scalable LAPACK (ScaLAPACK) (distributed-memory parallel set of LAPACK routines)
- Basic Linear Algebra Communication Subprograms (BLACS)
- Iterative Refinement Toolkit (IRT)
- SuperLU (for large, sparse nonsymmetric systems of linear equations)
5.4.2. Intel Math Kernel Library (MKL)
Conrad provides the Intel Math Kernel Library (Intel MKL), a set of numerical routines tuned specifically for Intel platform processors and optimized for math, scientific, and engineering applications. The routines, which are available via both FORTRAN and C interfaces, include:
- LAPACK plus BLAS (Levels 1, 2, and 3)
- ScaLAPACK plus PBLAS (Levels 1, 2, and 3)
- Fast Fourier Transform (FFT) routines for single-precision, double-precision, single-precision complex, and double-precision complex data types
- Discrete Fourier Transforms (DFTs)
- Fast Math and Fast Vector Library
- Vector Statistical Library Functions (VSL)
- Vector Transcendental Math Functions (VML)
The MKL routines are part of the Intel Programming Environment as Intel's MKL is bundled with the Intel Compiler Suite.
Linking to the Intel Math Kernel Libraries can be complex and is beyond the scope of this introductory guide. Documentation explaining the full feature set along with instructions for linking can be found at the Intel Math Kernel Library documentation page.
Intel also makes a link advisor available to assist users with selecting proper linker and compiler options: http://software.intel.com/sites/products/mkl.
5.4.3. Additional Math Libraries
There is also an extensive set of Math libraries available in the $PET_HOME/MATH directory on Conrad. Information about these libraries can be found on the Baseline Configuration website at BC policy FY13-01.
5.5. Debuggers
Conrad provides the TotalView, DDT, and GNU Project Debugger (gdb) debuggers to assist users in debugging their code.
5.5.1. TotalView
TotalView is a debugger that supports threads, MPI, OpenMP, C/C++, Fortran, and mixed-language codes. It provides advanced features such as on-demand memory leak detection, other heap allocation debugging features, and the Standard Template Library Viewer (STLView). Unique features such as dive, a wide variety of breakpoints, the Message Queue Graph/Visualizer, powerful data analysis, and control at the thread level are also available.
Follow the steps below to use TotalView on Conrad via a UNIX X-Windows interface.
- Ensure that an X server is running on your local system. Linux users will likely have this by default, but MS Windows users will need to install a third party X Windows solution. There are various options available.
- For Linux users, connect to Conrad using "ssh -Y". Windows users will need to use PuTTY with X11 forwarding enabled (Connection->SSH->X11->Enable X11 forwarding).
- Compile your program on Conrad with the "-g" option.
- Submit an interactive job:
qsub -l select=1:ncpus=32:mpiprocs=32 -A Project_ID -l walltime=00:30:00 -q debug -X -I
Once your job has been scheduled, you will be logged into an interactive batch session on a service node that is shared with other users.
- Load the TotalView module:
module load totalview
- Start program execution:
totalview aprun -a -n 4 ./my_mpi_prog.exe arg1 arg2 ...
- After a short delay, the TotalView windows will pop up. Click "GO" and then "Yes" to start program execution.
An example of using TotalView can be found in $SAMPLES_HOME/Programming/Totalview_Example on Conrad. For more information on using TotalView, see the TotalView Documentation page.
5.5.2. DDT
DDT is a debugger that supports threads, MPI, OpenMP, C/C++, and Fortran, Coarray Fortran, UPC, and CUDA. Memory debugging and data visualization are supported for large-scale parallel applications. The Parallel Stack Viewer is a unique way to see the program state of all processes and threads at a glance.
To use DDT on Conrad, follow steps 1 through 4 (above) as for TotalView, but load and use the DDT debugger instead.
- Load the DDT module:
module load ddt
- Start program execution:
ddt -n 4 ./my_mpi_prog.exe arg1 arg2 ...
- The DDT window will pop up. Verify the application name and number of MPI processes. Click "Run".
An example of using DDT can be found in $SAMPLES_HOME/Programming/DDT_Example on Conrad.
5.5.3. GDB
The GNU Project Debugger (gdb) is a source-level debugger that can be invoked either with a program for execution or a running process id. To launch your program under gdb for debugging, use the following command:
gdb a.out corefile
To attach gdb to a program that is already executing on a node, use the following command:
gdb a.out pid
For more information, the GDB manual can be found at http://www.gnu.org/software/gdb.
5.6. Code Profiling and Optimization
Profiling is the process of analyzing the execution flow and characteristics of your program to identify sections of code that are likely candidates for optimization, which increases the performance of a program by modifying certain aspects for increased efficiency.
We provide CrayPat to assist you in the profiling process. In addition, a basic overview of optimization methods with information about how they may improve the performance of your code can be found in Performance Optimization Methods (below).
5.6.1. CrayPat
CrayPat is an optional performance analysis tool used to evaluate program behavior on Cray supercomputer systems. CrayPat consists of the following major components: pat_build, pat_report, and pat_help. The data produced by CrayPat also can be used with Cray Apprentice2, an analysis tool that is used to visualize and explore the performance data captured during program execution.
Man pages are available for pat_build, pat_report, pat_help, and Apprentice2. Additional information can be found in the document "Using Cray Performance Analysis Tools."
The following steps should get you started using CrayPat:
- Load the "perftools" module
module load perftools
- Compile the code, creating object files.
ftn mycode.f90 -c
- Link the object files into your executable.
ftn *.o -o mycode.x
- Use the pat_build command to generate an instrumented executable.
pat_build -g mpi -u mycode.x mycode.x+pat
This generates an instrumented executable called mycode.x+pat. Here the "-g" option enables the "mpi" tracegroup. See "man pat_build" for available tracegroups.
- Run the instrumented executable with aprun via PBS.
aprun -n 4 ./mycode.x+pat
This generates an instrumented output file (e.g., mycode.x+pat+2007-12tdt.xf).
- Use pat_report to display the statistics from the output file:
pat_report mycode.x+pat+2007-12tdt.xf > mycode.pat_report
Additional profiling options are available. See "man pat_build" for additional instrumentation options.
5.6.2. Additional Profiling Tools
There is also a set of profiling tools available in the $PET_HOME/pkgs directory on Conrad. Information about these tools may be found on the Baseline Configuration Web site at BC policy FY13-01.
5.6.3. Program Development Reminders
If an application is not programmed for distributed memory, then only the cores on a single node can be used. This is limited to 32 cores on Conrad.
Keep the system architecture in mind during code development. For instance, if your program requires more memory than is available on a single node, then you will need to parallelize your code so that it can function across multiple nodes.
5.6.4. Compiler Optimization Options
The "-Olevel" option enables code optimization when compiling. The level that you choose (0-4) will determine how aggressive the optimization will be. Increasing levels of optimization may increase performance significantly, but you should note that a loss of precision may also occur. There are also additional options that may enable further optimizations. The following table contains the most commonly used options.
Option | Description | Compiler Suite |
---|---|---|
-O0 | No Optimization. (default in GNU) | All |
-O1 | Scheduling within extended basic blocks is performed. Some register allocation is performed. No global optimization. | All |
-O2 | Level 1 plus traditional scalar optimizations such as induction recognition and loop invariant motion are performed by the global optimizer. Generally safe and beneficial. (default in PGI, Cray, & Intel) | All |
-O3 | Levels 1 and 2 plus more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable. Generally beneficial. | All |
-O4 | Levels 1, 2, and 3 plus hoisting of guarded invariant floating point expressions is enabled. | PGI |
-fast, -fastsse | Chooses generally optimal flags for the target platform. Includes: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz. | PGI |
-Mipa=fast,inline | Performs Interprocedural Analysis (IPA) with generally optimal IPA flags for the target platform, and inlining. IPA can be very time-consuming. Flag must be used in both compilation and linking steps. | PGI |
-Minline=levels:n | Number of levels of inlining (default: n = 1) | PGI |
-fipa-* | The GNU compilers automatically enable IPA at various -O levels. To set these manually, see the options beginning with -fipa in the gcc man page. | GNU |
-O ipan | Specifies various levels of inlining (n=0-5) | Cray |
-O vectorn | Specifies various levels of vectorization (n = 0-3) | Cray |
-finline-functions | Enables function inlining within a single file | Intel |
-ipon | Enables interprocedural optimization between files and produces up to n object files | Intel |
-inline-level=n | Number of levels of inlining (default: n=2) | Intel |
-ra | Creates a listing file with optimization info | Cray |
-Mlist | Creates a listing file with optimization info | PGI |
-Minfo | Info about optimizations performed | PGI |
-Mneginfo | Info on why certain optimizations are not performed | PGI |
-opt-reportn | Generate optimization report with n levels of detail | Intel |
-xHost | Compiler generates code with the highest instruction set available on the processor. | Intel |
5.6.5. Performance Optimization Methods
Optimization generally increases compilation time and executable size, and may make debugging difficult. However, it usually produces code that runs significantly faster. The optimizations that you can use will vary depending on your code and the system on which you are running.
Note: Before considering optimization, you should always ensure that your code runs correctly and produces valid output.
In general, there are four main categories of optimization:
- Global Optimization
- Loop Optimization
- Interprocedural Analysis and Optimization (IPA)
- Function Inlining
Global Optimization
A technique that looks at the program as a whole and may perform any of the following actions:
- Operates on the code across all of its basic blocks
- Performs control-flow and data-flow analysis for an entire program
- Detects all loops, including those formed by IF and GOTO statements, and performs general optimization
- Constant propagation
- Copy propagation
- Dead store elimination
- Global register allocation
- Invariant code motion
- Induction variable elimination
Loop Optimization
A technique that focuses on loops (for, while, etc.) in your code and looks for ways to reduce loop iterations or parallelize the loop operations. The following types of actions may be performed:
- Vectorization - rewrites loops so that their operations can be carried out with vector (SIMD) hardware instructions and registers, improving throughput and memory access. Many compilers can vectorize loops automatically when they meet certain criteria.
- Loop unrolling - (also known as "unwinding") replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization.
- Parallelization - divides loop operations over multiple processors where possible.
Interprocedural Analysis and Optimization (IPA)
A technique that allows the use of information across function call boundaries to perform optimizations that would otherwise be unavailable.
Function Inlining
A technique that seeks to reduce function call and return overhead. It:
- Is used with functions that are called numerous times from relatively few locations.
- Allows a function call to be replaced by a copy of the body of that function.
- May create opportunities for other types of optimization
- May not be beneficial. Improper use may increase code size and actually result in less efficient code.
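One practical way to see which of these optimizations the compiler actually applied is to request an optimization listing or report using the flags from the table in Section 5.6.4. For example (my_prog.f90 is a placeholder source file):
ftn -O3 -ra my_prog.f90 -o my_prog                    ## Cray: writes a listing file describing the optimizations applied
ftn -fast -Minfo -Mneginfo my_prog.f90 -o my_prog     ## PGI: reports optimizations performed and reasons some were skipped
The Intel -opt-report option (see the table above) provides similar information; see the compiler man pages for details.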
6. Batch Scheduling
6.1. Scheduler
The Portable Batch System (PBS) is currently running on Conrad. It schedules jobs and manages resources and job queues, and can be accessed through the interactive batch environment or by submitting a batch request. PBS is able to manage both single-processor and multiprocessor jobs. The PBS module is automatically loaded for you when you log in.
6.2. Queue Information
The following table describes the PBS queues available on Conrad:
Priority | Queue Name | Job Class | Max Wall Clock Time | Max Cores Per Job | Comments |
---|---|---|---|---|---|
Highest | urgent | Urgent | 24 Hours | 768 | Designated urgent projects by DoD HPCMP |
| frontier | Frontier | 168 Hours | 16,000 | Frontier projects only |
| high | High | 168 Hours | 16,000 | Designated high-priority projects by service/agency |
| debug | Debug | 30 Minutes | 3,072 | User diagnostic jobs |
| standard | Standard | 168 Hours | 8,000 | Normal priority user jobs |
| phi | N/A | 168 Hours | 2,376 | Phi-accelerated jobs |
| bigmem | N/A | 24 Hours | 224 | Large-memory jobs |
| transfer | N/A | 48 Hours | N/A | Data transfer jobs |
Lowest | background | Background | 4 Hours | 2,048 | User jobs that will not be charged against the project allocation |
6.3. Interactive Logins
When you log in to Conrad, you will be running in an interactive shell on a login node. The login nodes provide login access for Conrad and support such activities as compiling, editing, and general interactive use by all users. Please note the Login Node Abuse policy. The preferred method for running resource-intensive executions is to use an interactive batch session.
6.4. Interactive Batch Sessions
An interactive session on a compute node is possible using the PBS qsub command with the "-I" option from a login node. Once PBS has scheduled your request to the specified queue, you will be directly logged into a compute node, and this session can last as long as your requested wall time. For example:
qsub -l select=N1:ncpus=32:mpiprocs=N2 -A Project_ID -q queue_name -l walltime=HHH:MM:SS -I
You must specify the number of nodes requested (N1), the number of processes per node (N2), the desired maximum walltime, your project ID, and a job queue. Valid values for N2 are between 1 and 32.
Your interactive batch session will be scheduled just as normal batch jobs are, so depending on the other jobs in the queue, it may take some time to start. Once your interactive batch shell starts, it will be running on a service node that is shared by other users. At this point, you can launch parallel applications onto your assigned set of compute nodes by using the aprun command. You can also run interactive commands or scripts on this service node, but you should limit your memory and CPU usage. Use Cluster Compatibility Mode to execute memory- and process-intensive commands such as tar and gzip/gunzip, and certain serial applications, directly on a dedicated compute node.
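For example, a 30-minute interactive session on 2 nodes in the debug queue might be requested as follows (Project_ID and the executable name are placeholders); once the shell starts on the service node, aprun launches work on the assigned compute nodes:
qsub -l select=2:ncpus=32:mpiprocs=32 -A Project_ID -q debug -l walltime=00:30:00 -I
## ... after the interactive shell starts:
aprun -n 64 ./a.out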
6.5. Cluster Compatibility Mode (CCM)
You can also request direct access to a compute node by including the "ccm" option in your PBS interactive batch job submission. For example:
qsub -l ccm=1 -l select=N1:ncpus=32:mpiprocs=N2 -A Project_ID -q queue_name -l walltime=HHH:MM:SS -I
You must specify the number of nodes requested (N1) and the number of processes per node (N2), the desired maximum walltime, your project ID, and a job queue.
Once scheduled by the PBS scheduler, you will again have an interactive shell session on a shared service node. Then, issue the "ccmlogin" command, and you will be logged onto the first compute node in the set of nodes to which you have been assigned. Your environment will react much the same as a normal shared service node. However, you will now have dedicated access to the entire compute node which will allow you to run serial applications as well as memory- and process-intensive commands such as tar and gzip/gunzip without affecting other users.
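A minimal sketch of this workflow is shown below; the queue, wall time, and the final command are illustrative, and results_dir is a placeholder directory name:
qsub -l ccm=1 -l select=1:ncpus=32:mpiprocs=32 -A Project_ID -q standard -l walltime=01:00:00 -I
## ... once the interactive shell starts on the shared service node:
ccmlogin
## now on a dedicated compute node; run serial or memory-intensive commands directly, e.g.:
tar -czf results.tar.gz results_dir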
6.6. Batch Request Submission
PBS batch jobs are submitted via the qsub command. The format of this command is:
qsub [ options ] batch_script_file
qsub options may be specified on the command line or embedded in the batch script file by lines beginning with "#PBS".
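For example, assuming a script named run_job.pbs (a placeholder name):
qsub run_job.pbs                        ## options taken from the #PBS lines in the script
qsub -l walltime=04:00:00 run_job.pbs   ## command-line options generally take precedence over the corresponding #PBS lines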
For a more thorough discussion of PBS batch submission, see the Conrad PBS Guide.
6.7. Batch Resource Directives
Batch resource directives allow you to specify to PBS how your batch jobs should be run and what resources your job requires. Although PBS has many directives, you only need to know a few to run most jobs.
The basic syntax of PBS directives is as follows:
#PBS option[[=]value]
where some options may require values to be included. For example, to start a 16-process job, you would request one node of 32 cores and specify that you will be running 16 processes per node:
#PBS -l select=1:ncpus=32:mpiprocs=16
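Within the body of the corresponding batch script, the matching launch line for this 16-process example would be (executable name is a placeholder):
aprun -n 16 ./a.out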
The following directives are required for all jobs:
Directive | Value | Description |
---|---|---|
-A | Project_ID | Name of the project |
-q | queue_name | Name of the queue |
-l | select=N1:ncpus=32:mpiprocs=N2 | N1 = Number of nodes; N2 = MPI processes per node (N2 can be between 1 and 32) |
-l | walltime=HHH:MM:SS | Maximum wall clock time |
The following directives are optional, but are commonly used:
Directive | Value | Description |
---|---|---|
-N | Job Name | Name of the job. |
-e | File name | Redirect standard error to the named file. |
-o | File name | Redirect standard output to the named file. |
-j | oe | Merge standard error and standard output into standard output. |
-l application | application_name | Identify the application being used. |
-I | | Request an interactive batch shell. |
-V | | Export all environment variables to the job. |
-v | Variable list | Export specific environment variables to the job. |
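Putting these together, a typical directive preamble for a 12-hour, 2-node job with 16 MPI processes per node might look like the following sketch (Project_ID, the job name, and the application string are placeholders):
#PBS -A Project_ID
#PBS -q standard
#PBS -l select=2:ncpus=32:mpiprocs=16
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -j oe
#PBS -l application=other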
A more complete listing of batch resource directives is available in the Conrad PBS Guide.
6.8. Launch Commands
On Conrad the PBS batch scripts and the PBS interactive login session run on a service node, not a compute node. The only way to send your executable to the compute nodes is to use the aprun command. The following example command line could be used within your batch script or in a PBS interactive session, sending the executable ./a.out to 64 compute cores.
aprun -n 64 ./a.out
Option | Description |
---|---|
-n # | The total number of MPI processes. |
-N # | The number of MPI processes to place per node. Useful for getting more memory per MPI process. |
-d # | The number of OpenMP threads per MPI process. |
-B | Directs aprun to get values for -n, -N, and -d from PBS directives instead of from the aprun command line. Simplifies and saves time. |
-S # | The number of MPI processes to place per NUMA node (8 cores with shared L3 cache). Useful for getting more L3 cache per process. |
-j 1 | Run in single-stream mode, using only one core per core pair. Useful for getting more L2 cache, memory, and other resources per MPI process. |
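A few illustrative aprun invocations combining these options (executable names are placeholders, and the job's select statement must request enough nodes):
aprun -n 64 ./a.out                 ## 64 MPI processes, as in the example above
aprun -n 32 -N 16 ./a.out           ## 32 MPI processes, 16 per node, for more memory per process
aprun -n 16 -N 8 -d 4 ./hybrid.exe  ## hybrid MPI/OpenMP: 8 processes per node, 4 threads each (also set OMP_NUM_THREADS=4)
aprun -n 64 -j 1 ./a.out            ## single-stream mode, one core per core pair (see table above)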
For more in-depth discussion of the aprun options, consult the aprun man page and the Conrad PBS Guide.
A serial executable can be sent to one compute node using aprun or ccmrun:
aprun -n 1 serial_executable ## OR
ccmrun serial_executable
It is also possible to run a script on one compute node using ccmrun when Cluster Compatibility Mode has been invoked (-l ccm=1).
ccmrun script_to_run
Use aprun to launch MPI, SHMEM, OpenMP, Hybrid MPI/OpenMP, and PGAS executables. For examples of this, see MPI, SHMEM, OpenMP, Hybrid MPI/OpenMP, and PGAS (above) or look in the $SAMPLES_HOME directory on Conrad. For more information about aprun, see the aprun man page.
6.9. Sample Scripts
While it is possible to include all PBS directives at the qsub command line, the preferred method is to embed the PBS directives within the batch request script using "#PBS". The following scripts are basic examples and contain all of the required directives, some frequently used optional directives, and common script components. The parallel (MPI) example starts 64 processes on 2 nodes of 32 cores each, with one MPI process per core. More thorough examples are available in the Conrad PBS Guide and in the Sample Code Repository ($SAMPLES_HOME) on Conrad.
The following example is a good starting template for a batch script to run a serial job for one hour:
#!/bin/bash ## Specify your shell
#
# Specify name of the job
#PBS -N serialjob
#
# Append std output to file serialjob.out
#PBS -o serialjob.out
#
# Append std error to file serialjob.err
#PBS -e serialjob.err
#
# Specify Project ID to be charged (Required)
#PBS -A Project_ID
#
# Request wall clock time of 1 hour (Required)
#PBS -l walltime=01:00:00
#
# Specify queue name (Required)
#PBS -q standard
#
# Specify the number of cores (Required)
#PBS -l select=1:ncpus=1
#
#PBS -S /bin/bash
# Change to the specified directory
cd $WORKDIR
#
# Execute the serial executable on 1 core
aprun ./serial_fort.exe
# End of batch job
The first few lines tell PBS to save the standard output and error output to the given files and to give the job a name. We estimate the run time to be about one hour, which is acceptable for the standard batch queue, and since the job is serial we request a single core.
The following example is a good starting template for a batch script to run a parallel (MPI) job for two hours:
#!/bin/bash
## The first line (above) specifies the shell to use for parsing
## the remaining lines of the batch script.
#
## Required PBS Directives --------------------------------------
#PBS -A Project_ID
#PBS -q standard
#PBS -l select=2:ncpus=32:mpiprocs=32
#PBS -l walltime=02:00:00
#
## Optional PBS Directives --------------------------------------
#PBS -N Test_Run_1
#PBS -j oe
#PBS -V
#PBS -S /bin/bash
## Option for Cluster Compatibility Mode (ccm=1)
#PBS -l ccm=1
#
## Execution Block ----------------------------------------------
# Environment Setup
# cd to your personal directory in the scratch file system
cd $WORKDIR
#
# create a job-specific subdirectory based on JOBID and cd to it
JOBID=`echo $PBS_JOBID | cut -d '.' -f 1`
if [ ! -d $JOBID ]; then
  mkdir -p $JOBID
fi
cd $JOBID
#
# Launching
# copy executable from $HOME and submit it
cp $HOME/mympiprog.exe .
aprun -n 64 ./mympiprog.exe > mympiprog.out
#
# Clean up
# archive your results
# Using the "here document" syntax, create a job script
# for archiving your data.
cd $WORKDIR
rm -f archive_job
cat > archive_job << END
#!/bin/bash
#PBS -l walltime=06:00:00
#PBS -q transfer
#PBS -A Project_ID
#PBS -l select=1:ncpus=1
#PBS -j oe
#PBS -S /bin/bash
cd $WORKDIR
rsh $ARCHIVE_HOST mkdir $ARCHIVE_HOME/$JOBID
rcp -r $JOBID $ARCHIVE_HOST:$ARCHIVE_HOME/
rsh $ARCHIVE_HOST ls -l $ARCHIVE_HOME/$JOBID
# Remove scratch directory from the file system.
rm -rf $JOBID
END
#
# Submit the archive job script.
qsub archive_job
# End of batch job
The required directives specify the project ID and queue, request 2 nodes of 32 cores each with 32 MPI processes per node (64 total cores), and set the wall time, which we estimate at about 2 hours and know is acceptable for the standard batch queue. The optional directives name the job, merge standard error into standard output, and export the login environment to the job. The default number of cores per node is 32.
6.10. PBS Commands
The following commands provide the basic functionality for using the PBS batch system:
qsub: Used to submit jobs for batch processing.
qsub [ qsub_options ] my_job_script
qstat: Used to check the status of submitted jobs.
qstat PBS_JOBID ## check one job
qstat -u my_user_name ## check all of user's jobs
qdel: Used to kill queued or running jobs.
qdel PBS_JOBID
A more complete list of PBS commands is available in the Conrad PBS Guide.
6.11. Advance Reservations
A subset of Conrad's nodes has been set aside for use as part of the Advance Reservation Service (ARS). The ARS allows users to reserve a user-designated number of nodes for a specified number of hours starting at a specific date/time. This service enables users to execute interactive or other time-critical jobs within the batch system environment. The ARS is accessible via most modern web browsers at https://reservation.hpc.mil. Authenticated access is required. The ARS User Guide is available on HPC Centers.
7. Software Resources
7.1. Application Software
A complete listing with installed versions can be found on our software page. The general rule for all COTS software packages is that the two latest versions will be maintained on our systems. For convenience, modules are also available for most COTS software packages.
7.2. Useful Utilities
The following utilities are available on Conrad:
Utility | Description |
---|---|
check_license | Checks the status of HPCMP shared applications. |
mpscp | High-performance remote file copy. |
node_use | Display the amount of free and used memory for login nodes. |
qpeek | Display spooled stdout and stderr for an executing batch job. |
qview | Display information about batch jobs and queues. |
show_queues | Report current batch queue status, usage, and limits. |
show_storage | Display archive server allocation and usage by subproject. |
show_usage | Display CPU allocation and usage by subproject. |
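Most of these utilities are run directly from the command line on a login node; for example (the job ID shown is a placeholder):
show_queues          ## current batch queue status, usage, and limits
show_usage           ## CPU allocation and usage by subproject
qpeek 123456         ## spooled stdout/stderr of a running batch job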
7.3. Sample Code Repository
The Sample Code Repository is a directory that contains examples for COTS batch scripts, building and using serial and parallel programs, data management, and accessing and using serial and parallel math libraries. The $SAMPLES_HOME environment variable contains the path to this area, and is automatically defined in your login environment. Below is a listing of the examples provided in the Sample Code Repository on Conrad.
Application_Name: Use of the application name resource.
Sub-Directory | Description |
---|---|
application_names | README and list of valid strings for application names intended for use in every PBS script preamble. The HPCMP encourages applications not specifically named in the list to be denoted as "other". |
Applications: Application-specific examples; interactive job submit scripts; use of the application name resource; software license use.
Sub-Directory | Description |
---|---|
abaqus | Basic PBS batch script, input deck and README file instructing how to run an Abaqus job |
fluent | Instructions, PBS job submission scripts, input data and example output data for the FLUENT CFD application. |
gamess | Instructions, PBS batch script and example input data file for the GAMESS CCM application. |
matlab | Basic batch script for running a MATLAB job |
Data_Management: Archiving and retrieving files; Lustre striping; file searching; $WORKDIR use.
Sub-Directory | Description |
---|---|
MPSCP_to_Archive_Example | Instructions and sample scripts on using the mpscp utility for transferring files to/from the archive server. |
Transfer_Queue_Example | Sample batch script for data transfer to archive |
Postprocess_Example | Example showing how to submit a post-processing script at the end of a parallel computation job to copy files from temporary storage to the archive storage |
Transfer_Queue_with_Archive_Commands | Example and instructions on recommended best practice to stage data from mass storage using the transfer queue prior to job execution, then processing using that data, then passing output data back to the archive storage using the transfer queue again. |
Parallel_Environment: MPI, OpenMP, and hybrid examples; single-core jobs; large-memory jobs; running multiple applications within a single batch job.
Sub-Directory | Description |
---|---|
Calculate_Prime_MPI | Sample code and scripts for compiling and executing an MPI code |
CCMRUN_Example | Using CCM to run simultaneous compute tasks on the compute nodes, and X11 forwarding with ccmlogin to a compute node. |
Cluster_Mode | Using the Cluster Compatibility Mode (CCM). |
Hello_World_Example | Sample codes and job scripts for MPI, OpenMP and Hybrid MPI/OpenMP application of hello world batch jobs. |
Multiple_Jobs_per_Node | The following examples demonstrate how to set up a PBS job to perform multiple simultaneous computation tasks: |
(Multiple_Parallel) | Running multiple parallel tasks simultaneously on four nodes. |
(Mix_Serial_Parallel) | Running serial and parallel tasks together in one job.
(Serial_Jobs_on_One_Node) | Running multiple serial jobs on one compute node |
(Multi_Exec_One_Comm) | Running multiple parallel binaries simultaneously on separate cores using the same communicator.
Programming: Basic code compilation; debugging; use of library files; static vs. dynamic linking; Makefiles; Endian conversion.
Sub-Directory | Description |
---|---|
BLACS_Example | Sample BLACS Fortran program, compile script and PBS submission scripts. |
CCM-DSL_Example | Compile and run a dynamically linked executable with DSL and CCM support on compute nodes. |
COMPILE_INFO | Instructions for using the (default) Cray compilers to compile Fortran and C code using MPI. Various optimization flags are discussed. Includes the flags to compile OpenMP code for all programming environments. Also includes instructions for using the configure utility to create a Makefile for any system. |
Core_Files | Instructions and source code for viewing core files with different viewers. |
CrayPat_Example | Instructions for using the CrayPat profiling tool. |
DDT_Example | Using DDT to debug a small example code in an interactive batch job. |
Endian_Conversion | Text file presenting the PGI, Cray, GNU and Intel compiler options for enabling binary data formatted on a different architecture to be readable by code compiled on Conrad. |
Link_Libraries | Notes on how the libraries needed for linking differ from those on similar Cray machines, largely due to changes in the X11 libraries.
Memory_Usage | A routine callable from Fortran or C for determining how much memory a process is using. |
Phi_Example | Several examples demonstrating use of system tools, compilation techniques and PBS scripts to generate and execute code using the Phi accelerators. |
ScaLAPACK_Example | Sample ScaLAPACK Fortran program, compile script and PBS submission scripts. |
SharedObject_Compile | Sample Shared Object compile info. |
Timers_Fortran | Serial Timers using Fortran Intrinsics f77 and f90/95. |
TotalView_Example | Using TotalView to debug a small example code in an interactive batch job. |
User_Environment: Use of modules; customizing the login environment.
Sub-Directory | Description |
---|---|
Module_Swap_Example | Batch script demonstrating how to swap one module version for another. |
Workload_Management: Basic batch scripting; use of the transfer queue; job arrays; job dependencies; Secure Remote Desktop; job monitoring.
Sub-Directory | Description |
---|---|
BatchScript_Example | Simple PBS batch script showing all required options. |
Hybrid_Example | Sample job script for running hybrid MPI/OpenMP jobs. |
Job_Array_Example | Sample job script for using job arrays. |
MPI_Example | Sample script for running MPI jobs. |
OpenMP_Example | Sample script for running OpenMP jobs. |
Serial_Example | Sample scripts for running multiple concurrent sequential jobs. |
Transfer_Example | Sample batch script for data transfer. |
pbs_scripts | Simple PBS batch scripts for MPI, OpenMP, hybrid MPI/OpenMP and transfer jobs. |
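To examine or run any of the examples listed above, copy the corresponding sub-directory from $SAMPLES_HOME into your workspace. A minimal sketch follows; the category and sub-directory names are taken from the listing above, though the exact layout on the system may differ slightly:
ls $SAMPLES_HOME
cp -r $SAMPLES_HOME/Workload_Management/BatchScript_Example $WORKDIR
cd $WORKDIR/BatchScript_Example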
8. Links to Vendor Documentation
8.1. Cray Links
Cray Home: http://docs.cray.com
Cray Application Developer's Environment User's Guide:
http://docs.cray.com/books/S-2396-50/S-2396-50.pdf
8.2. SUSE Links
Novell Home: http://www.novell.com/linux
Novell SUSE Linux Enterprise Server:
http://www.novell.com/products/server
8.3. GNU Links
GNU Home: http://www.gnu.org
GNU Compiler:
http://gcc.gnu.org/onlinedocs
8.4. Portland Group (PGI) Links
Portland Group Resources Page:
http://www.pgroup.com/resources
Portland Group User's Guide:
http://www.pgroup.com/doc/pgiug.pdf
8.5. Intel Links
Intel Documentation:
http://software.intel.com/en-us/intel-software-technical-documentation
Intel Compiler List:
http://software.intel.com/en-us/intel-compilers
8.6. Debugger Links
TotalView Documentation:
http://www.roguewave.com/support/product-documentation/totalview.aspx
DDT Tutorials:
https://developer.arm.com/products/software-development-tools/hpc/arm-forge/arm-ddt/video-demos-and-tutorials-for-arm-ddt