Navy DSRC: Narwhal User Guide

HPE Cray EX (Narwhal)
User Guide

1. Introduction
1.1. Document Scope and Assumptions
1.2. Policies to Review
1.2.1. Login Node Abuse Policy
1.2.2. Workspace Purge Policy
1.3. Obtaining an Account
1.4. Requesting Assistance
2. System Configuration
2.1. System Summary
2.2. Processors
2.3. Memory
2.4. Operating System
2.5. File Systems
2.5.1. /p/home
2.5.2. /p/work1
2.5.3. /p/work2
2.5.4. /archive
2.5.5. /p/cwfs/
2.6. Peak Performance
3. Accessing the System
3.1. Kerberos
3.2. Logging In
3.3. File Transfers
4. User Environment
4.1. User Directories
4.1.1. Home Directory
4.1.2. Work Directory
4.1.3. Center Directory
4.2. Shells
4.3. Environment Variables
4.3.1. Login Environment Variables
4.3.2. Batch-Only Environment Variables
4.4. Modules
4.5. Archive Usage
4.5.1. Archival Command Synopsis
5. Program Development
5.1. Programming Models
5.1.1. Message Passing Interface (MPI)
5.1.2. Open Multi-Processing (OpenMP)
5.1.3. Hybrid Processing (MPI/OpenMP)
5.2. Available Compilers
5.2.1. Cray Compiler Environment
5.2.2. Intel Compiler Environment
5.2.3. GNU Compiler Environment
5.2.4. AOCC Compiler Environment
5.2.5. NVIDIA Compiler Environment
5.3. Relevant Modules
5.4. Libraries
5.4.1. Cray Scientific and Math Libraries (CSML) LibSci
5.4.2. Intel Math Kernel Library (MKL)
5.4.3. Additional Math Libraries
5.5. Debuggers
5.5.1. Cray Debugger Support Tools (CDST)
5.5.1.1. Gdb4hpc
5.5.1.2. Valgrind4hpc
5.5.1.3. Stack Trace Analysis Tool (STAT)
5.5.1.4. Abnormal Termination Processing (ATP)
5.5.1.5. Cray Comparative Debugger (CCDB)
5.5.2. Forge (formerly DDT)
5.5.3. GDB
5.6. Code Profiling and Optimization
5.6.1. CrayPat
5.6.2. gprof
5.6.3. Codecov
5.6.4. Additional Profiling Tools
5.6.5. Program Development Reminders
5.6.6. Compiler Optimization Options
5.6.7. Performance Optimization Methods
6. Batch Scheduling
6.1. Scheduler
6.2. Queue Information
6.3. Interactive Logins
6.4. Interactive Batch Sessions
6.5. Batch Request Submission
6.6. Batch Resource Directives
6.7. Launch Commands
6.8. Sample Scripts
6.9. PBS Commands
6.10. Determining Time Remaining in a Batch Job
6.11. Advance Reservations
7. Software Resources
7.1. Application Software
7.2. Useful Utilities
7.3. Sample Code Repository
8. Links to Vendor Documentation
8.1. HPE Cray Links
8.2. SUSE Links
8.3. GNU Links
8.4. AMD Links
8.5. Intel Links
8.6. NVIDIA Links
8.7. Debugger Links

1. Introduction

1.1. Document Scope and Assumptions

This document provides an overview and introduction to the use of the HPE Cray EX, Narwhal, located at the Navy DSRC, along with a description of the specific computing environment on Narwhal. The intent of this guide is to provide information that will enable the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:

Use of the UNIX operating system
Use of an editor (e.g., vi or emacs)
Remote usage of computer systems via network or modem access
A selected programming language and its related tools and libraries

1.2. Policies to Review

Users are expected to be aware of the following policies for working on Narwhal.

1.2.1. Login Node Abuse Policy

The login nodes provide login access for Narwhal and support such activities as compiling, editing, and general interactive use by all users. Consequently, memory- or CPU-intensive programs running on the login nodes can significantly affect all users of the system. Therefore, only small serial applications requiring less than 15 minutes of compute time and less than 8 GB of memory are recommended on the login nodes. Any jobs running on the login nodes that exceed these limits will be terminated.

1.2.2. Workspace Purge Policy

Close management of space in the /p/work1 and /p/work2 file systems is a high priority. Files in either of these file systems that have not been accessed in 21 days are subject to the purge cycle. If available space becomes critically low, a manual purge may be run, and all files in either of these file systems are eligible for deletion. Using the touch command (or similar commands) to prevent files from being purged is prohibited. Users are expected to keep up with file archival and removal within the normal purge cycles.

Note! If the normal purge cycle determines that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. You are responsible for monitoring your workspace to prevent data loss.
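As a rough aid for monitoring your workspace, the sketch below lists files whose last access time is older than a given number of days, so they can be archived or removed before the purge cycle reaches them. It uses only the standard find command. The function name list_stale_files and the 14-day warning threshold are illustrative assumptions and are not part of any Narwhal tooling; the actual purge is driven by the center's own scans.

```shell
#!/bin/bash
# Hypothetical helper: print files under a directory whose last access is
# older than DAYS days (candidates to archive or delete).
list_stale_files() {
    local dir="$1"
    local days="${2:-14}"   # assumed warning threshold, below the 21-day limit
    # -atime +N matches files whose age since last access, in whole days, exceeds N.
    find "$dir" -type f -atime +"$days" -print
}

# Example: report stale files in your work directory, if it is defined.
if [ -n "${WORKDIR:-}" ] && [ -d "${WORKDIR}" ]; then
    list_stale_files "${WORKDIR}" 14
fi
```

Listing files this way does not read their contents, so it should not by itself refresh their access times.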

1.3. Obtaining an Account

The process of getting an account on the HPC systems at any of the DSRCs begins with getting an account on the HPCMP Portal to the Information Environment, commonly called a "pIE User Account." If you do not yet have a pIE User Account, please visit HPC Centers: Obtaining an Account and follow the instructions there. If you need assistance with any part of this process, please contact the HPC Help Desk at accounts@helpdesk.hpc.mil.

1.4. Requesting Assistance

The HPC Help Desk is available to help users with unclassified problems, issues, or questions. Analysts are on duty 8:00 a.m. - 8:00 p.m. Eastern, Monday - Friday (excluding Federal holidays).

Web: https://helpdesk.hpc.mil
E-mail: help@helpdesk.hpc.mil
Phone: 1-877-222-2039 or (937) 255-0679

You can contact the Navy DSRC in any of the following ways for after-hours support and for support services not provided by the HPC Help Desk:

E-mail: dsrchelp@navydsrc.hpc.mil
Phone: 1-800-993-7677 or (228) 688-7677
Fax: (228) 688-4356
U.S. Mail:
Navy DoD Supercomputing Resource Center
1002 Balch Boulevard
Stennis Space Center, MS 39522-5001

For more detailed contact information, please see our Contact Page.

2. System Configuration

2.1. System Summary

Narwhal is an HPE Cray EX system. The login, compute, large-memory, visualization accelerated, and MLA accelerated compute nodes are populated with AMD Epyc ROME 7H12 processors clocked at 2.6 GHz.

Narwhal uses the HPE Slingshot interconnect as its high-speed network for MPI messages and I/O traffic. Narwhal uses Lustre to manage its parallel file system that targets the disk RAID arrays.

Narwhal has 2,176 compute nodes that share memory only on the node; memory is not shared across the nodes.

Each standard compute node has two 64-core processors (128 cores) sharing 256 GB of DDR4 memory.

Each large-memory compute node has two 64-core processors (128 cores) sharing 1 TB of DDR4 memory, with a 1.8-TB /tmp/scratch SSD.

Each visualization compute node has an NVIDIA V100-PCIE GPU and 256 GB of DDR4 3200-MHz memory.

Each single Machine Learning Accelerator (MLA) node has one NVIDIA V100-PCIE GPU and 256 GB of DDR4 3200-MHz memory, with an 880-GB /tmp/scratch SSD.

Each dual Machine Learning Accelerator (MLA) node has two NVIDIA V100-PCIE GPUs and 256 GB of DDR4 3200-MHz memory, with a 900-GB /tmp/scratch SSD.

Narwhal is rated at 12.8 peak PFLOPS and has 16 PB (formatted) of parallel disk storage.

Narwhal is intended to be used as a batch-scheduled HPC system. Its login nodes are not to be used for large computational (e.g., memory, I/O, long executions) work. All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission.

Node Configuration
	Login	Standard	Large-Memory	Visualization	MLA 1-GPU	MLA 2-GPU
Total Nodes	11	2,304	26	16	32	32
Processor	AMD 7H12 Rome	AMD 7H12 Rome	AMD 7H12 Rome	AMD 7H12 Rome	AMD 7H12 Rome	AMD 7H12 Rome
Processor Speed	2.6 GHz	2.6 GHz	2.6 GHz	2.6 GHz	2.6 GHz	2.6 GHz
Sockets / Node	2	2	2	2	2	2
Cores / Node	128	128	128	128	128	128
Total CPU Cores	1,408	294,912	3,328	2,048	4,096	4,096
Usable Memory / Node	226 GB	238 GB	995 GB	234 GB	239 GB	239 GB
Accelerators / Node	None	None	None	1	1	2
Accelerator	n/a	n/a	n/a	NVIDIA V100 PCIe 3	NVIDIA V100 PCIe 3	NVIDIA V100 PCIe 3
Memory / Accelerator	n/a	n/a	n/a	32 GB	32 GB	32 GB
Storage on Node	880 GB SSD	None	1.8 TB SSD	None	880 GB SSD	880 GB SSD
Interconnect	HPE Slingshot	HPE Slingshot	HPE Slingshot	HPE Slingshot	HPE Slingshot	HPE Slingshot
Operating System	SLES	SLES	SLES	SLES	SLES	SLES

File Systems on Narwhal
Path	Formatted Capacity	File System Type	Storage Type	User Quota	Minimum File Retention
/p/home ($HOME)	672 TB	Lustre	HDD	250 GB	None
/p/work1 ($WORKDIR)	14 PB	Lustre	HDD	100 TB	21 Days
/p/work2	1.1 PB	Lustre	NVMe SSD	25 TB	21 Days
/p/cwfs ($CENTER)	3.3 PB	GPFS	HDD	100 TB	180 Days
/p/app ($PROJECTS_HOME)	336 TB	Lustre	HDD	None	None

2.2. Processors

Narwhal uses the 2.6-GHz AMD Epyc 7H12 processors on its login nodes. There are two processors per node, each with 64 cores, for a total of 128 cores per node. These processors have 64 KB of L1 cache per core, 512 KB of L2 cache per core, and 256 MB of shared L3 cache.

Narwhal uses the 2.6-GHz AMD Epyc 7H12 processors on its standard compute nodes. There are two processors per node, each with 64 cores, for a total of 128 cores per node.

Narwhal uses the 2.6-GHz AMD Epyc 7H12 processors on its data transfer nodes. There are two processors per node, each with 64 cores, for a total of 128 cores per node.

Narwhal uses the 2.6-GHz AMD Epyc 7H12 processors on its large memory nodes. There are two processors per node, each with 64 cores, for a total of 128 cores per node.

Narwhal's visualization accelerated nodes use 2.6-GHz AMD Epyc 7H12 processors and one NVIDIA Tesla V100-PCIE GPU with 5,120 CUDA cores and 640 Tensor cores.

Narwhal's Single Machine Learning Accelerator (1-MLA) nodes use 2.6-GHz AMD Epyc 7H12 processors and one NVIDIA Tesla V100-PCIE GPU with 5,120 CUDA cores and 640 Tensor cores.

Narwhal's Dual Machine Learning Accelerator (2-MLA) nodes use 2.6-GHz AMD Epyc 7H12 processors and two NVIDIA Tesla V100-PCIE GPUs, each with 5,120 CUDA cores and 640 Tensor cores.

2.3. Memory

Narwhal uses both shared- and distributed-memory models. Memory is shared among all the cores on a node, but is not shared among the nodes across the cluster.

Each login node contains 256 GB of main memory. All memory and cores on the node are shared among all users who are logged in. Therefore, users should not use more than 8 GB of memory at any one time.

Each standard compute node contains 246 GB of user-accessible shared memory.

Each large-memory compute node contains 990 GB of user-accessible shared memory.

Each visualization accelerated node contains 246 GB of user-accessible shared memory on the standard compute portion of the node and approximately 32 GB on the NVIDIA Tesla V100-PCIE portion of the node.

Each 1-MLA accelerated node contains 246 GB of user-accessible shared memory on the standard compute portion of the node and approximately 32 GB on the NVIDIA Tesla V100-PCIE portion of the node.

Each 2-MLA accelerated node contains 246 GB of user-accessible shared memory on the standard compute portion of the node and approximately 32 GB on each of its two NVIDIA Tesla V100-PCIE GPUs.

2.4. Operating System

The operating system on Narwhal is SUSE Linux Enterprise Server (SLES).

2.5. File Systems

Narwhal has the following file systems available for user storage:

2.5.1. /p/home

This file system is locally mounted from Narwhal's Lustre file system and has a formatted capacity of 739 TB. All users have a home directory located on this file system which can be referenced by the environment variable $HOME.

2.5.2. /p/work1

This file system is locally mounted from Narwhal's Lustre file system and is tuned for parallel I/O. It has a formatted capacity of 14.7 PB. All users have a work directory located on this file system which can be referenced by the environment variable $WORKDIR. This file system is not backed up. Users are responsible for making backups of their files to the archive server or to some other local system.

2.5.3. /p/work2

The /p/work2 file system is a locally mounted Lustre file system composed of NVMe SSDs. It has a formatted capacity of 1.1 PB and is not backed up. Users are responsible for making backups of their files to the archive server or to some other local system.

2.5.4. /archive

This NFS-mounted file system is accessible from the login and transfer nodes on Narwhal. Files in this file system are subject to migration to tape, and access may be slower due to the overhead of retrieving files from tape. It has a formatted capacity of 60 TB and is backed by a petascale archival tape storage system. Users should migrate all important files to this area for long-term storage. All users have a directory located on this file system which can be referenced by the environment variable $ARCHIVE_HOME.

2.5.5. /p/cwfs/

This path is directed to the Center-Wide File System (CWFS) which is meant for short-term storage (no longer than 180 days). All users have a directory defined in this file system which can be referenced by the environment variable $CENTER. It is accessible from the HPC system login nodes and the HPC Portal. The CWFS has a formatted capacity of 3300 TB and is managed by IBM's Spectrum Scale (formerly GPFS).

2.6. Peak Performance

Narwhal is rated at 12.8 peak PFLOPS.

3. Accessing the System

3.1. Kerberos

A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Kerberos is a network authentication tool that provides secure communication by using secret cryptographic keys. Only users with a valid HPCMP Kerberos authentication can gain access to Narwhal. More information about installing Kerberos clients on your desktop can be found at HPC Centers: Kerberos & Authentication.

3.2. Logging In

The system host name for the Narwhal cluster is narwhal.navydsrc.hpc.mil, which will redirect the user to one of eleven login nodes. Hostnames and IP addresses for these nodes are available upon request from the HPC Help Desk.

Kerberized SSH
The recommended method is to use dynamic assignment, as follows:
% ssh -l username narwhal.navydsrc.hpc.mil
Alternatively, you can manually specify a particular login node, as follows:
% ssh -l username narwhal#.navydsrc.hpc.mil (# = 01 - 11)

3.3. File Transfers

File transfers to DSRC systems (except those to the local archive system) must be performed using Kerberized versions of the following tools: scp, mpscp, sftp, and kftp. Before using any Kerberized tool, you must use a Kerberos client to obtain a Kerberos ticket. Information about installing and using a Kerberos client can be found at HPC Centers: Kerberos & Authentication.

The command below uses secure copy (scp) to copy a single local file into a destination directory on a Narwhal login node. The mpscp command is similar to the scp command but uses a different underlying means of data transfer, which may enable a greater transfer rate. The mpscp command has the same syntax as scp.

% scp local_file user@narwhal.navydsrc.hpc.mil:/target_dir

Both scp and mpscp can be used to send multiple files. This command transfers all files with the .txt extension to the same destination directory.

% scp *.txt user@narwhal.navydsrc.hpc.mil:/target_dir

The example below uses the secure file transfer protocol (sftp) to connect to Narwhal, then uses the sftp "cd" and "put" commands to change to the destination directory and copy a local file there. The sftp "quit" command ends the sftp session. Use the sftp "help" command to see a list of all sftp commands.

% sftp user@narwhal.navydsrc.hpc.mil

sftp> cd target_dir
sftp> put local_file
sftp> quit

The Kerberized file transfer protocol (kftp) command differs from sftp in that your username is not specified on the command line, but given later when prompted. The kftp command may not be available in all environments.

% kftp narwhal.navydsrc.hpc.mil

username> user
kftp> cd target_dir
kftp> put local_file
kftp> quit

Windows users may use a graphical file transfer protocol (ftp) client such as FileZilla.

4. User Environment

4.1. User Directories

The following user directories are provided for all users on Narwhal.

4.1.1. Home Directory

When you log on to Narwhal, you will be placed in your home directory, /p/home/username. The environment variable $HOME is automatically set for you and refers to this directory. $HOME is visible to both the login and compute nodes, and may be used to store small user files. It has an initial quota of 250 GB. $HOME is not intended as permanent storage, but files stored in $HOME are not subject to being purged.

4.1.2. Work Directory

Narwhal has two file systems, /p/work1 and /p/work2, for the temporary storage of data files needed for executing programs. You may access your /p/work1 directory by using the $WORKDIR environment variable, which is set for you upon login. Your $WORKDIR directory has an initial quota of 100 TB, and your /p/work2 directory has an initial quota of 25 TB. The work file systems will fill up as jobs run. Please review the Purge Policy and be mindful of your disk usage.

REMEMBER: /p/work1 and /p/work2 are "scratch" file systems and are not backed up. You are responsible for managing your files in these directories by backing up files to the archive server and deleting unneeded files when your jobs end. See the section below on Archive Usage for details.

All of your jobs should execute from your $WORKDIR directory, not $HOME. While not technically forbidden, jobs that are run from $HOME are subject to smaller disk space quotas and have a much greater chance of failing if problems occur with that resource.

To avoid unusual errors that can arise from two jobs using the same scratch directory, a common technique is to create a unique subdirectory for each batch job by including the following lines in your batch script:

TMPD=${WORKDIR}/${PBS_JOBID}
mkdir -p ${TMPD}
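
A slightly fuller sketch of the same idea is shown below: the script creates the per-job directory, changes into it, and reports where it is running. The fallback values for $WORKDIR and $PBS_JOBID are assumptions added only so the fragment can be tried outside the batch system; inside a real job both variables are already set.

```shell
#!/bin/bash
# Sketch of the body of a batch script that works in a per-job directory.
# The two fallbacks below are assumptions for running outside a batch job.
WORKDIR="${WORKDIR:-$(mktemp -d)}"
PBS_JOBID="${PBS_JOBID:-interactive.$$}"

TMPD="${WORKDIR}/${PBS_JOBID}"
mkdir -p "${TMPD}"

# Run from the job directory so files from different jobs never collide.
cd "${TMPD}" || exit 1
echo "Job ${PBS_JOBID} running in $(pwd)"
```

Because the directory name contains the job ID, two jobs submitted from the same script cannot overwrite each other's scratch files.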

4.1.3. Center Directory

The Center-Wide File System (CWFS) provides file storage that is accessible from Narwhal's login nodes. The CWFS allows file transfers and other file and directory operations from Narwhal using simple Linux commands. Each user has their own directory in the CWFS. The name of your CWFS directory may vary between machines and between centers, but the environment variable $CENTER will always refer to this directory.

The example below shows how to copy a file from your work directory on Narwhal to the CWFS ($CENTER). While logged into Narwhal, copy your file from your Narwhal work directory to the CWFS.

% cp $WORKDIR/filename $CENTER

4.2. Shells

The following shells are available on Narwhal: csh, bash, ksh, tcsh, zsh, and sh. To change your default shell, please email a request to require@hpc.mil. Your preferred shell will become your default shell on the Narwhal cluster within 1-2 working days.

4.3. Environment Variables

A number of environment variables are provided by default on all HPCMP HPC systems. We encourage you to use these variables in your scripts where possible. Doing so will help to simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems.

4.3.1. Login Environment Variables

The following environment variables are common to both the login and batch environments:

Common Environment Variables
Variable	Description
$ARCHIVE_HOME	Your directory on the archive server.
$ARCHIVE_HOST	The host name of the archive server.
$BC_HOST	The generic (not node specific) name of the system.
$CC	The currently selected C compiler. This variable is automatically updated when a new compiler environment is loaded.
$CENTER	Your directory on the Center-Wide File System (CWFS).
$CSE_HOME	This variable contains the path to the base directory of the default installation of the Computational Science Environment (CSE) installed on a particular compute platform. The variable is set when the cseinit module is loaded. (See BC policy FY13-01 for CSE details.)
$CSI_HOME	The directory containing the following list of heavily used application packages: ABAQUS, Accelrys, ANSYS, CFD++, Cobalt, EnSight, Fluent, GASP, Gaussian, LS-DYNA, MATLAB, and TotalView, formerly known as the Consolidated Software Initiative (CSI) list. Other application software may also be installed here by our staff.
$CXX	The currently selected C++ compiler. This variable is automatically updated when a new compiler environment is loaded.
$DAAC_HOME	The directory containing DAAC supported visualization tools ParaView, VisIt, and EnSight.
$F77	The currently selected Fortran 77 compiler. This variable is automatically updated when a new compiler environment is loaded.
$F90	The currently selected Fortran 90 compiler. This variable is automatically updated when a new compiler environment is loaded.
$HOME	Your home directory on the system.
$JAVA_HOME	The directory containing the default installation of Java.
$KRB5_HOME	The directory containing the Kerberos utilities.
$PET_HOME	The directory containing the tools formerly installed and maintained by the PET staff. This variable is deprecated and will be removed from the system in the future. Certain tools will be migrated to $CSE_HOME, as appropriate.
$PROJECTS_HOME	A common directory where group-owned and supported applications and codes may be maintained for use by members of a group. Any project may request a group directory under $PROJECTS_HOME.
$SAMPLES_HOME	The Sample Code Repository. This is a collection of sample scripts and codes provided and maintained by our staff to help users learn to write their own scripts. There are a number of ready-to-use scripts for a variety of applications.
$WORKDIR	Your work directory on the local temporary file system (i.e., local high-speed disk).

4.3.2. Batch-Only Environment Variables

In addition to the variables listed above, the following variables are automatically set only in your batch environment. That is, your batch scripts will be able to see them when they run. These variables are supplied for your convenience and are intended for use inside your batch scripts.

Batch-Only Environment Variables
Variable	Description
$BC_CORES_PER_NODE	The number of cores per node for the compute node on which a job is running.
$BC_MEM_PER_NODE	The approximate maximum user-accessible memory per node (in integer MB) for the compute node on which a job is running.
$BC_MPI_TASKS_ALLOC	The number of MPI tasks allocated for a job.
$BC_NODE_ALLOC	The number of nodes allocated for a job.
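
The fragment below is a minimal sketch of how these variables might be used inside a batch script to derive job-wide figures instead of hard-coding them. The default values are assumptions that only make the fragment runnable outside a batch job (the memory default is simply 238 GB expressed in MB); inside a job, the batch environment supplies the real values.

```shell
#!/bin/bash
# Defaults are illustrative assumptions for use outside a batch job.
BC_CORES_PER_NODE="${BC_CORES_PER_NODE:-128}"
BC_NODE_ALLOC="${BC_NODE_ALLOC:-2}"
BC_MEM_PER_NODE="${BC_MEM_PER_NODE:-243712}"   # integer MB; illustrative value

# Derive job-wide figures with integer shell arithmetic.
total_cores=$(( BC_CORES_PER_NODE * BC_NODE_ALLOC ))
mem_per_core_mb=$(( BC_MEM_PER_NODE / BC_CORES_PER_NODE ))

echo "Nodes allocated : ${BC_NODE_ALLOC}"
echo "Total cores     : ${total_cores}"
echo "Memory per core : ${mem_per_core_mb} MB (approximate)"
```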

4.4. Modules

Software modules are a convenient way to set needed environment variables and include necessary directories in your path so that commands for particular applications can be found. Narwhal uses "modules" to initialize your environment with COTS application software, system commands and libraries, compiler suites, environment variables, and PBS batch system commands.

A number of modules are loaded automatically as soon as you log in. To see the modules which are currently loaded, use the "module list" command. To see the entire list of available modules, use "module avail". You can modify the configuration of your environment by loading and unloading modules. For complete information on how to do this, see the Modules User Guide.

4.5. Archive Usage

All of our HPC systems have access to an online archival mass storage system that provides long-term storage for users' files on a petascale archival storage system that resides on a robotic tape library system. A 60-TB disk cache sits in front of the tape file system and temporarily holds files while they are being transferred to or from tape.

Tape file systems have very slow access times. The tapes must be robotically pulled from the tape library, mounted in one of the limited number of tape drives, and wound into position for file archival or retrieval. For this reason, users should always tar up their small files in a large tarball when archiving a significant number of files. A good maximum target size for tarballs is about 500 GB or less. At that size, the time required for file transfer and tape I/O is reasonable. Files larger than 1 TB may span more than one tape, which will greatly increase the time required for both archival and retrieval.

The environment variables $ARCHIVE_HOST and $ARCHIVE_HOME are automatically set for you. $ARCHIVE_HOST can be used to reference the archive server, and $ARCHIVE_HOME can be used to reference your archive directory on the server. These variables can be used when transferring files to/from archive.

4.5.1. Archival Command Synopsis

A synopsis of the main archival utilities is listed below. For information on additional capabilities, see the Archive User Guide or read the online man pages that are available on each system. These commands are non-Kerberized and can be used in batch submission scripts if desired.

Copy one or more files from the archive server
rcp ${ARCHIVE_HOST}:${ARCHIVE_HOME}/file_name ${WORKDIR}/proj1
List files and directory contents on the archive server
rsh ${ARCHIVE_HOST} ls [lsopts] [file/dir ...]
Create directories on the archive server
rsh ${ARCHIVE_HOST} mkdir [-p] [-s] dir1 [dir2 ...]
Copy one or more files to the archive server
rcp ${WORKDIR}/proj1/file_name ${ARCHIVE_HOST}:${ARCHIVE_HOME}/proj1
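
Following the tarball advice in Section 4.5, the sketch below bundles a directory into a single tarball, checks it against a size limit, and then copies it to the archive server with rcp. The function name make_archive_tarball is hypothetical, proj1 is the placeholder directory used in the synopsis above, and the 500-GB default is the target size suggested earlier; treat it as a starting point rather than a supported utility.

```shell
#!/bin/bash
# Hypothetical helper: bundle a directory into one tarball and warn if the
# result is larger than a size limit (default 500 GB).
make_archive_tarball() {
    local src_dir="$1"
    local tarball="$2"
    local limit_gb="${3:-500}"
    # -C keeps paths inside the tarball relative to the parent of src_dir.
    tar -cf "$tarball" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1
    local size_bytes
    size_bytes=$(wc -c < "$tarball")
    if (( size_bytes > limit_gb * 1024 * 1024 * 1024 )); then
        echo "Warning: ${tarball} exceeds ${limit_gb} GB; consider splitting it." >&2
        return 2
    fi
}

# Example use in a batch script; skipped when the archive variables are unset.
if [ -n "${ARCHIVE_HOST:-}" ] && [ -d "${WORKDIR:-/nonexistent}/proj1" ]; then
    make_archive_tarball "${WORKDIR}/proj1" "${WORKDIR}/proj1.tar" &&
        rcp "${WORKDIR}/proj1.tar" "${ARCHIVE_HOST}:${ARCHIVE_HOME}/"
fi
```

Writing the tarball outside the directory being archived avoids tar trying to include its own output.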

5. Program Development

5.1. Programming Models

Narwhal supports two parallel programming models: Message Passing Interface (MPI) and Open Multi-Processing (OpenMP). A hybrid MPI/OpenMP programming model is also supported. MPI is an example of the message- or data-passing models, while OpenMP uses only shared memory on a node by spawning threads. The hybrid model combines both.

5.1.1. Message Passing Interface (MPI)

Narwhal's default Message Passing Interface (MPI) stack supports the MPI 3.1 standard, as documented by the MPI Forum. MPI is part of the software support for parallel programming across a network of computer systems through a technique known as message passing. MPI establishes a practical, portable, efficient, and flexible standard for message passing that makes use of the most attractive features of a number of existing message-passing systems, rather than selecting one of them and adopting it as the standard. See "man intro_mpi" for additional information.

When creating an MPI program on Narwhal, ensure the following:

That the default MPI module (cray-mpich) has been loaded. To check this, run the "module list" command. If cray-mpich is not listed, use the following command to load the module:
module load cray-mpich

That the source code includes one of the following lines:

INCLUDE "mpif.h"        ## for Fortran, or
#include <mpi.h>        ## for C/C++

To compile an MPI program, use the following:

ftn -o mpi_program.exe mpi_program.f     ## for Fortran, or
cc -o mpi_program.exe mpi_program.c      ## for C, or
CC -o mpi_program.exe mpi_program.cpp    ## for C++

To run an MPI program within a batch script, use the following command:

mpiexec -n mpi_procs mpi_program.exe [user_arguments]

where mpi_procs is the number of MPI processes being started. For example:

#### The following starts 256 MPI processes; 128 on each node, one per core.
## It requests 2 nodes, each with 128 cores and 128 processes per node.
#PBS -l select=2:ncpus=128:mpiprocs=128
mpiexec -n 256 ./a.out

The mpiexec command launches executables across a set of compute nodes allocated to your job and, by default, utilizes all cores and nodes available to your job. When each member of the parallel application has exited, mpiexec exits.

A common concern for MPI users is the need for more memory for each process. By default, one MPI process is started on each core of a node. This means that on Narwhal, the available memory on the node is split 128 ways. To allow an individual process to use more of the node's memory, you need to start fewer processes on that node. To accomplish this, request more nodes from PBS but run fewer processes on each of them. For example, the following select statement requests 8 nodes, with 128 cores per node, but uses only 12 of the cores on each node for MPI processes:

#### The following starts 96 MPI processes; only 12 on each node.
## It requests 8 nodes, each with 128 cores and 12 processes per node.
#PBS -l select=8:ncpus=128:mpiprocs=12
mpiexec -n 96 ./a.out
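
The arithmetic behind choosing mpiprocs can be sketched as follows. The helper procs_per_node is hypothetical, and the figures are assumptions chosen for illustration: 238 GB of usable memory per standard node (from the Node Configuration table) and a code that needs about 19 GB per process.

```shell
#!/bin/bash
# Hypothetical helper: how many MPI processes fit on one node if each needs
# mem_per_proc_gb of memory? The result is capped at the core count.
procs_per_node() {
    local node_mem_gb="$1"
    local mem_per_proc_gb="$2"
    local cores_per_node="${3:-128}"
    local n=$(( node_mem_gb / mem_per_proc_gb ))   # integer division rounds down
    if [ "$n" -gt "$cores_per_node" ]; then n="$cores_per_node"; fi
    echo "$n"
}

# Assumed figures: 238 GB usable per node, about 19 GB needed per process.
mpiprocs=$(procs_per_node 238 19)
nodes=8
echo "#PBS -l select=${nodes}:ncpus=128:mpiprocs=${mpiprocs}"
echo "mpiexec -n $(( nodes * mpiprocs )) ./a.out"
```

With these assumed figures the helper yields 12 processes per node, which reproduces the select statement and the 96-process launch line shown above.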

For more information about mpiexec, type "man mpiexec". The aprun command can also be used to launch parallel executables. For more information on aprun, type "man aprun".

5.1.2. Open Multi-Processing (OpenMP)

OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications. It supports shared-memory multiprocessing programming in C, C++, and Fortran, and consists of a set of compiler directives, library routines, and environment variables that influence compilation and run-time behavior.

When creating an OpenMP program on Narwhal, ensure the following:

If using OpenMP functions (for example, omp_get_wtime), that the source code includes one of the following lines:
```
INCLUDE 'omp_lib.h'  ## for Fortran,  or
#include <omp.h>    ## for C/C++
```
Or, if the code is written in Fortran 90 or later, the following line may be used instead:

USE omp_lib

That the compile command includes an option to reference the OpenMP library. The Cray, Intel, GNU, AOCC, and NVIDIA compilers support OpenMP, and each one uses a different option.

To compile an OpenMP program, use the following examples:

For C codes:

cc -fopenmp -o OpenMP_program.exe OpenMP_program.c ## Cray, GNU, AOCC, NVIDIA
cc -qopenmp -o OpenMP_program.exe OpenMP_program.c ## Intel

For C++ codes:

CC -fopenmp -o OpenMP_program.exe OpenMP_program.cpp ## Cray, GNU, AOCC, NVIDIA
CC -qopenmp -o OpenMP_program.exe OpenMP_program.cpp ## Intel

For Fortran codes:

ftn -fopenmp -o OpenMP_program.exe OpenMP_program.f  ## Cray, GNU, AOCC, NVIDIA 
ftn -qopenmp -o OpenMP_program.exe OpenMP_program.f  ## Intel

See section 5.2 for additional information on available compilers.

When running OpenMP applications, the $OMP_NUM_THREADS environment variable must be used to specify the number of threads. For example:

export OMP_NUM_THREADS=128
./OpenMP_program.exe [user_arguments]

In the example above, the application starts the OpenMP_program on one node and spawns a total of 128 threads. Since Narwhal has 128 cores per compute node, this yields 1 thread per core.

5.1.3. Hybrid Processing (MPI/OpenMP)

An application built with the hybrid model of parallel programming can run on Narwhal using both OpenMP and Message Passing Interface (MPI). In hybrid applications, OpenMP threads can be spawned by MPI processes, but MPI calls should not be issued from OpenMP parallel regions or by an OpenMP thread.

When creating a hybrid (MPI/OpenMP) program on Narwhal, follow the instructions in the MPI and OpenMP sections above for creating your program. Then use the compilation instructions for OpenMP.

To run a hybrid program within a batch script, set $OMP_NUM_THREADS equal to the number of threads in the team. Then launch your program using mpiexec as follows:

####  MPI/OpenMP on 4 nodes, 8 MPI processes total with 6 threads each
## request 4 nodes, each with 128 cores and 2 processes per node
#PBS -l select=4:ncpus=128:mpiprocs=2:ompthreads=6
## assign 8 MPI processes with 2 MPI processes per node
export OMP_NUM_THREADS=6
mpiexec -n 8 ./mpi_program
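
The relationship between nodes, MPI processes, and threads in a hybrid job can be checked with a small helper like the one sketched below. The function name hybrid_layout is hypothetical; it only verifies that the MPI processes per node multiplied by the threads per process fit within the cores on a node, then prints the matching directive and launch line.

```shell
#!/bin/bash
# Hypothetical helper: verify that a hybrid layout fits on a node and print
# the matching PBS directive, thread setting, and launch line.
hybrid_layout() {
    local nodes="$1" mpiprocs="$2" threads="$3" cores_per_node="${4:-128}"
    local cores_used=$(( mpiprocs * threads ))
    if [ "$cores_used" -gt "$cores_per_node" ]; then
        echo "Error: ${cores_used} threads per node exceed ${cores_per_node} cores" >&2
        return 1
    fi
    echo "#PBS -l select=${nodes}:ncpus=${cores_per_node}:mpiprocs=${mpiprocs}:ompthreads=${threads}"
    echo "export OMP_NUM_THREADS=${threads}"
    echo "mpiexec -n $(( nodes * mpiprocs )) ./mpi_program"
}

# 4 nodes, 2 MPI processes per node, 6 threads per process.
hybrid_layout 4 2 6
```

For the layout above, 2 processes with 6 threads each occupy 12 of the 128 cores on a node, and 4 nodes with 2 processes each give the 8 MPI processes passed to mpiexec.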

5.2. Available Compilers

Narwhal has five programming environment suites.

Cray (HPE Cray compiler)
Intel
GNU
AMD Optimizing C/C++ Compiler (AOCC)
NVIDIA HPC SDK

Cray provides a convenient set of compiler wrappers that should be used for compiling and linking programs. The wrapper invokes the back-end compiler in the currently loaded programming environment. Flags and options given to the wrappers are passed to the back-end compiler as appropriate. Additional information about the wrappers can be found in the man pages for cc, CC and ftn.

Common Compiler Commands
Language	Cray	Intel	GNU	AOCC	NVIDIA	Serial/Parallel
C	cc	cc	cc	cc	cc	Serial/Parallel
C++	CC	CC	CC	CC	CC	Serial/Parallel
Fortran 77	ftn	ftn	ftn	ftn	ftn	Serial/Parallel
Fortran 90	ftn	ftn	ftn	ftn	ftn	Serial/Parallel

The Cray programming environment (CCE), PrgEnv-cray, is loaded for you by default. To use a different programming suite, you will need to change the programming environment via the 'module swap' command. See Relevant Modules (below) to learn how.

5.2.1. Cray Compiler Environment

HPE Cray Programming environment has C, C++ and Fortran compilers that are designed to extract increased performance from the systems, regardless of the underlying architecture.

The following table lists some of the more common options that you may use:

Cray Compiler Options
Option	Purpose
-c	Generate intermediate object file but do not attempt to link.
-`I` directory	Search in directory for include or module files.
-L directory	Search in directory for libraries.
-o outfile	Name executable "`outfile`" rather than the default "`a.out`".
-Olevel	Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
-f free	Process Fortran codes using free form.
-f pic, or -f PIC	Generate position-independent code (PIC) for shared libraries.
-h byteswapio	Big-endian files; the default is little-endian.
-g	Generate symbolic debug information.
-h noomp	Disable OpenMP directives.
-f openmp	Recognize OpenMP directives.
-h dynamic	Enable dynamic linking of libraries at run time. Use "dynamic" with cc, CC and ftn compiler wrappers.
-Ktrap=fp	Trap floating point, divide by zero, and overflow exceptions.

Detailed information about these and other compiler options is available in the Cray compiler (craycc, crayCC, and crayftn) man pages on Narwhal.

5.2.2. Intel Compiler Environment

The following table lists some of the more common options that you may use:

Intel Compiler Options
Option	Purpose
-c	Generate intermediate object file but do not attempt to link.
-`I` directory	Search in directory for include or module files.
-L directory	Search in directory for libraries.
-o outfile	Name executable "`outfile`" rather than the default "`a.out`".
-Olevel	Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
-free	Process Fortran codes using free form.
-fpic, or -fPIC	Generate position-independent code for shared libraries.
-convert big_endian	Big-endian files; the default is little-endian.
-g	Generate symbolic debug information.
-qopenmp	Recognize OpenMP directives.
-Bdynamic	Compiling using shared objects.
-fpe-all=0	Trap floating point, divide by zero, and overflow exceptions.

Detailed information about these and other compiler options is available in the Intel compiler (ifort, icc, and icpc) man pages on Narwhal.

5.2.3. GNU Compiler Environment

The GNU Programming Environment provides a large number of options that are the same for all compilers in the suite. The following table lists some of the more common options that you may use:

GNU Compiler Options
Option	Purpose
-c	Generate intermediate object file but do not attempt to link.
-`I` directory	Search in directory for include or module files.
-L directory	Search in directory for libraries.
-o outfile	Name executable "`outfile`" rather than the default "`a.out`".
-Olevel	Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
-g	Generate symbolic debug information.
-fconvert=big-endian	Big-endian files; the default is little-endian.
-Wextra -Wall	Turns on increased error reporting.

Detailed information about these and other compiler options is available in the GNU compiler (gcc, g++, and gfortran) man pages on Narwhal.

5.2.4. AOCC Compiler Environment

The AOCC compiler system is a high performance, production quality code generation tool. The AOCC environment provides various options to users when building and optimizing C, C++, and Fortran applications. AOCC uses LLVM's Clang as the compiler and driver for C and C++ programs, and Flang as the compiler and driver for Fortran programs.

The following table lists some of the more common options that you may use:

AOCC Compiler Options
Option	Purpose
-c	Generate intermediate object file but do not attempt to link.
-`I` directory	Search in directory for include or module files.
-L directory	Search in directory for libraries.
-o outfile	Name executable "`outfile`" rather than the default "`a.out`".
-Olevel	Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
-g	Generate symbolic debug information.
-ffree-form	Compile free form Fortran.

Detailed information about these and other compiler options is available in the AOCC compiler (clang, clang++, and flang) man pages on Narwhal.

5.2.5. NVIDIA Compiler Environment

The NVIDIA HPC Software Development Kit (SDK) is a comprehensive suite of compilers and libraries enabling users to program the entire HPC platform from the GPU to the CPU and through the interconnect. The NVIDIA HPC SDK C, C++, and Fortran compilers support GPU acceleration of HPC modeling and simulation applications with standard C++ and Fortran, OpenACC directives and CUDA.

The following table lists some of the more common options that you may use:

NVIDIA HPC SDK Compiler Options
Option	Purpose
-c	Generate intermediate object file but do not attempt to link.
-`I` directory	Search in directory for include or module files.
-L directory	Search in directory for libraries.
-o outfile	Name executable "`outfile`" rather than the default "`a.out`".
-Olevel	Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
-g	Generate symbolic debug information.
-acc	Enable parallelization using OpenACC directives. By default the compilers will parallelize and offload OpenACC regions to an NVIDIA GPU.
-gpu	Control the type of GPU for which code is generated, the version of CUDA to be targeted, and several other aspects of GPU code generation.
-Minfo=acc	Prints diagnostic information to STDERR regarding whether the compiler was able to produce GPU code successfully.

Detailed information about these and other compiler options is available in the NVIDIA compiler (nvc, nvc++, and nvfortran) man pages on Narwhal.

5.3. Relevant Modules

By default, Narwhal loads the Cray programming environment. The Intel, GNU, AOCC and NVIDIA environments are also available. To use any of these, the Cray module must be unloaded and replaced with the one you wish to use. To do this, use the "module swap" command. For more information on using modules, see the Modules User Guide.

Programming Environment Modules
Module	Module Name
Cray CCE	PrgEnv-cray
Intel	PrgEnv-intel
GNU	PrgEnv-gnu
AOCC	PrgEnv-aocc
NVIDIA	PrgEnv-nvidia
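
As an illustration of the swap described above, the following sketch builds the "module swap" command for a requested environment. The helper name swap_prgenv_cmd is hypothetical; the module names are those listed in the table, and the command is printed rather than executed so the sketch does not depend on the module system being present.

```shell
# Programming environment modules listed in the table above.
PRGENVS="PrgEnv-cray PrgEnv-intel PrgEnv-gnu PrgEnv-aocc PrgEnv-nvidia"

# swap_prgenv_cmd TARGET
# Print the command that replaces the default Cray environment with TARGET.
# Unknown module names are rejected with a non-zero return code.
swap_prgenv_cmd() {
  for env in $PRGENVS; do
    if [ "$env" = "$1" ]; then
      echo "module swap PrgEnv-cray $1"
      return 0
    fi
  done
  echo "unknown programming environment: $1" >&2
  return 1
}

swap_prgenv_cmd PrgEnv-gnu
```

On Narwhal you would normally type the printed command directly at the prompt, and then compile with the cc, CC, or ftn wrappers as before.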

5.4. Libraries

Cray's Scientific and Math Libraries and Intel's Math Kernel Library (MKL) are both available on Narwhal. In addition, an extensive suite of math and science libraries is available in the $CSE_HOME directory.

5.4.1. Cray Scientific and Math Libraries (CSML) LibSci

The Cray Scientific and Math Libraries (CSML, also known as LibSci) is a collection of numerical routines optimized for best performance on Cray systems. All programming environment modules load cray-libsci by default, except when noted.

Most users, on most codes, will find they obtain better performance by using calls to Cray LibSci routines in their applications instead of calls to public domain or user-written versions.

Note: Cray EX systems also make use of the Cray LibSci Accelerator routines for enhanced performance on GPU-equipped compute nodes. For more information, see the intro_libsci_acc man page.

The CSML collection contains the following Scientific Libraries:

Basic Linear Algebra Subroutines (BLAS) - Levels 1, 2, and 3
C interface to the legacy BLAS (CBLAS)
Basic Linear Algebra Communication Subprograms (BLACS)
Linear Algebra Package (LAPACK)
Scalable LAPACK (ScaLAPACK) (distributed-memory parallel set of LAPACK routines)
Fast Fourier Transform (FFT)
Fastest FFT in the West Routines (FFTW versions 2 and 3)
Accelerated BLAS and LAPACK routines (LibSci_ACC)

Two libraries unique to Cray are also included:

Iterative Refinement Toolkit (IRT)
CrayBLAS (library of BLAS routines autotuned for Cray EX series)

The IRT routines may be used by setting the environment variable $IRT_USE_SOLVERS to 1, or by coding an explicit call to an IRT routine. Additional information is available by using the "man intro_irt" command.

5.4.2. Intel Math Kernel Library (MKL)

Narwhal provides the Intel Math Kernel Library (Intel MKL), a set of numerical routines tuned specifically for Intel platform processors and optimized for math, scientific, and engineering applications. The routines, which are available via both FORTRAN and C interfaces, include:

LAPACK plus BLAS (Levels 1, 2, and 3)
ScaLAPACK plus PBLAS (Levels 1, 2, and 3)
Fast Fourier Transform (FFT) routines for single-precision, double-precision, single-precision complex, and double-precision complex data types
Discrete Fourier Transforms (DFTs)
Fast Math and Fast Vector Library
Vector Statistical Library Functions (VSL)
Vector Transcendental Math Functions (VML)

The MKL routines are part of the Intel Programming Environment as Intel's MKL is bundled with the Intel Compiler Suite.

Linking to the Intel Math Kernel Libraries can be complex and is beyond the scope of this introductory guide. Documentation explaining the full feature set along with instructions for linking can be found at the Intel Math Kernel Library documentation page.

Intel also makes a link advisor available to assist users with selecting proper linker and compiler options: http://software.intel.com/sites/products/mkl.

5.4.3. Additional Math Libraries

There is also an extensive set of Math libraries available in the $CSE_HOME directory (/app/CSE) on Narwhal. Information about these libraries can be found on the Baseline Configuration website at BC policy FY13-01.

5.5. Debuggers

Cray provides a collection of debugging tools that are referred to as the Cray Debugger Support Tools (CDST). Narwhal also provides Forge and the GNU Project Debugger (gdb) to assist users in debugging their code.

Narwhal supports a variety of debugging options ranging from simple command-line debuggers to separately licensed third-party GUI tools. These options are capable of performing a variety of tasks ranging from analyzing core files to setting breakpoints and debugging running parallel programs.

As a rule, your code must be compiled using the -g command line option.

5.5.1. Cray Debugger Support Tools (CDST)

Cray provides a collection of debugging packages that include the following: gdb4hpc, valgrind4hpc, STAT, ATP and CCDB.

5.5.1.1. Gdb4hpc

gdb4hpc is a GDB-based parallel debugger used to debug applications compiled with CCE, Intel and GNU C, C++ and Fortran compilers. It allows users to either launch an application or attach to an already-running application. This debugger can be accessed by loading the gdb4hpc module. Detailed information about this debugger can be found in the gdb4hpc man page on Narwhal.

5.5.1.2. Valgrind4hpc

Valgrind4hpc is a Valgrind-based debugging tool used to detect memory leaks and errors in parallel applications. Valgrind4hpc aggregates any duplicate messages across ranks to help provide an understandable picture of program behavior. This tool can be accessed by loading the valgrind4hpc module. Detailed information can be found in the valgrind4hpc man page on Narwhal.

5.5.1.3. Stack Trace Analysis Tool (STAT)

STAT is a single merged stack backtrace tool to analyze application behavior at the function level. It helps trace down the cause of crashes. This tool can be accessed by loading the cray-stat module. Detailed information can be found in the STAT man page on Narwhal.

5.5.1.4. Abnormal Termination Processing (ATP)

ATP is a scalable core file generation and analysis tool for analyzing crashes. It helps determine the cause of crashes. This tool can be accessed by loading the atp module. Detailed information can be found in the atp man page on Narwhal.

5.5.1.5. Cray Comparative Debugger (CCDB)

CCDB is Cray's next generation debugging tool. It features a GUI interface that extends the comparative debugging capabilities of lgdb, enabling users to easily compare data structures between two executing applications. This tool can be accessed by loading the cray-ccdb module. Detailed information can be found in the ccdb man page on Narwhal.

5.5.2. Forge (formerly DDT)

DDT is a debugger that supports threads, MPI, OpenMP, C/C++, Fortran, Co-Array Fortran, UPC, and CUDA. Memory debugging and data visualization are supported for large-scale parallel applications. The Parallel Stack Viewer is a unique way to see the program state of all processes and threads at a glance.

DDT is a graphical debugger, so you must be able to display it via a UNIX X-Windows interface. There are several ways to do this, including SSH X11 Forwarding, HPC Portal, or SRD. Follow the steps below to use DDT via X11 Forwarding or Portal.

1. Choose a remote display method: X11 Forwarding, HPC Portal, or SRD. X11 Forwarding is easier to set up but typically very slow. HPC Portal requires no extra clients and is typically fast. SRD requires an extra client but is typically fast and may be a good option if you would otherwise do a significant amount of X11 Forwarding.
  1. To use X11 Forwarding:
    1. Ensure an X server is running on your local system. Linux users will likely have this by default, but MS Windows users need to install a third-party X Windows solution. There are various options available.
    2. For Linux users, connect to Narwhal using ssh -Y. Windows users need to use PuTTY with X11 forwarding enabled (Connection->SSH->X11->Enable X11 forwarding).
  2. Or to use HPC Portal:
    1. Navigate to https://centers.hpc.mil/portal.
    2. Select HPC Portal at NAVY.
    3. Select XTerm -> NAVY -> Narwhal.
  3. Or, for information on using SRD, see the SRD User Guide.
2. Compile your program with the -g option.
3. Submit an interactive job:

qsub -l select=1:ncpus=128:mpiprocs=128 -A Project_ID -l walltime=00:30:00 -q debug -l application=Application_Name -X -I

4. Load the Forge DDT module:

module load forge

5. Start program execution (example for 4 MPI ranks):

ddt -n 4 ./my_mpi_program arg1 arg2 ...

6. The DDT window will pop up. Verify the application name and number of MPI processes. Click "Run".

An example of using Forge can be found in $SAMPLES_HOME/Programming/Forge_Example on Narwhal.

5.5.3. GDB

The GNU Project Debugger (gdb) is a source-level debugger that can be invoked either with a program for execution or a running process id. To load your program under gdb, optionally together with a core file for post-mortem analysis, use the following command:

gdb a.out corefile

To attach gdb to a program that is already executing on a node, use the following command:

gdb a.out pid

For more information, the GDB manual can be found at http://www.gnu.org/software/gdb.

5.6. Code Profiling and Optimization

Profiling is the process of analyzing the execution flow and characteristics of your program to identify sections of code that are likely candidates for optimization, which increases the performance of a program by modifying certain aspects for increased efficiency.

We provide three profiling tools, CrayPat, gprof, and codecov, to assist you in the profiling process. In addition, a basic overview of optimization methods with information about how they may improve the performance of your code can be found in Performance Optimization Methods (below).

5.6.1. CrayPat

The Cray Performance Measurement and Analysis Tools (CrayPat) are a suite of optional utilities that enable the user to capture and analyze performance data generated during the execution of a program on a Cray system. The CrayPat suite consists of the following major components: CrayPat, CrayPat-lite, Cray Apprentice2, Reveal and the Cray PAPI components.

The simplest approach is to use CrayPat-lite, which provides basic performance analysis information automatically, with minimal user interaction. CrayPat-lite can be accessed by loading the perftools-lite module.

The first step is to load the perftools-base module.

module load perftools-base

After loading perftools-base, the command:

module avail perftools

will show all of the modules associated with Perftools. The following man pages are available after loading the perftools-base module: intro_craypat, pat_build, pat_help, craypat_lite, grid_order, app2, and reveal.

For additional information, see the Cray Performance Measurement and Analysis Tools User Guide - S-2376 on Cray's documentation website: https://pubs.cray.com.

5.6.2. gprof

The GNU Project Profiler (gprof) is a profiler that shows how your program is spending its time and which function calls are made. To profile code using gprof, use the "-pg" option during compilation.

5.6.3. Codecov

The Intel Code Coverage Tool (codecov) can be used in numerous ways to improve code efficiency and increase application performance. The tool leverages Profile-Guided optimization technology (discussed below). Coverage can be specified in the tool as file-level, function-level or block-level. Another benefit to this tool is the ability to compare the profiles of two application runs to find where the optimizations are making a difference.

5.6.4. Additional Profiling Tools

There is also a set of profiling tools available in the $CSE_HOME (/app/CSE) directory on Narwhal. Information about these tools may be found on the Baseline Configuration Web site at BC policy FY13-01.

5.6.5. Program Development Reminders

If an application is not programmed for distributed memory, then only the cores on a single node can be used. This is limited to 128 cores on Narwhal.

Keep the system architecture in mind during code development. For instance, if your program requires more memory than is available on a single node, then you will need to parallelize your code so that it can function across multiple nodes.

5.6.6. Compiler Optimization Options

The "-Olevel" option enables code optimization when compiling. The level that you choose (0-4) will determine how aggressive the optimization will be. Increasing levels of optimization may increase performance significantly, but you should note that a loss of precision may also occur. There are also additional options that may enable further optimizations. The following table contains the most commonly used options.

Compiler Optimization Options
Option	Description	Compiler Suite
-O0	No Optimization. (default in GNU)	All
-O1	Scheduling within extended basic blocks is performed. Some register allocation is performed. No global optimization.	All
-O2	Level 1 plus traditional scalar optimizations such as induction recognition and loop invariant motion are performed by the global optimizer. Generally safe and beneficial. (default in PGI & Intel)	All
-O3	Levels 1 and 2 plus more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable. Generally beneficial.	All
-fipa-*	The GNU compilers automatically enable IPA at various -O levels. To set these manually, see the options beginning with -fipa in the gcc man page.	GNU
-finline-functions	Enables function inlining within a single file	Intel
-ipon	Enables interprocedural optimization between files and produces up to n object files	Intel
-inline-level=n	Number of levels of inlining (default: n=2)	Intel
-opt-reportn	Generate optimization report with n levels of detail	Intel
-xHost	Compiler generates code with the highest instruction set available on the processor.	Intel

5.6.7. Performance Optimization Methods

Optimization generally increases compilation time and executable size, and may make debugging difficult. However, it usually produces code that runs significantly faster. The optimizations that you can use will vary depending on your code and the system on which you are running.

Note: Before considering optimization, you should always ensure that your code runs correctly and produces valid output.

In general, there are four main categories of optimization:

Global Optimization
Loop Optimization
Interprocedural Analysis and Optimization (IPA)
Function Inlining

Global Optimization

A technique that looks at the program as a whole and may perform any of the following actions:

Optimize code across all of its basic blocks
Perform control-flow and data-flow analysis for an entire program
Detect all loops, including those formed by IF and GOTO statements, and perform general optimization
Constant propagation
Copy propagation
Dead store elimination
Global register allocation
Invariant code motion
Induction variable elimination

Loop Optimization

A technique that focuses on loops (for, while, etc.) in your code and looks for ways to reduce loop iterations or parallelize the loop operations. The following types of actions may be performed:

Vectorization - rewrites loops to improve memory access performance. Some compilers may also support automatic loop vectorization by converting loops to utilize low-level hardware instructions and registers if they meet certain criteria.
Loop unrolling - (also known as "unwinding") replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization.
Parallelization - divides loop operations over multiple processors where possible.

Interprocedural Analysis and Optimization (IPA)

A technique that allows the use of information across function call boundaries to perform optimizations that would otherwise be unavailable.

Function Inlining

A technique that seeks to reduce function call and return overhead. It:

Is used with functions that are called numerous times from relatively few locations.
Allows a function call to be replaced by a copy of the body of that function.
May create opportunities for other types of optimization
May not be beneficial. Improper use may increase code size and actually result in less efficient code.

6. Batch Scheduling

6.1. Scheduler

The Portable Batch System (PBS) is currently running on Narwhal. It schedules jobs and manages resources and job queues, and can be accessed through the interactive batch environment or by submitting a batch request. PBS is able to manage both single-processor and multiprocessor jobs. The PBS module is automatically loaded for you when you log in.

6.2. Queue Information

The following table describes the PBS queues available on Narwhal:

Queue Descriptions and Limits on Narwhal
Priority	Queue Name	Max Wall Clock Time	Max Cores Per Job	Description
Highest	urgent	24 Hours	16,384	Jobs belonging to DoD HPCMP Urgent Projects
	frontier	168 Hours	65,536	Jobs belonging to DoD HPCMP Frontier Projects
	high	168 Hours	32,768	Jobs belonging to DoD HPCMP High Priority Projects
	debug	30 Minutes	8,192	Time/resource-limited for user testing and debug purposes
	HIE	24 Hours	3,072	Rapid response for interactive work. For more information see the HPC Interactive Environment (HIE) User Guide.
	viz	24 Hours	128	Visualization jobs
	standard	168 Hours	32,768	Standard jobs
	mla	24 Hours	128	Machine Learning Accelerated jobs
	smla	24 Hours	128	Machine Learning Accelerated jobs
	dmla	24 Hours	128	Machine Learning Accelerated jobs
	serial	168 Hours	1	Serial jobs
	bigmem	96 Hours	1,280	Large-memory jobs
	transfer	48 Hours	N/A	Data transfer for user jobs. See the Navy DSRC Archive Guide, section 5.2.
Lowest	background	4 Hours	1,024	User jobs that are not charged against the project allocation

6.3. Interactive Logins

When you log in to Narwhal, you will be running in an interactive shell on a login node. The login nodes provide login access for Narwhal and support such activities as compiling, editing, and general interactive use by all users. Please note the Login Node Abuse policy. The preferred method to run resource intensive executions is to use an interactive batch session.

6.4. Interactive Batch Sessions

An interactive session on a compute node is possible using the PBS qsub command with the "-I" option from a login node. Once PBS has scheduled your request to the specified queue, you will be directly logged into a compute node, and this session can last as long as your requested wall time. For example:

qsub -l select=N1:ncpus=128:mpiprocs=N2 -A Project_ID -q queue_name -l walltime=HHH:MM:SS -I

You must specify the number of nodes requested (N1), the number of processes per node (N2), the desired maximum walltime, your project ID, and a job queue. Valid values for N2 are between 1 and 128.
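
The following sketch fills in the template above with concrete values and checks that N2 is within the valid range. The helper name interactive_qsub_cmd and the example values are hypothetical; the command is printed rather than submitted.

```shell
# interactive_qsub_cmd N1 N2 PROJECT QUEUE WALLTIME
# Print an interactive qsub command line; reject N2 outside 1..128.
interactive_qsub_cmd() {
  n1=$1; n2=$2; project=$3; queue=$4; walltime=$5
  if [ "$n2" -lt 1 ] || [ "$n2" -gt 128 ]; then
    echo "N2 must be between 1 and 128 (got $n2)" >&2
    return 1
  fi
  echo "qsub -l select=${n1}:ncpus=128:mpiprocs=${n2} -A ${project} -q ${queue} -l walltime=${walltime} -I"
}

# Example: 2 nodes, 64 processes per node, 30 minutes in the debug queue.
interactive_qsub_cmd 2 64 Project_ID debug 00:30:00
```

Replace Project_ID with your own project ID before running the printed command.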

Your interactive batch sessions will be scheduled just as normal batch jobs are scheduled depending on the other queued batch jobs, so it may take quite a while. Once your interactive batch shell starts, you can run or debug interactive applications, post-process data, etc.

At this point, you can launch parallel applications on your assigned set of compute nodes by using the mpiexec command. You can also run interactive commands or scripts on this node.

6.5. Batch Request Submission

PBS batch jobs are submitted via the qsub command. The format of this command is:

qsub [ options ] batch_script_file

qsub options may be specified on the command line or embedded in the batch script file by lines beginning with "#PBS".
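
To see how embedded options look in practice, the sketch below writes a small batch script to a temporary file and lists the lines that begin with "#PBS". The script contents (project ID, queue, executable name) are placeholders chosen for illustration only.

```shell
# Write a small, hypothetical batch script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#PBS -A Project_ID
#PBS -q debug
#PBS -l select=1:ncpus=128:mpiprocs=128
#PBS -l walltime=00:30:00
cd $WORKDIR
mpiexec -n 128 ./mympiprog.exe
EOF

# Lines beginning with "#PBS" carry the embedded qsub options.
directives=$(grep '^#PBS' "$script")
count=$(grep -c '^#PBS' "$script")
echo "$directives"
echo "found $count embedded directives"
rm -f "$script"
```

Listing the directives this way is a quick check of what a script will request before you submit it.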

For a more thorough discussion of PBS batch submission on Narwhal, see the Narwhal PBS Guide.

6.6. Batch Resource Directives

Batch resource directives allow you to specify to PBS how your batch jobs should be run and what resources your job requires. Although PBS has many directives, you only need to know a few to run most jobs.

The basic syntax of PBS directives is as follows:

#PBS option[[=]value]

where some options may require values to be included. For example, to start a 64-process job, you would request one node of 128 cores and specify that you will be running 64 processes per node:

#PBS -l select=1:ncpus=128:mpiprocs=64
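
The same reasoning can be sketched for other job sizes. The helper below is hypothetical and simply derives the node count by rounding up the total number of processes divided by the processes per node, assuming standard 128-core nodes.

```shell
# select_directive TOTAL_PROCS PROCS_PER_NODE
# Print the select directive for a job of TOTAL_PROCS MPI processes.
select_directive() {
  total=$1; per_node=$2
  # Ceiling division: a partly filled node still has to be requested.
  nodes=$(( (total + per_node - 1) / per_node ))
  echo "#PBS -l select=${nodes}:ncpus=128:mpiprocs=${per_node}"
}

select_directive 64 64     # the 64-process example above
select_directive 256 128   # 256 processes, one per core, on 2 nodes
```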

The following directives are required for all jobs:

Required PBS Directives
Directive	Value	Description
-A	Project_ID	Name of the project
-q	queue_name	Name of the queue
-`l`	select=N1:ncpus=128:mpiprocs=N2	Standard compute node: N1 = Number of nodes N2 = MPI processes per node (N2 can be between 1 and 128)
-`l`	select=N1:ncpus=128:mpiprocs=N2:ngpus=1	GPU node: N1 = Number of nodes N2 = MPI processes per node (N2 can be between 1 and 128)
-`l`	select=N1:ncpus=128:mpiprocs=N2:bigmem=1	Large-memory node: N1 = Number of nodes N2 = MPI processes per node (N2 can be between 1 and 128)
-l	walltime=HHH:MM:SS	Maximum wall clock time

Optional Directives
Directive	Value	Description
-N	Job Name	Name of the job.
-e	File name	Redirect standard error to the named file.
-o	File name	Redirect standard output to the named file.
-j	oe	Merge standard error and standard output into standard output.
-`l` application	application_name	Identify the application being used.
-`I`		Request an interactive batch shell.
-V		Export all environment variables to the job.
-v	Variable list	Export specific environment variables to the job.

A more complete listing of batch resource directives is available in the Narwhal PBS Guide.

6.7. Launch Commands

On Narwhal, either of the following commands can be used to launch MPI executables from within a batch job:

mpiexec -n #_of_MPI_tasks ./mpijob.exe

aprun -n #_of_MPI_tasks ./mpijob.exe

For OpenMP executables, no launch command is needed.

6.8. Sample Scripts

While it is possible to include all PBS directives at the qsub command line, the preferred method is to embed the PBS directives within the batch request script using "#PBS". The first script, below, is a basic example and contains all of the required directives, some frequently used optional directives, and common script components to run a serial code. The second example starts 256 processes on 2 nodes of 128 cores each, with one MPI process per core. More thorough examples are available in the Narwhal PBS Guide and in the Sample Code Repository ($SAMPLES_HOME) on Narwhal.

The following example is a good starting template for a batch script to run a serial job for one hour:

#!/bin/bash
## The first line (above) specifies your shell
#
# Specify name of the job
#PBS -N serialjob
#
# Append std output to file serialjob.out 
#PBS -o serialjob.out
#
# Append std error to file serialjob.err
#PBS -e serialjob.err
#
# Specify Project ID to be charged (Required)
#PBS -A Project_ID
#
# Request wall clock time of 1 hour (Required)
#PBS -l walltime=01:00:00
#
# Specify queue name (Required)
#PBS -q standard
#
# Specify the number of cores (Required)
#PBS -l select=1:ncpus=1
#
#PBS -S /bin/bash
# Change to the specified directory
cd $WORKDIR
#
# Execute the serial executable on 1 core
./serial_fort.exe
# End of batch job

The first few lines tell PBS to save the standard output and error output to the given files, and to give the job a name. Skipping ahead, we estimate the run-time to be about one hour and know that this is acceptable for the standard batch queue. We need one core in total, so we request one core.

The following example is a good starting template for a batch script to run a parallel (MPI) job for two hours:

#!/bin/bash
## The first line (above) specifies the shell to use for parsing 
## the remaining lines of the batch script.
#
## Required PBS Directives --------------------------------------
#PBS -A Project_ID
#PBS -q standard
#PBS -l select=2:ncpus=128:mpiprocs=128
#PBS -l walltime=02:00:00
#PBS -l application=Application_Name
#
## Optional PBS Directives --------------------------------------
#PBS -N Test_Run_1
#PBS -j oe
#PBS -V
#PBS -S /bin/bash
#
## Execution Block ----------------------------------------------
# Environment Setup
# cd to your personal directory in the scratch file system
cd $WORKDIR
#
# create a job-specific subdirectory based on JOBID and cd to it
JOBID=`echo $PBS_JOBID | cut -d '.' -f 1`
if [ ! -d $JOBID ]; then
  mkdir -p $JOBID
fi
cd $JOBID
#
# Launching
# copy executable from $HOME and submit it
cp $HOME/mympiprog.exe .
mpiexec -n 256 ./mympiprog.exe > mympiprog.out
#
# Clean up
# archive your results
# Using the "here document" syntax, create a job script
# for archiving your data.
cd $WORKDIR
rm -f archive_job
cat > archive_job << END
#!/bin/bash
#PBS -l walltime=06:00:00
#PBS -q transfer
#PBS -A Project_ID
#PBS -l select=1:ncpus=1
#PBS -j oe
#PBS -S /bin/bash
cd $WORKDIR
rsh $ARCHIVE_HOST mkdir $ARCHIVE_HOME/$JOBID
rcp -r $JOBID $ARCHIVE_HOST:$ARCHIVE_HOME/
rsh $ARCHIVE_HOST ls -l $ARCHIVE_HOME/$JOBID
# Remove scratch directory from the file system.
rm -rf $JOBID
END
#
# Submit the archive job script.
qsub archive_job
# End of batch job

The required directives charge the job to a project, select the queue, request the nodes, and set the wall clock limit. We estimate the run-time to be about 2 hours and know that this is acceptable for the standard batch queue. The select line sets the number of nodes and the number of cores and MPI processes per node. This job requests 2 nodes with 128 cores each, for 256 cores in total, and runs one MPI process per core. The optional directives name the job and merge standard error into standard output. The default value for number of cores per node is 128.
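
The job-specific directory step in the script above can also be written with shell parameter expansion instead of cut. This is a minimal sketch, not a replacement for the template: the fallback job ID and the use of $TMPDIR are assumptions made so that it runs outside a batch job, where PBS has not set $PBS_JOBID or $WORKDIR.

```shell
# PBS sets PBS_JOBID inside a job; a hypothetical value is used otherwise.
PBS_JOBID=${PBS_JOBID:-1234567.narwhal-pbs}

# Strip everything from the first "." onward, leaving the numeric job ID.
JOBID=${PBS_JOBID%%.*}

# Use $WORKDIR when it is defined; fall back to a temporary directory.
base=${WORKDIR:-${TMPDIR:-/tmp}}
rundir="$base/$JOBID"

# mkdir -p succeeds whether or not the directory already exists,
# so no separate existence test is needed.
mkdir -p "$rundir" && cd "$rundir" || exit 1
echo "running in $rundir"
```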

Additional examples are available in the Narwhal PBS Guide and in the Sample Code Repository ($SAMPLES_HOME) on Narwhal.

6.9. PBS Commands

The following commands provide the basic functionality for using the PBS batch system:

qsub: Used to submit jobs for batch processing.
qsub [ qsub_options ] my_job_script

qstat: Used to check the status of submitted jobs.
qstat PBS_JOBID ## check one job
qstat -u my_user_name ## check all of user's jobs

qdel: Used to kill queued or running jobs.
qdel PBS_JOBID

A more complete list of PBS commands is available in the Narwhal PBS Guide.

6.10. Determining Time Remaining in a Batch Job

In batch jobs, knowing the time remaining before the workload management system will kill the job enables the user to write restart files or even prepare input for the next job submission. However, adding such capability to an existing source code requires knowing how to query the workload management system and how to parse the resulting output to determine the amount of remaining time.

The DoD HPCMP allocated systems now have the library, WLM_TIME, as an easy way to provide the remaining time in the batch job to C, C++, and Fortran programs. The library can be added to your job using the following:

For C:

#include <wlm_time.h>
void wlm_time_left(long int *seconds_left)

For Fortran:

SUBROUTINE WLM_TIME_LEFT(seconds_left)
INTEGER seconds_left

For C++:

extern "C" {
#include <wlm_time.h>
}
void wlm_time_left(long int *seconds_left)

For simplicity, wall-clock-time remaining is returned as an integer value of seconds.

To simplify usage, a module file defines the process environment, and a pkg-config metadata file defines the necessary compiler and linker options:

For C:

module load wlm_time
$(CC) ctest.c `pkg-config --cflags --libs wlm_time`

For Fortran:

module load wlm_time
$(F90) test.f90 `pkg-config --cflags-only-I --libs wlm_time`

For C++:

module load wlm_time
$(CXX) Ctest.C `pkg-config --cflags --libs wlm_time`

WLM_TIME currently works with PBS. The developers expect that WLM_TIME will continue to provide a uniform interface encapsulating the underlying aspects of the workload management system.

6.11. Advance Reservations

An Advance Reservation Service (ARS) is available on Narwhal for reserving cores for use, starting at a specific date/time, and lasting for a specific number of hours. The ARS is accessible via most modern web browsers at https://reservation.hpc.mil. Authenticated access is required. The ARS User Guide is available on HPC Centers.

7. Software Resources

7.1. Application Software

A complete listing with installed versions can be found on our software page. The general rule for all COTS software packages is that the two latest versions will be maintained on our systems. For convenience, modules are also available for most COTS software packages.

7.2. Useful Utilities

The following utilities are available on Narwhal. For command-line syntax and examples of usage, please see each utility's online man page.

Baseline Configuration Commands and Tools
Name	Description
archive	Perform basic file-handling operations on the archive system
bcmodule	An enhanced version of the standard module command
check_license	Check the status of licenses for HPCMP shared applications
cqstat	Display information about running and pending batch jobs
mpscp	High-performance remote file copy
node_use	Display the amount of free and used memory for login nodes
qflag	Report a problem with a batch job to the HPCMP Help Desk
qhist	Print tracing information for a batch job
qpeek	Display spooled stdout and stderr for an executing batch job
qview	Display information about batch jobs and queues
scampi	Transfer data between systems using multiple streams and sockets
show_queues	Report current batch queue status, usage, and limits
show_storage	Display disk/file usage and quota information
show_usage	Display CPU allocation and usage by subproject
tube	Copy files to a remote system using Kerberos host authentication

7.3. Sample Code Repository

The Sample Code Repository is a directory that contains examples for COTS batch scripts, building and using serial and parallel programs, data management, and accessing and using serial and parallel math libraries. The $SAMPLES_HOME environment variable contains the path to this area and is automatically defined in your login environment. Below is a listing of the examples provided in the Sample Code Repository on Narwhal.

Sample Code Repository on Narwhal
Sub-Directory	Description
Application_Name Use of the application name resource.
application_names	README and list of valid strings for application names intended for use in every PBS script preamble. The HPCMP encourages applications not specifically named in the list to be denoted as "other".
Applications Application-specific examples; interactive job submit scripts; use of the application name resource; software license use.
abaqus	Basic batch script and input deck for an Abaqus application.
ansys	Basic batch script and input deck for an ANSYS application.
cfd++	Basic batch script and input deck for a CFD++ application.
fluent	Basic batch script and input deck for a FLUENT (now ACFD) application.
lsdyna	Basic batch script and input deck for an LS-DYNA application.
openfoam	Sample input files and job script for the OpenFOAM application.
starccm+	Basic batch script and input deck for a STARCCM+ application.
totalview	Instructions on how to use the TotalView debugger to debug MPI code.
Data_Management Archiving and retrieving files; Lustre striping; file searching; $WORKDIR use.
MPSCP_Example	Directory containing a README file giving examples of how to use the mpscp command to transfer files between Narwhal and remote systems.
Postprocess_Example	Sample batch script showing how to submit a transfer queue job at the end of your computation job.
Transfer_Queue_Example	Sample batch script showing how to stage data out after a job executes using the transfer queue.
Parallel_Environment MPI, OpenMP, and hybrid examples; large number of nodes jobs; single-core jobs; large memory jobs; running multiple applications within a single batch job.
Calculate_Prime_MPI	Sample code and scripts for compiling and executing an MPI code.
Hybrid	Simple MPI/OpenMP hybrid example and batch script.
Large_Memory_Jobs	A sample large-memory job script.
MPI_PBS_Examples	Sample PBS job scripts for SGI MPT and IntelMPI codes built with the Intel and GNU compilers.
Multiple_Jobs_per_Node	Sample PBS job scripts for running multiple jobs on the same node.
OpenMP	A simple OpenMP example and batch script.
Programming Basic code compilation; debugging; use of library files; static vs. dynamic linking; Makefiles; Endian conversion.
Core_Files	Provides examples of three core file viewers.
Forge_Example	Using Forge to debug a small example code in an interactive batch job.
GPU_Examples	Several examples demonstrating use of system tools, compilation techniques, and PBS scripts to generate and execute code using the GPGPU accelerators.
Large_Memory_Example	Simple example of how to run a job using Large-Memory nodes.
Memory_Usage	Sample build and script that shows how to determine the amount of memory being used by a process.
Open_Files_Limits	This example discusses the maximum number of simultaneously open files an MPI process may have, and how to adjust the appropriate settings in a PBS job.
SharedObject_Compile_Example	Simple example of creating a shared object (SO) library, compiling against it, and running with it on the compute nodes.
Timers_Fortran	Serial Timers using Fortran Intrinsics f77 and f90/95.
Totalview_Example	Instructions on how to use the TotalView debugger to debug MPI code.
User_Environment Use of modules; customizing the login environment.
Module_Swap_Example	Instructions for using the module swap command.
Environment_Variables	README file describing environment variables set on Narwhal by the Baseline Configuration Team intended to facilitate easier migration and maintenance of project software and supporting scripts.
Workload_Management Basic batch scripting; use of the transfer queue; job arrays; job dependencies; Secure Remote Desktop; job monitoring.
BatchScript_Example	Basic PBS batch script example.
Core_Info_Example	Sample code for generating the MPI process/core or OpenMP thread/core associativity in compute jobs.
Hybrid_Example	Simple MPI/OpenMP hybrid example and batch script.
Interactive_Example	Instructions on how to submit an interactive PBS job.
Job_Array_Example	Instructions and example job script for using job arrays.
Job_Dependencies_Example	Example scripts on how to use PBS job dependencies.
Transfer_Queue_Example	Sample batch script for data transfer.

8. Links to Vendor Documentation
