Navy DSRC Introduction and Policy Guide

Table of Contents

1. Introduction
   1.1. Purpose
   1.2. Overview of Supported CTAs
   1.3. Requesting Assistance
   1.4. Obtaining an Account
   1.5. Visitor Information
2. Hardware, Network, and Software
   2.1. High Performance Computing
   2.2. Mass Storage Archive Server (Newton)
   2.3. Network Connectivity
   2.4. Software Environment
3. Data Storage
   3.1. Permanent File Storage
   3.2. Temporary File Storage
   3.3. Archival File Storage
4. Processing Environment
   4.1. Determining the Correct HPC System
   4.2. Processing Environment Overview and Philosophy
   4.3. Job Scheduling/Queuing Environment and Policies
   4.4. Interactive CPU-time Limits
5. Navy DSRC Specific Documentation

1. Introduction

1.1. Purpose

This document provides an overview of the Navy DSRC. It is intended to help users and their Service/Agency Approval Authorities (S/AAAs) determine which systems will best meet specific computational needs.

To contact us with questions, comments, or suggestions about this guide, please visit the Contact Us page for complete contact information.

1.2. Overview of Supported CTAs

The Navy Department of Defense (DoD) Supercomputing Resource Center (Navy DSRC) is organizationally aligned with the Naval Meteorology and Oceanography Command (NAVMETOCCOM) and is collocated with its headquarters (Commander, Naval Meteorology and Oceanography Command - CNMOC) at the John C. Stennis Space Center, MS. NAVMETOCCOM/CNMOC provides oceanographic support to the Department of Defense through a wide range of oceanographic modeling, prediction, and data collection techniques.

The Navy DSRC, formerly the NAVO MSRC, was the second of the four major shared DoD High Performance Computing (HPC) centers to be formed under the auspices of the DoD HPC Modernization Program. Now one of five such centers, the Navy DSRC provides specialized support in the following critical defense computational technology areas (CTAs):

Supported CTAs
CTA Description
CWO Climate/Weather/Ocean Modeling and Simulation
CFD Computational Fluid Dynamics
CSM Computational Structural Mechanics
CCM Computational Chemistry, Biology, and Materials Science
CEA Computational Electromagnetics and Acoustics
ENS Electronics, Networking, and Systems/C4I
SIP Signal/Image Processing
FMS Forces Modeling and Simulation
EQM Environmental Quality Modeling and Simulation
IMT Integrated Modeling and Test Environments
SAS Space and Astrophysical Science

DoD Supercomputing Resource Centers provide DoD scientists and engineers with most of the program's computational resources. Each center supports a full range of centralized systems and services, including vector machines, scalable parallel systems, clustered workstations, DoD scientific visualization resources, and training.

1.3. Requesting Assistance

The HPC Help Desk is available to help users with unclassified problems, issues, or questions. Analysts are on duty 8:00 a.m. - 8:00 p.m. Eastern, Monday - Friday (excluding Federal holidays).

You can contact the Navy DSRC in any of the following ways for after-hours support and for support services not provided by the HPC Help Desk:

  • E-mail: dsrchelp@navydsrc.hpc.mil
  • Phone: 1-800-993-7677 or (228) 688-7677
  • Fax: (228) 688-4356
  • U.S. Mail:
    Navy DoD Supercomputing Resource Center
    1002 Balch Boulevard
    Stennis Space Center, MS 39522-5001

For more detailed contact information, please see the Contact Us page.

1.4. Obtaining an Account

The process of getting an account on the HPC systems at any of the DSRCs begins with getting an account on the HPCMP Portal to the Information Environment, commonly called a "pIE User Account". If you do not yet have a pIE User Account, please visit HPC Centers: Obtaining An Account and follow the instructions there. If you need assistance with any part of this process, please contact the HPC Help Desk at accounts@helpdesk.hpc.mil.

1.5. Visitor Information

If you are planning to visit the Navy DSRC, it is important that you review the instructions on the Planning a Visit page. This page contains important information including pre-trip and on-arrival instructions that you will need to know to ensure that your visit to our center goes smoothly.

2. Hardware, Network, and Software

All HPC systems currently in operation at the Navy DSRC are integrated with the Mass Storage Archive Server and the Defense Research and Engineering Network (DREN) via high-speed networking technologies.

2.1. High Performance Computing

2.1.1. HPE SGI 8600 (Gaffney)

Gaffney is an HPE SGI 8600 system. The login and compute nodes are populated with Intel Xeon Platinum 8168 (Skylake) processors clocked at 2.7 GHz. Gaffney uses the Intel Omni-Path interconnect in a Non-Blocking Fat Tree as its high-speed network for MPI messages and I/O traffic. Gaffney uses Lustre to manage its parallel file system that targets the disk RAID arrays.

Gaffney has 736 compute nodes that share memory only on the node; memory is not shared across the nodes.

Each standard compute node has two 24-core processors (48 cores) sharing 192 GBytes of DDR4 memory, with no user-accessible swap space.

Each large-memory compute node has two 24-core processors (48 cores) sharing 768 GBytes of DDR4 memory, with no user-accessible swap space.

Each GPU compute node, available exclusively via the gpu queue, has two 24-core processors (48 cores) and one NVIDIA Tesla P100 GPU, runs its own Red Hat Enterprise Linux operating system, and shares 384 GBytes of DDR4 memory, with no user-accessible swap space.

Gaffney is rated at 3.05 peak PFLOPS and has 5.5 PBytes (formatted) of parallel disk storage.

Gaffney is intended to be used as a batch-scheduled HPC system. Its login nodes are not to be used for computationally intensive work (e.g., large memory usage, heavy I/O, or long-running executions). All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission.

gaffney.navydsrc.hpc.mil (HPE SGI 8600 - 3.05 PFLOPS)

                        Login Nodes        Compute Nodes
                                           Standard Memory    Large Memory       GPU Accelerated
Total Nodes             8                  704                16                 32
Operating System        RHEL               RHEL               RHEL               RHEL
Cores/Node              48                 48                 48                 48 + 1 GPU (3,584 GPU cores)
Core Type               Intel Xeon         Intel Xeon         Intel Xeon         Intel Xeon Platinum 8168
                        Platinum 8168      Platinum 8168      Platinum 8168      + NVIDIA Tesla P100
Core Speed              2.7 GHz            2.7 GHz            2.7 GHz            2.7 GHz
Memory/Node             384 GBytes         192 GBytes         768 GBytes         384 GBytes + 16 GBytes
Accessible Memory/Node  380 GBytes         180 GBytes         744 GBytes         372 GBytes
Memory Model            Shared on node     Shared on node; distributed across cluster
Interconnect Type       Intel Omni-Path

File Systems on Gaffney
Path                    Capacity      Type
/p/home ($HOME)         346 TBytes    Lustre
/p/work1 ($WORKDIR)     5.5 PBytes    Lustre
/p/work2                111 TBytes    Lustre
/p/work3                350 TBytes    Lustre on SSD

For detailed information on using Gaffney, see the Gaffney User Guide.

2.1.2. HPE SGI 8600 (Koehr)

Koehr is an HPE SGI 8600 system. The login and compute nodes are populated with Intel Xeon Platinum 8168 (Skylake) processors clocked at 2.7 GHz. Koehr uses the Intel Omni-Path interconnect in a Non-Blocking Fat Tree as its high-speed network for MPI messages and I/O traffic. Koehr uses Lustre to manage its parallel file system that targets the disk RAID arrays.

Koehr has 736 compute nodes that share memory only on the node; memory is not shared across the nodes.

Each standard compute node has two 24-core processors (48 cores) sharing 192 GBytes of DDR4 memory, with no user-accessible swap space.

Each large-memory compute node has two 24-core processors (48 cores) sharing 768 GBytes of DDR4 memory, with no user-accessible swap space.

Each GPU compute node, available exclusively via the gpu queue, has two 24-core processors (48 cores) and one NVIDIA Tesla P100 GPU, runs its own Red Hat Enterprise Linux operating system, and shares 384 GBytes of DDR4 memory, with no user-accessible swap space.

Koehr is rated at 3.05 peak PFLOPS and has 5.5 PBytes (formatted) of parallel disk storage.

Koehr is intended to be used as a batch-scheduled HPC system. Its login nodes are not to be used for computationally intensive work (e.g., large memory usage, heavy I/O, or long-running executions). All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission.

koehr.navydsrc.hpc.mil (HPE SGI 8600 - 3.05 PFLOPS)

                        Login Nodes        Compute Nodes
                                           Standard Memory    Large Memory       GPU Accelerated
Total Nodes             8                  704                16                 32
Operating System        RHEL               RHEL               RHEL               RHEL
Cores/Node              48                 48                 48                 48 + 1 GPU (3,584 GPU cores)
Core Type               Intel Xeon         Intel Xeon         Intel Xeon         Intel Xeon Platinum 8168
                        Platinum 8168      Platinum 8168      Platinum 8168      + NVIDIA Tesla P100
Core Speed              2.7 GHz            2.7 GHz            2.7 GHz            2.7 GHz
Memory/Node             384 GBytes         192 GBytes         768 GBytes         384 GBytes + 16 GBytes
Accessible Memory/Node  380 GBytes         180 GBytes         744 GBytes         372 GBytes
Memory Model            Shared on node     Shared on node; distributed across cluster
Interconnect Type       Intel Omni-Path

File Systems on Koehr
Path                    Capacity      Type
/p/home ($HOME)         346 TBytes    Lustre
/p/work1 ($WORKDIR)     5.5 PBytes    Lustre
/p/work2                111 TBytes    Lustre
/p/work3                350 TBytes    Lustre on SSD

For detailed information on using Koehr, see the Koehr User Guide.

2.1.3. Cray XC40 (Conrad)

Conrad is a Cray XC40. The login and compute nodes are populated with Intel Xeon E5-2698v3 (Haswell-EP) processors clocked at 2.3 GHz. Conrad uses a dedicated Cray Aries high-speed network for MPI messages and I/O traffic. Conrad uses Lustre to manage its parallel file system that targets arrays of SAS disk drives.

Conrad has 1,699 compute nodes that share memory only on the node; memory is not shared across the nodes.

Each standard compute node has two 16-core processors that operate under a Cray Linux Environment (CLE), sharing 128 GBytes of DDR3 memory, with no user-accessible swap space.

Each large-memory compute node has two 16-core processors that operate under a Cray Linux Environment (CLE), sharing 512 GBytes of DDR3 memory, with no user-accessible swap space.

Each hybrid node on Conrad, available exclusively via the phi queue, has one 12-core processor with its own Red Hat Enterprise Linux operating system and 64 GBytes of memory, with limited user-accessible swap space. Each hybrid node also contains one Intel Xeon Phi 5120D coprocessor, which has 60 cores and 8 GBytes of internal memory.

Conrad is rated at 2.0 peak PFLOPS and has 2.29 PBytes (formatted) of parallel disk storage.

Conrad is intended to be used as a batch-scheduled HPC system. Its login nodes are not to be used for computationally intensive work (e.g., large memory usage, heavy I/O, or long-running executions). All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission.

conrad.navydsrc.hpc.mil (Cray XC40 - 2.0 PFLOPS)

                        Login Nodes        Compute Nodes
                                           Standard Memory    Large Memory       Phi Accelerated
Total Nodes             6                  1,523              8                  168
Operating System        SLES               Cray Linux Environment
Cores/Node              32                 32                 32                 12 + 1 Phi (60 Phi cores)
Core Type               Intel Xeon         Intel Xeon         Intel Xeon         Intel Xeon E5-2696v2
                        E5-2698v3          E5-2698v3          E5-2698v3          + Intel Xeon Phi 5120D
Core Speed              2.3 GHz            2.3 GHz            2.3 GHz            2.4 GHz + 1.05 GHz
Memory/Node             256 GBytes         128 GBytes         256 GBytes         64 GBytes + 8 GBytes
Accessible Memory/Node  240 GBytes         125 GBytes         246 GBytes         63 GBytes + 7.5 GBytes
Memory Model            Shared on node     Shared on node; distributed across cluster
Interconnect Type       Ethernet           Cray Aries / Dragonfly

File Systems on Conrad
Path                    Capacity      Type
/p/home ($HOME)         113 TBytes    Lustre
/p/work1 ($WORKDIR)     2.0 PBytes    Lustre

For detailed information on using Conrad, see the Conrad User Guide.

2.1.4. Cray XC40 (Gordon)

Gordon is a Cray XC40. The login and compute nodes are populated with Intel Xeon E5-2698v3 (Haswell-EP) processors clocked at 2.3 GHz. Gordon uses a dedicated Cray Aries high-speed network for MPI messages and I/O traffic. Gordon uses Lustre to manage its parallel file system that targets arrays of SAS disk drives.

Gordon has 1,699 compute nodes that share memory only on the node; memory is not shared across the nodes.

Each standard compute node has two 16-core processors that operate under a Cray Linux Environment (CLE), sharing 128 GBytes of DDR3 memory, with no user-accessible swap space.

Each large-memory compute node has two 16-core processors that operate under a Cray Linux Environment (CLE), sharing 256 GBytes of DDR3 memory, with no user-accessible swap space.

Each hybrid node on Gordon, available exclusively via the phi queue, has one 12-core processor with its own Red Hat Enterprise Linux operating system and 64 GBytes of memory, with limited user-accessible swap space. Each hybrid node also contains one Intel Xeon Phi 5120D coprocessor, which has 60 cores and 8 GBytes of internal memory.

Gordon is rated at 2.0 peak PFLOPS and has 2.29 PBytes (formatted) of parallel disk storage.

Gordon is intended to be used as a batch-scheduled HPC system. Its login nodes are not to be used for computationally intensive work (e.g., large memory usage, heavy I/O, or long-running executions). All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission.

gordon.navydsrc.hpc.mil (Cray XC40 - 2.0 PFLOPS)

                        Login Nodes        Compute Nodes
                                           Standard Memory    Large Memory       Phi Accelerated
Total Nodes             6                  1,523              8                  168
Operating System        SLES               Cray Linux Environment
Cores/Node              32                 32                 32                 12 + 1 Phi (60 Phi cores)
Core Type               Intel Xeon         Intel Xeon         Intel Xeon         Intel Xeon E5-2696v2
                        E5-2698v3          E5-2698v3          E5-2698v3          + Intel Xeon Phi 5120D
Core Speed              2.3 GHz            2.3 GHz            2.3 GHz            2.4 GHz + 1.05 GHz
Memory/Node             256 GBytes         128 GBytes         256 GBytes         64 GBytes + 8 GBytes
Accessible Memory/Node  240 GBytes         125 GBytes         246 GBytes         63 GBytes + 7.5 GBytes
Memory Model            Shared on node     Shared on node; distributed across cluster
Interconnect Type       Ethernet           Cray Aries / Dragonfly

File Systems on Gordon
Path                    Capacity      Type
/p/home ($HOME)         113 TBytes    Lustre
/p/work1 ($WORKDIR)     2.0 PBytes    Lustre

For detailed information on using Gordon, see the Gordon User Guide.

2.2. Mass Storage Archive Server (Newton)

Newton, an Oracle T5-4 system, serves as the Resilient Mass Storage Server (RMSS). The system is configured with four 16-core 3.6-GHz processors, 1 TByte of main memory, and over 336 TBytes of hard disk storage. For information on using the archive system, see the Archive User Guide.

2.3. Network Connectivity

The Navy DSRC is a primary node on the Defense Research and Engineering Network III (DREN III). DREN III is a robust, high-speed, low-latency network providing 50-Mbit/sec to 100-Gbit/sec connectivity to Department of Defense High Performance Computing Modernization Program (DoD HPCMP) centers nationwide. We connect to the DREN III Wide Area Network (WAN) via a 10-Gbit/sec circuit linking us to the DREN III 100-Gbit/sec backbone WAN infrastructure.

The Navy DSRC Local Area Network (LAN), a 40-Gbit/sec fault-tolerant backbone infrastructure, provides primary connectivity to the Navy DSRC HPC and mass storage assets. Currently, all HPCs and mass storage systems utilize 10-Gbit/sec connections to the Navy DSRC LAN. Users of the Navy DSRC assets are able to take advantage of this high-performance connectivity for interactive and data transfer functions.

2.4. Software Environment

All Navy DSRC systems run Linux-based operating systems with vendor-specific enhancements. A large variety of compiler environments, math libraries, programming tools, and third-party analysis applications are available on the DSRC systems. A short example of browsing and loading this software follows the table below.

HPC Software Listings
System                   Software Listing
HPE SGI 8600 (Gaffney)   https://www.navydsrc.hpc.mil/software/index.html?sys=Gaffney
HPE SGI 8600 (Koehr)     https://www.navydsrc.hpc.mil/software/index.html?sys=Koehr
Cray XC40 (Conrad)       https://www.navydsrc.hpc.mil/software/index.html?sys=Conrad
Cray XC40 (Gordon)       https://www.navydsrc.hpc.mil/software/index.html?sys=Gordon
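
The commands below are a minimal sketch of how installed software is typically browsed and loaded, assuming the environment-modules tools commonly provided on HPCMP systems; the module name shown is illustrative, and the exact names are listed on each system's software page and in its User Guide.

  > module avail                # list the compiler, library, and application modules on the system
  > module load gcc             # load a module into your environment (name shown is illustrative)
  > module list                 # confirm which modules are currently loaded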

3. Data Storage

The Navy DSRC data storage consists of local home directories on each system, temporary disk storage on each system, and long-term storage on the Resilient Mass Storage Server (RMSS). Files stored on the RMSS are subject to migration to off-line status, which is controlled by Sun's Storage and Archive Manager/Quick File System (SAM/QFS) software.

3.1. Permanent File Storage

Users are allocated a home directory (referenced locally by the $HOME environment variable) on each Navy DSRC system with 1 GByte of non-migrated storage. $HOME is not backed up by the Center; therefore, users are responsible for maintaining backup copies of any files in this directory.
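
As a minimal sketch using standard Linux commands (file and directory names are placeholders), you can check your usage against the 1-GByte allocation and bundle important files for transfer to the archive server described in Section 3.3:

  > du -sh $HOME                                       # show how much of your home allocation is in use
  > tar czf $WORKDIR/home_backup.tar.gz -C $HOME .     # bundle home directory contents for archiving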

3.2. Temporary File Storage

3.2.1. $WORKDIR

Each Navy DSRC system is configured with a large amount of high-speed disk storage presented as the $WORKDIR file system. $WORKDIR is the globally accessible, high-speed working storage used primarily for interactive and batch processing; batch jobs typically use large amounts of temporary space. There are no limits on the size of individual files. Users are responsible for managing their own files in the $WORKDIR areas. The $WORKDIR file system is not backed up by the Center, so users are responsible for maintaining backup copies of any files in the temporary file system. Any files older than 21 days are subject to the purge process. Users can access their temporary storage by using the $WORKDIR environment variable. A usage sketch follows the table below.

Temporary Space Allocations on HPC Systems
System                   $WORKDIR Allocation
HPE SGI 8600 (Gaffney)   10 TBytes
HPE SGI 8600 (Koehr)     10 TBytes
Cray XC40 (Conrad)       10 TBytes
Cray XC40 (Gordon)       10 TBytes
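
The following is a minimal usage sketch (directory and file names are placeholders): create a job-specific directory under $WORKDIR, run from there, and copy any results you want to keep to the CWFS (Section 3.2.2) or the archive server (Section 3.3) before the 21-day purge window expires.

  > mkdir -p $WORKDIR/my_job              # create a job-specific working directory
  > cp $HOME/inputs/* $WORKDIR/my_job     # stage input files from your home directory
  > cd $WORKDIR/my_job                    # run your job from this directory
  > cp results.dat $CENTER                # copy results you want to keep off $WORKDIR
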
3.2.2. $CENTER

The Center-Wide File System (CWFS) provides file storage that is accessible from the Gaffney, Koehr, Conrad, and Gordon login nodes and from the HPC Portal. The CWFS permits file transfers between the HPC systems using simple Linux commands. Each user has their own directory in the CWFS. The name of your CWFS directory may vary between machines and between centers, but the environment variable $CENTER will always refer to this directory. The CWFS was chartered by the HPCMP to allow its user community to store and keep data online for 120 days that might otherwise be purged from the HPC systems. This allows users greater flexibility for additional data analysis. The current scrubber policy for the Navy DSRC adheres to a 120-day retention requirement for $CENTER. Users storing data on the CWFS are encouraged to archive their data to long-term storage if more than 120 days are required.

The example below shows how to copy a file from your work directory on an HPC login node to $CENTER on the CWFS, which is mounted on the login nodes of all four HPC systems.

While logged into one of the HPC systems, copy your file from your HPC work directory to the CWFS:

> cp $WORKDIR/filename $CENTER

CWFS File System Scrubber Policy
File System          Percentage Full    Retention Period
/p/cwfs ($CENTER)    <= 50%             No limit on retained data
/p/cwfs ($CENTER)    50% < x <= 99%     120 Days

3.3. Archival File Storage

All of our HPC systems have access to an online archival mass storage system that provides long-term, petascale storage for users' files on a robotic tape library. A 70-TByte disk cache sits in front of the tape file system and temporarily holds files while they are being transferred to or from tape.

The environment variables $ARCHIVE_HOST and $ARCHIVE_HOME are automatically set for you. $ARCHIVE_HOST can be used to reference the archive server, and $ARCHIVE_HOME can be used to reference your archive directory on the server. These can be used when transferring files to/from archive. For information on using the archive system, see the Archive User Guide.
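
As a minimal sketch (the file name is a placeholder, and the preferred transfer tools and any authentication requirements are described in the Archive User Guide), standard remote-copy commands can move files to and from your archive directory:

  > scp $WORKDIR/results.tar.gz $ARCHIVE_HOST:$ARCHIVE_HOME/     # send a file to your archive directory
  > scp $ARCHIVE_HOST:$ARCHIVE_HOME/results.tar.gz $WORKDIR/     # retrieve a file from the archive
  > ssh $ARCHIVE_HOST ls -l $ARCHIVE_HOME                        # list the contents of your archive directory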

4. Processing Environment

4.1. Determining the Correct HPC System

Determining the correct HPC System for your needs can be a complex task. The following are just a few of the factors that might influence your choice:

4.1.1. Software Availability

If your work depends upon a specific Commercial Off-The-Shelf (COTS) application, you can verify its availability on any system in the HPCMP by checking the Consolidated Software List. Software information for Navy DSRC systems is also available on our local software page. If you cannot find the application you need, contact the HPC Help Desk for assistance.

4.1.2. Hardware Requirements

To ensure that your jobs will have access to sufficient cores and memory to run as needed, you can review the hardware specifications on our Hardware page. Additional details are available in each of the HPC User Guides, available from the Documentation page.

4.1.3. Queue Limits

If your jobs require exceptionally long run times or if you need an exceptionally large number of cores, you should verify that queue limits on the system you choose allow both the number of cores and run time that you need. To check this, see our Queue Summary page.

4.2. Processing Environment Overview and Philosophy

The Navy DSRC provides both an interactive and a batch submission environment. Batch queues are available on all of the systems, and the batch environment is the primary environment for most user work. All of the HPC systems at the Navy DSRC use the PBS batch queue system.

The batch queue environments allow users to submit, monitor, and terminate their own batch jobs. This capability is intended for jobs requiring large amounts of memory and/or CPU time that generally run for many hours. In the batch environment, the user submits a job either from the command line or through a shell script. Resource requirements (e.g., CPU time and number of processors) and runtime parameters (e.g., output file redirection) can be specified on the command line or embedded in the shell script for the batch job to be executed.
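
The script below is a minimal sketch of a PBS batch job; the project ID, queue, resource-selection syntax, job name, and executable are placeholders, and the exact directives and MPI launch command for each system are given in its User Guide.

  #!/bin/bash
  #PBS -A Project_ID                        # project/allocation to charge (placeholder)
  #PBS -q standard                          # queue name (see the queue tables in Section 4.3)
  #PBS -l select=2:ncpus=48:mpiprocs=48     # example resource request: 2 nodes, 48 cores each (assumed syntax)
  #PBS -l walltime=04:00:00                 # maximum wall clock time requested
  #PBS -N my_job                            # job name (placeholder)

  cd $WORKDIR/my_job                        # run from your temporary work area
  mpiexec -n 96 ./my_application            # launch the executable (launch command varies by system)

Submit the script and monitor the job with the standard PBS commands:

  > qsub my_job.pbs                         # submit the batch job
  > qstat -u $USER                          # check the status of your jobs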

4.3. Job Scheduling/Queuing Environment and Policies

4.3.1. HPE SGI 8600 (Gaffney) Queue Usage Policies
Summary of Queues on the HPE SGI 8600 - Gaffney
(listed in order of decreasing priority)
Priority   Queue Name   Job Class    Max Wall Clock Time   Max Cores/Job   Comments
Highest    urgent       Urgent       24 Hours              768             Designated urgent projects by DoD HPCMP
           frontier     Frontier     168 Hours             19,200          Frontier projects only
           high         High         168 Hours             15,840          Designated high-priority projects by service/agency
           debug        Debug        30 Minutes            2,400           User diagnostic jobs
           standard     Standard     168 Hours             8,168           Normal priority user jobs
           gpu          N/A          24 Hours              48              GPU-accelerated jobs
           transfer     N/A          48 Hours              N/A             Data transfer jobs
           bigmem       N/A          96 Hours              768             Large-memory jobs
Lowest     background   Background   4 Hours               1,200           User jobs that will not be charged against the project allocation
4.3.2. HPE SGI 8600 (Koehr) Queue Usage Policies
Summary of Queues on the HPE SGI 8600 - Koehr
(listed in order of decreasing priority)
Priority   Queue Name   Job Class    Max Wall Clock Time   Max Cores/Job   Comments
Highest    urgent       Urgent       24 Hours              768             Designated urgent projects by DoD HPCMP
           frontier     Frontier     168 Hours             19,200          Frontier projects only
           high         High         168 Hours             15,840          Designated high-priority projects by service/agency
           debug        Debug        30 Minutes            2,400           User diagnostic jobs
           standard     Standard     168 Hours             8,168           Normal priority user jobs
           gpu          N/A          24 Hours              48              GPU-accelerated jobs
           transfer     N/A          48 Hours              N/A             Data transfer jobs
           bigmem       N/A          96 Hours              768             Large-memory jobs
Lowest     background   Background   4 Hours               1,200           User jobs that will not be charged against the project allocation
4.3.3. Cray XC40 (Conrad) Queue Usage Policies
Summary of Queues on the Cray XC40 - Conrad
(listed in order of decreasing priority)
Priority   Queue Name   Job Class    Max Wall Clock Time   Max Cores/Job   Comments
Highest    urgent       Urgent       24 Hours              768             Designated urgent projects by DoD HPCMP
           frontier     Frontier     168 Hours             16,000          Frontier projects only
           high         High         168 Hours             16,000          Designated high-priority projects by service/agency
           debug        Debug        30 Minutes            3,072           User diagnostic jobs
           standard     Standard     168 Hours             8,000           Normal priority user jobs
           phi          N/A          168 Hours             2,376           Phi-accelerated jobs
           bigmem       N/A          24 Hours              224             Large-memory jobs
           transfer     N/A          48 Hours              N/A             Data transfer jobs
Lowest     background   Background   4 Hours               2,048           User jobs that will not be charged against the project allocation
4.3.4. Cray XC40 (Gordon) Queue Usage Policies
Summary of Queues on the Cray XC40 - Gordon
(listed in order of decreasing priority)
Priority   Queue Name   Job Class    Max Wall Clock Time   Max Cores/Job   Comments
Highest    urgent       Urgent       24 Hours              768             Designated urgent projects by DoD HPCMP
           frontier     Frontier     168 Hours             16,000          Frontier projects only
           high         High         168 Hours             16,000          Designated high-priority projects by service/agency
           debug        Debug        30 Minutes            3,072           User diagnostic jobs
           standard     Standard     168 Hours             8,000           Normal priority user jobs
           phi          N/A          168 Hours             4,032           Phi-accelerated jobs
           bigmem       N/A          24 Hours              224             Large-memory jobs
           transfer     N/A          48 Hours              N/A             Data transfer jobs
Lowest     background   Background   4 Hours               768             User jobs that will not be charged against the project allocation
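
For example, a job can be directed to one of the special-purpose queues above either on the qsub command line or with a #PBS -q directive in the batch script (a minimal sketch; the script name and resource options are illustrative):

  > qsub -q debug -l walltime=00:30:00 my_job.pbs     # short diagnostic run in the debug queue
  > qsub -q gpu my_job.pbs                            # GPU-accelerated job (Gaffney and Koehr only)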

4.4. Interactive CPU-time Limits

The Navy DSRC has implemented a 15-minute (900-second) interactive processing limit on login nodes for processes running outside of the batch scheduler. This limit also applies to systems that do not have a batch scheduler installed. If you run an application on a login node, the application is allowed to accrue 900 seconds of CPU time (not wall-clock time) before being terminated. This policy is in place to protect interactive access for all users. A sketch of requesting an interactive batch session instead is shown after the table below.

Interactive CPU-Time Limits
System                   CPU Time
HPE SGI 8600 (Gaffney)   15 Minutes
HPE SGI 8600 (Koehr)     15 Minutes
Cray XC40 (Conrad)       15 Minutes
Cray XC40 (Gordon)       15 Minutes
Oracle T5-4 (Newton)     15 Minutes
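
If you need more than 15 minutes of interactive CPU time, request an interactive batch session on a compute node rather than working on a login node. The command below is a minimal sketch; the queue and resource options are illustrative.

  > qsub -I -q debug -l walltime=00:30:00     # request an interactive session through the batch scheduler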

5. Navy DSRC Specific Documentation

Online documentation and information can be found on the Navy DSRC Web site, in the message of the day (MOTD) displayed when logging on to any system, and in manual pages accessible via the man command.