TRACE-HPC is designed for converged HPC + AI + Data. Its custom topology is optimized for data-centric HPC, AI, and HPDA (High Performance Data Analytics). An extremely flexible software environment, along with community data collections and BDaaS (Big Data as a Service), provides the tools necessary for modern pioneering research. The data management system, Ocean, consists of two tiers, disk and tape, transparently managed as a single, highly usable namespace.
Compute nodes
TRACE-HPC has three types of compute nodes: "Regular Memory", "Extreme Memory", and "GPU".
Regular Memory nodes
Regular Memory (RM) nodes provide extremely powerful general-purpose computing, pre- and post-processing, AI inferencing, machine learning, and data analytics. Most RM nodes contain 256GB of RAM, but 16 of them have 512GB.
| RM nodes | 256GB nodes | 512GB nodes |
|---|---|---|
| Number | 488 | 16 |
| CPU | 2 AMD EPYC 7742 CPUs; 64 cores per CPU, 128 cores per node; 2.25-3.40 GHz | 2 AMD EPYC 7742 CPUs; 64 cores per CPU, 128 cores per node; 2.25-3.40 GHz |
| RAM | 256GB | 512GB |
| Cache | 256MB L3, 8 memory channels | 256MB L3, 8 memory channels |
| Node-local storage | 3.84TB NVMe SSD | 3.84TB NVMe SSD |
| Network | Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapter | Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapter |
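All 128 cores and the full 256GB or 512GB of RAM are visible to a single-node job. As a minimal sketch (assuming a Linux node; the operating system is not specified on this page), the following Python reports the core count and total RAM so you can tell which RM configuration a job landed on:

```python
# Minimal sketch: report core count and RAM on the current node.
# Assumes Linux (/proc/meminfo); on an RM node, expect 128 cores and
# roughly 256 or 512 GiB of memory.
import os

def mem_total_gib() -> float:
    """Read MemTotal from /proc/meminfo (reported in kB) and convert to GiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / 2**20
    raise RuntimeError("MemTotal not found")

print(f"cores: {os.cpu_count()}, RAM: {mem_total_gib():.0f} GiB")
```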
Extreme Memory nodes
Extreme Memory (EM) nodes provide 4TB of shared memory for statistics, graph analytics, genome sequence assembly, and other applications requiring a large amount of memory for which distributed-memory implementations are not available.
| EM nodes | |
|---|---|
| Number | 4 |
| CPU | 4 Intel Xeon Platinum 8260M "Cascade Lake" CPUs; 24 cores per CPU, 96 cores per node; 2.40-3.90 GHz |
| RAM | 4TB, DDR4-2933 |
| Cache | 37.75MB LLC, 6 memory channels |
| Node-local storage | 7.68TB NVMe SSD |
| Network | Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapter |
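EM nodes target workloads in which the entire dataset must live in a single node's memory. The following Python is an illustrative sketch of that single-node, shared-memory pattern (array size and worker count are small placeholders, not tuned for these nodes): several processes scan one in-RAM array via multiprocessing.shared_memory rather than partitioning it across distributed-memory ranks.

```python
# Illustrative shared-memory pattern: workers attach to one large in-RAM array
# by name instead of each receiving a copy. Sizes here are tiny placeholders;
# on an EM node the array could grow toward the 4TB of RAM.
from multiprocessing import Pool, shared_memory
import numpy as np

def partial_sum(args):
    name, length, start, stop = args
    shm = shared_memory.SharedMemory(name=name)           # attach, no copy
    data = np.ndarray((length,), dtype=np.float64, buffer=shm.buf)
    total = float(data[start:stop].sum())
    shm.close()
    return total

if __name__ == "__main__":
    n = 10_000_000                                         # ~80MB placeholder
    shm = shared_memory.SharedMemory(create=True, size=n * 8)
    data = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
    data[:] = 1.0

    step = 2_500_000
    chunks = [(shm.name, n, i, min(i + step, n)) for i in range(0, n, step)]
    with Pool(processes=4) as pool:
        print("sum =", sum(pool.map(partial_sum, chunks)))

    shm.close()
    shm.unlink()
```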
GPU nodes
TRACE-HPC's GPU nodes provide exceptional performance and scalability for deep learning and accelerated computing; each 8-GPU node supplies a total of 40,960 CUDA cores and 5,120 tensor cores. TRACE's GPU-AI resources have been migrated to TRACE-HPC, adding the DGX-2 and nine more V100 GPU nodes to TRACE-HPC's GPU resources.
| GPU nodes | V100-32GB SXM2 nodes | V100-16GB nodes | DGX-2 |
|---|---|---|---|
| Number | 24 | 9 | 1 |
| GPUs per node | 8 NVIDIA Tesla V100-32GB SXM2 | 8 NVIDIA V100-16GB | 16 NVIDIA Volta V100-32GB |
| GPU memory | 32GB per GPU, 256GB total per node | 16GB per GPU, 128GB total per node | 32GB per GPU, 512GB total per node |
| GPU performance | 1 Pf/s tensor | | |
| CPUs | 2 Intel Xeon Gold 6248 "Cascade Lake" CPUs; 20 cores per CPU, 40 cores per node; 2.50-3.90 GHz | 2 Intel Xeon Gold 6148 CPUs; 20 cores per CPU, 40 cores per node; 2.4-3.7 GHz | 2 Intel Xeon Platinum 8168 CPUs; 24 cores per CPU, 48 cores per node; 2.7-3.7 GHz |
| RAM | 512GB, DDR4-2933 | 192GB, DDR4-2666 | 1.5TB, DDR4-2666 |
| Interconnect | NVLink | PCIe | NVLink |
| Cache | 27.5MB LLC, 6 memory channels | | 33MB |
| Node-local storage | 7.68TB NVMe SSD | 4 NVMe SSDs, 2TB each (8TB total) | 8 NVMe SSDs, 3.84TB each (~30TB total) |
| Network | 2 Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapters | | |
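The V100's tensor cores, which underlie the "1 Pf/s tensor" figure in the table, are engaged by reduced-precision math. As a minimal sketch (assuming PyTorch is available in the software environment; this page does not list specific packages), the following Python enumerates a node's GPUs and runs an FP16 matrix multiply, the kind of operation tensor cores accelerate:

```python
# Minimal sketch (assumes PyTorch is installed): list the node's GPUs and run
# a small FP16 matmul, the data type that V100 tensor cores accelerate.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")

    a = torch.randn(4096, 4096, dtype=torch.float16, device="cuda:0")
    b = torch.randn(4096, 4096, dtype=torch.float16, device="cuda:0")
    c = a @ b
    print("FP16 matmul result shape:", tuple(c.shape))
else:
    print("No CUDA device visible; request a GPU node for this sketch.")
```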
Data Management
Data management on TRACE-HPC is handled by Ocean, a unified, high-performance filesystem for active project data, archive, and resilience.
Ocean consists of two tiers, disk and tape, transparently managed as a single, highly usable namespace.
Ocean's disk subsystem, for active project data, is a high-performance, internally resilient Lustre parallel filesystem with 15PB of usable capacity, configured to deliver up to 129GB/s and 142GB/s of read and write bandwidth, respectively.
Ocean's tape subsystem, for archive and additional resilience, is a high-performance tape library with 7.2PB of uncompressed capacity, configured to deliver 50TB/hour. Data compression occurs in hardware, transparently, with no performance overhead.
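As a rough planning aid, the bandwidth figures above convert directly into transfer-time estimates. The sketch below uses a hypothetical 100TB dataset; actual throughput depends on striping, file counts, and system load.

```python
# Back-of-envelope transfer times from the rated figures on this page
# (129 GB/s read, 142 GB/s write, 50 TB/hour to tape). Real throughput varies.
def hours_at_gbs(size_tb: float, rate_gbs: float) -> float:
    """Hours to move size_tb terabytes at rate_gbs gigabytes per second."""
    return size_tb * 1000 / rate_gbs / 3600

dataset_tb = 100  # hypothetical project dataset
print(f"Read from disk @ 129 GB/s : {hours_at_gbs(dataset_tb, 129):.2f} h")
print(f"Write to disk @ 142 GB/s  : {hours_at_gbs(dataset_tb, 142):.2f} h")
print(f"Archive to tape @ 50 TB/h : {dataset_tb / 50:.2f} h")
```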