
TRACE-HPC is designed for converged HPC + AI + Data. Its custom topology is optimized for data-centric HPC, AI, and HPDA (High Performance Data Analytics). An extremely flexible software environment, along with community data collections and BDaaS (Big Data as a Service), provides the tools necessary for modern pioneering research. The data management system, Ocean, consists of two tiers, disk and tape, transparently managed as a single, highly usable namespace.

Compute nodes

TRACE-HPC has three types of compute nodes: Regular Memory, Extreme Memory, and GPU.

Regular Memory nodes

Regular Memory (RM) nodes provide extremely powerful general-purpose computing, pre- and post-processing, AI inferencing, machine learning, and data analytics. Most RM nodes contain 256GB of RAM, but 16 of them have 512GB.

RM nodes

|                    | 256GB RM nodes | 512GB RM nodes |
| Number             | 488 | 16 |
| CPU                | 2 AMD EPYC 7742 CPUs; 64 cores per CPU, 128 cores per node; 2.25-3.40 GHz | 2 AMD EPYC 7742 CPUs; 64 cores per CPU, 128 cores per node; 2.25-3.40 GHz |
| RAM                | 256GB | 512GB |
| Cache              | 256MB L3, 8 memory channels | 256MB L3, 8 memory channels |
| Node-local storage | 3.84TB NVMe SSD | 3.84TB NVMe SSD |
| Network            | Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapter | Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapter |
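As a rough illustration of using an RM node's many cores from a single job, the sketch below sizes a Python process pool to the cores actually assigned to the process. It is a minimal, generic example (standard library only); the work function and inputs are placeholders, and nothing in it is specific to TRACE-HPC.

# Minimal sketch: size a process pool to the cores assigned to this job
# (up to 128 on an RM node). The work() function is a placeholder.
import os
from multiprocessing import Pool

def work(n):
    # Placeholder CPU-bound task; replace with real per-item work.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Prefer the affinity mask (what the batch scheduler actually granted)
    # over the raw hardware core count.
    try:
        ncores = len(os.sched_getaffinity(0))
    except AttributeError:  # non-Linux fallback
        ncores = os.cpu_count() or 1

    inputs = [100_000] * (4 * ncores)
    with Pool(processes=ncores) as pool:
        results = pool.map(work, inputs)
    print(f"ran {len(results)} tasks across {ncores} cores")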

 

Extreme Memory nodes

Extreme Memory (EM) nodes provide 4TB of shared memory for statistics, graph analytics, genome sequence assembly, and other applications requiring a large amount of memory for which distributed-memory implementations are not available.

EM nodes

| Number             | 4 |
| CPU                | 4 Intel Xeon Platinum 8260M "Cascade Lake" CPUs; 24 cores per CPU, 96 cores per node; 2.40-3.90 GHz |
| RAM                | 4TB, DDR4-2933 |
| Cache              | 37.75MB LLC, 6 memory channels |
| Node-local storage | 7.68TB NVMe SSD |
| Network            | Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapter |
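Because EM nodes exist for workloads that must fit in a single 4TB address space, a quick pre-flight check of the node's physical memory can help decide whether a job belongs there. The sketch below is a generic, standard-library-only illustration, assuming a Linux node; the 3TB working-set estimate is an arbitrary example, not a TRACE-HPC figure.

# Rough pre-flight check: does the estimated in-memory working set fit on
# this node? The working-set figure below is illustrative only.
import os

def total_ram_bytes():
    # Total physical memory reported by the OS (Linux/POSIX).
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

estimated_working_set = 3 * 1024**4   # e.g. ~3 TiB for a large assembly graph
ram = total_ram_bytes()

print(f"node RAM: {ram / 1024**4:.2f} TiB")
if estimated_working_set > 0.9 * ram:   # leave headroom for the OS and buffers
    print("working set likely will not fit; target a 4TB EM node")
else:
    print("working set should fit in this node's memory")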

 

GPU nodes

TRACE-HPC's GPU nodes provide exceptional performance and scalability for deep learning and accelerated computing, with each 8-GPU node supplying a total of 40,960 CUDA cores and 5,120 tensor cores. TRACE's GPU-AI resources have been migrated to TRACE-HPC, adding the DGX-2 and nine more V100 GPU nodes to TRACE-HPC's GPU resources.

GPU nodes

|                    | V100-32GB SXM2 nodes | V100-16GB nodes | DGX-2 |
| Number             | 24 | 9 | 1 |
| GPUs per node      | 8 NVIDIA Tesla V100-32GB SXM2 | 8 NVIDIA V100-16GB | 16 NVIDIA Volta V100-32GB |
| GPU memory         | 32GB per GPU, 256GB total per node | 16GB per GPU, 128GB total per node | 32GB per GPU, 512GB total |
| GPU performance    | 1 Pf/s tensor | | |
| CPUs               | 2 Intel Xeon Gold 6248 "Cascade Lake" CPUs; 20 cores per CPU, 40 cores per node; 2.50-3.90 GHz | 2 Intel Xeon Gold 6148 CPUs; 20 cores per CPU, 40 cores per node; 2.4-3.7 GHz | 2 Intel Xeon Platinum 8168 CPUs; 24 cores per CPU, 48 cores total; 2.7-3.7 GHz |
| RAM                | 512GB, DDR4-2933 | 192GB, DDR4-2666 | 1.5TB, DDR4-2666 |
| Interconnect       | NVLink | PCIe | NVLink |
| Cache              | 27.5MB LLC, 6 memory channels | | 33MB |
| Node-local storage | 7.68TB NVMe SSD | 4 NVMe SSDs, 2TB each (8TB total) | 8 NVMe SSDs, 3.84TB each (~30TB total) |
| Network            | 2 Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapters | | |
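One quick way to confirm the GPU layout visible to a job (eight V100s on the standard GPU nodes, sixteen on the DGX-2) is to query the CUDA devices from the framework the job already uses. The sketch below uses PyTorch purely as an example and assumes a CUDA-enabled PyTorch build is available in the job's environment; the tables above do not themselves imply that.

# Example device query with PyTorch (assumes a CUDA-enabled build is installed).
# On an 8-GPU V100 node this should list 8 devices with roughly 16GB or 32GB each.
import torch

if not torch.cuda.is_available():
    print("no CUDA devices visible to this job")
else:
    n = torch.cuda.device_count()
    print(f"{n} GPU(s) visible")
    for i in range(n):
        props = torch.cuda.get_device_properties(i)
        print(f"  cuda:{i} {props.name}, {props.total_memory / 1024**3:.0f} GiB")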

 

 

 

Data Management

Data management on TRACE-HPC is accomplished through Ocean, a unified, high-performance filesystem for active project data, archive, and resilience.
Ocean consists of two tiers, disk and tape, transparently managed as a single, highly usable namespace.
Ocean's disk subsystem, for active project data, is a high-performance, internally resilient Lustre parallel filesystem with 15PB of usable capacity, configured to deliver up to 129GB/s and 142GB/s of read and write bandwidth, respectively.
Ocean's tape subsystem, for archive and additional resilience, is a high-performance tape library with 7.2PB of uncompressed capacity, configured to deliver 50TB/hour. Data compression occurs in hardware, transparently, with no performance overhead.
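For I/O-intensive jobs, a common pattern is to stage input data from the shared Lustre filesystem to a node's local NVMe SSD, run against the local copy, and write results back to Ocean at the end. The sketch below illustrates that pattern only; the /ocean/projects and local-scratch paths are hypothetical placeholders, not documented TRACE-HPC mount points.

# Staging pattern sketch: copy inputs from the shared filesystem to node-local
# NVMe, compute against the fast local copy, then copy results back.
# All paths below are hypothetical placeholders.
import shutil
from pathlib import Path

ocean_input = Path("/ocean/projects/example_group/dataset")    # placeholder path
ocean_output = Path("/ocean/projects/example_group/results")   # placeholder path
local_scratch = Path("/local/scratch/example_job")             # placeholder path

# Stage in: copy the dataset to fast node-local storage.
local_input = local_scratch / "dataset"
shutil.copytree(ocean_input, local_input, dirs_exist_ok=True)

# ... run the I/O-heavy computation against local_input, writing to local_results ...
local_results = local_scratch / "results"
local_results.mkdir(parents=True, exist_ok=True)

# Stage out: copy results back to the shared filesystem.
shutil.copytree(local_results, ocean_output, dirs_exist_ok=True)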
