TRACE-HPC is designed for converged HPC + AI + Data. Its custom topology is optimized for data-centric HPC, AI, and HPDA (High Performance Data Analytics). An extremely flexible software environment, along with community data collections and BDaaS (Big Data as a Service), provides the tools necessary for modern pioneering research. The data management system, Ocean, consists of two tiers, disk and tape, transparently managed as a single, highly usable namespace.
Compute nodes
TRACE-HPC has three types of compute nodes: "Regular Memory", "Extreme Memory", and "GPU".
Regular Memory nodes
Regular Memory (RM) nodes provide extremely powerful general-purpose computing, pre- and post-processing, AI inferencing, machine learning, and data analytics. Most RM nodes contain 256GB of RAM, but 16 of them have 512GB.
| RM nodes | 256GB nodes | 512GB nodes |
|---|---|---|
| Number | 488 | 16 |
| CPU | 2 AMD EPYC 7742 CPUs; 64 cores per CPU, 128 cores per node; 2.25-3.40 GHz | 2 AMD EPYC 7742 CPUs; 64 cores per CPU, 128 cores per node; 2.25-3.40 GHz |
| RAM | 256GB | 512GB |
| Cache | 256MB L3, 8 memory channels | 256MB L3, 8 memory channels |
| Node-local storage | 3.84TB NVMe SSD | 3.84TB NVMe SSD |
| Network | Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapter | Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapter |
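All 128 cores and the full 256GB or 512GB of RAM are visible to a single-node job. As a minimal sketch (assuming a Linux node; the operating system is not specified on this page), the following Python reports the core count and total RAM so you can tell which RM configuration a job landed on:

```python
# Minimal sketch: report core count and RAM on the current node.
# Assumes Linux (/proc/meminfo); on an RM node, expect 128 cores and
# roughly 256 or 512 GiB of memory.
import os

def mem_total_gib() -> float:
    """Read MemTotal from /proc/meminfo (reported in kB) and convert to GiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / 2**20
    raise RuntimeError("MemTotal not found")

print(f"cores: {os.cpu_count()}, RAM: {mem_total_gib():.0f} GiB")
```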
Extreme Memory nodes
Extreme Memory (EM) nodes provide 4TB of shared memory for statistics, graph analytics, genome sequence assembly, and other applications requiring a large amount of memory for which distributed-memory implementations are not available.
| EM nodes | |
|---|---|
| Number | 4 |
| CPU | 4 Intel Xeon Platinum 8260M "Cascade Lake" CPUs; 24 cores per CPU, 96 cores per node; 2.40-3.90 GHz |
| RAM | 4TB, DDR4-2933 |
| Cache | 37.75MB LLC, 6 memory channels |
| Node-local storage | 7.68TB NVMe SSD |
| Network | Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapter |
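EM nodes target workloads in which the entire dataset must live in a single node's memory. The following Python is an illustrative sketch of that single-node, shared-memory pattern (array size and worker count are small placeholders, not tuned for these nodes): several processes scan one in-RAM array via multiprocessing.shared_memory rather than partitioning it across distributed-memory ranks.

```python
# Illustrative shared-memory pattern: workers attach to one large in-RAM array
# by name instead of each receiving a copy. Sizes here are tiny placeholders;
# on an EM node the array could grow toward the 4TB of RAM.
from multiprocessing import Pool, shared_memory
import numpy as np

def partial_sum(args):
    name, length, start, stop = args
    shm = shared_memory.SharedMemory(name=name)           # attach, no copy
    data = np.ndarray((length,), dtype=np.float64, buffer=shm.buf)
    total = float(data[start:stop].sum())
    shm.close()
    return total

if __name__ == "__main__":
    n = 10_000_000                                         # ~80MB placeholder
    shm = shared_memory.SharedMemory(create=True, size=n * 8)
    data = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
    data[:] = 1.0

    step = 2_500_000
    chunks = [(shm.name, n, i, min(i + step, n)) for i in range(0, n, step)]
    with Pool(processes=4) as pool:
        print("sum =", sum(pool.map(partial_sum, chunks)))

    shm.close()
    shm.unlink()
```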
GPU nodes
TRACE-HPC's GPU nodes provide exceptional performance and scalability for deep learning and accelerated computing; each 8-GPU node supplies a total of 40,960 CUDA cores and 5,120 tensor cores. TRACE's GPU-AI resources have been migrated to TRACE-HPC, adding the DGX-2 and nine more V100 GPU nodes to TRACE-HPC's GPU resources.
| GPU nodes | V100-32GB SXM2 nodes | V100-16GB nodes | DGX-2 |
|---|---|---|---|
| Number | 24 | 9 | 1 |
| GPUs per node | 8 NVIDIA Tesla V100-32GB SXM2 | 8 NVIDIA V100-16GB | 16 NVIDIA Volta V100-32GB |
| GPU memory | 32GB per GPU, 256GB total per node | 16GB per GPU, 128GB total per node | 32GB per GPU, 512GB total per node |
| GPU performance | 1 Pf/s tensor | | |
| CPUs | 2 Intel Xeon Gold 6248 "Cascade Lake" CPUs; 20 cores per CPU, 40 cores per node; 2.50-3.90 GHz | 2 Intel Xeon Gold 6148 CPUs; 20 cores per CPU, 40 cores per node; 2.4-3.7 GHz | 2 Intel Xeon Platinum 8168 CPUs; 24 cores per CPU, 48 cores per node; 2.7-3.7 GHz |
| RAM | 512GB, DDR4-2933 | 192GB, DDR4-2666 | 1.5TB, DDR4-2666 |
| Interconnect | NVLink | PCIe | NVLink |
| Cache | 27.5MB LLC, 6 memory channels | | 33MB |
| Node-local storage | 7.68TB NVMe SSD | 4 NVMe SSDs, 2TB each (8TB total) | 8 NVMe SSDs, 3.84TB each (~30TB total) |
| Network | 2 Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapters | | |
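The V100's tensor cores, which underlie the "1 Pf/s tensor" figure in the table, are engaged by reduced-precision math. As a minimal sketch (assuming PyTorch is available in the software environment; this page does not list specific packages), the following Python enumerates a node's GPUs and runs an FP16 matrix multiply, the kind of operation tensor cores accelerate:

```python
# Minimal sketch (assumes PyTorch is installed): list the node's GPUs and run
# a small FP16 matmul, the data type that V100 tensor cores accelerate.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")

    a = torch.randn(4096, 4096, dtype=torch.float16, device="cuda:0")
    b = torch.randn(4096, 4096, dtype=torch.float16, device="cuda:0")
    c = a @ b
    print("FP16 matmul result shape:", tuple(c.shape))
else:
    print("No CUDA device visible; request a GPU node for this sketch.")
```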
Data Management
Data management on TRACE-HPC is handled by Ocean, a unified, high-performance filesystem for active project data, archive, and resilience.
Ocean consists of two tiers, disk and tape, transparently managed as a single, highly usable namespace.
Ocean's disk subsystem, for active project data, is a high-performance, internally resilient Lustre parallel filesystem with 15PB of usable capacity, configured to deliver up to 129GB/s and 142GB/s of read and write bandwidth, respectively.
Ocean's tape subsystem, for archive and additional resilience, is a high-performance tape library with 7.2PB of uncompressed capacity, configured to deliver 50TB/hour. Data compression occurs in hardware, transparently, with no performance overhead.
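As a rough planning aid, the bandwidth figures above convert directly into transfer-time estimates. The sketch below uses a hypothetical 100TB dataset; actual throughput depends on striping, file counts, and system load.

```python
# Back-of-envelope transfer times from the rated figures on this page
# (129 GB/s read, 142 GB/s write, 50 TB/hour to tape). Real throughput varies.
def hours_at_gbs(size_tb: float, rate_gbs: float) -> float:
    """Hours to move size_tb terabytes at rate_gbs gigabytes per second."""
    return size_tb * 1000 / rate_gbs / 3600

dataset_tb = 100  # hypothetical project dataset
print(f"Read from disk @ 129 GB/s : {hours_at_gbs(dataset_tb, 129):.2f} h")
print(f"Write to disk @ 142 GB/s  : {hours_at_gbs(dataset_tb, 142):.2f} h")
print(f"Archive to tape @ 50 TB/h : {dataset_tb / 50:.2f} h")
```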