High Performance computing needs lots of data movement, storage, compute and networking, as well as orchestration and automation.

Data Transfer

  • Snowball, Snowmobile
  • AWS DataSync
  • Direct Connect

Storage

  • Instance attached - EBS io2, instance storage
  • Network storage - S3, EFS,
  • FSx for Lustre - scratch mode or persistent mode

Compute and networking

  • GPU optimized instances
  • EC2 fleets (spot instances and/or spot fleets)
  • Placement groups - typically in a cluster arrangement
  • Placement groups - For grid computing, use distributed work loads that are loosely coupled and don’t require tight communication between nodes; ASG has application here; use partition placement groups

orchestration and automation

  • Batch
  • AWS ParallelCluster - an open source orchestration tool that uses text file instead of CloudFormation.

Networking

  • Bigger MTU (9000 byte) enables higher throughput; Risky outside of VPC;
  • Enhanced networking - includes higher bandwidth, higher PPS, lower latency. This can be achieved using elastic network adapters OR the legacy Intel 82599 VF specification. Use Enhanced Networking instance types!
  • Elastic fabric adapter - uses Message Passing Interface (MPI) standard to bypass Linux OS.