High-performance computing (HPC) needs a lot of data movement, storage, compute, and networking, as well as orchestration and automation.
Data Transfer
- Snowball, Snowmobile
- AWS DataSync
- Direct Connect
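Choosing between a network transfer (DataSync over Direct Connect) and a physical device (Snowball) usually comes down to how long the network copy would take. A rough sketch of that arithmetic follows; the function name and the 80% link utilization default are illustrative assumptions, not AWS figures.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Estimate how many days a network transfer takes.

    data_tb     -- data set size in terabytes (decimal, 1 TB = 1e12 bytes)
    link_gbps   -- nominal link speed in gigabits per second
    utilization -- assumed fraction of the link actually usable (hypothetical default)
    """
    total_bits = data_tb * 1e12 * 8
    effective_bits_per_second = link_gbps * 1e9 * utilization
    seconds = total_bits / effective_bits_per_second
    return seconds / 86_400  # seconds per day
```

Under these assumptions, 100 TB over a 1 Gbps link takes roughly 11.6 days, which is the kind of result that makes shipping a device worth considering.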
Storage
- Instance attached - EBS io2, instance store
- Network storage - S3, EFS
- FSx for Lustre - scratch mode or persistent mode
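The scratch versus persistent choice for FSx for Lustre can be sketched as a small helper that builds request parameters. This is a minimal sketch, not a definitive configuration: the helper name, the `long_lived` flag, and the throughput value are assumptions, and the parameter names should be checked against the current FSx API reference before use. The block only builds a dictionary and makes no AWS calls.

```python
def lustre_request(capacity_gib: int, subnet_id: str, long_lived: bool) -> dict:
    """Build keyword arguments for an FSx for Lustre create request.

    long_lived=False -> scratch deployment (temporary data, no replication)
    long_lived=True  -> persistent deployment (longer-term data)
    """
    lustre_config = {
        "DeploymentType": "PERSISTENT_2" if long_lived else "SCRATCH_2",
    }
    if long_lived:
        # Assumption: lowest throughput tier, in MB/s per TiB of storage.
        lustre_config["PerUnitStorageThroughput"] = 125
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": capacity_gib,
        "SubnetIds": [subnet_id],
        "LustreConfiguration": lustre_config,
    }

# With credentials configured, the result could be passed on like this:
#   import boto3
#   boto3.client("fsx").create_file_system(**lustre_request(1200, "subnet-...", False))
```

Keeping the request as a plain dictionary makes the decision easy to review before anything is provisioned.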
Compute and networking
- GPU optimized instances
- EC2 fleets (spot instances and/or spot fleets)
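- Placement groups - typically a cluster placement group for tightly coupled jobs, which packs instances close together in one Availability Zone for low-latency node-to-node traffic
- Placement groups - for grid computing, where workloads are distributed, loosely coupled, and don't need tight communication between nodes, use partition placement groups; Auto Scaling groups (ASG) are a good fit here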
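The placement group choice above can be written down as a small decision helper. This is a sketch under stated assumptions: the function name, the `tightly_coupled` flag, and the default partition count are hypothetical. The returned keys follow the parameters of boto3's EC2 `create_placement_group` call, but the block itself makes no AWS calls.

```python
def placement_group_params(name: str, tightly_coupled: bool, partition_count: int = 3) -> dict:
    """Pick a placement strategy from how tightly the nodes communicate."""
    if tightly_coupled:
        # Tightly coupled HPC (e.g. MPI jobs): pack nodes close together.
        return {"GroupName": name, "Strategy": "cluster"}
    # Loosely coupled grid work: isolate groups of nodes on separate hardware.
    return {
        "GroupName": name,
        "Strategy": "partition",
        "PartitionCount": partition_count,  # hypothetical default
    }

# With credentials configured, the result could be used as:
#   import boto3
#   boto3.client("ec2").create_placement_group(**placement_group_params("hpc", True))
```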
Orchestration and automation
- Batch
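- AWS ParallelCluster - an open-source cluster management tool that is configured through a simple text file instead of a hand-written CloudFormation template (it generates and deploys the CloudFormation stacks for you).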
Networking
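- Bigger MTU (jumbo frames, 9001 bytes on AWS) enables higher throughput because each packet carries more payload; risky outside of the VPC, where paths are typically limited to 1500 bytes and oversized packets are fragmented or dropped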
- Enhanced networking - provides higher bandwidth, higher packets per second (PPS), and lower latency. It is available through the Elastic Network Adapter (ENA) or the legacy Intel 82599 Virtual Function (VF) interface. Use instance types that support enhanced networking!
- Elastic Fabric Adapter (EFA) - a network interface that lets Message Passing Interface (MPI) applications bypass the Linux kernel's network stack and talk to the adapter directly.
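The throughput benefit of a bigger MTU comes from spending fewer bytes and fewer packets on headers. A rough illustration of that arithmetic follows, assuming 40 bytes of IPv4 and TCP headers per packet with no options; the function names are hypothetical and real traffic will differ.

```python
HEADER_BYTES = 40  # assumption: 20-byte IPv4 header + 20-byte TCP header, no options

def packets_needed(payload_bytes: int, mtu: int) -> int:
    """Number of packets required to carry payload_bytes at a given MTU."""
    per_packet_payload = mtu - HEADER_BYTES
    # Integer ceiling division avoids floating-point rounding.
    return -(-payload_bytes // per_packet_payload)

def payload_efficiency(mtu: int) -> float:
    """Fraction of each full-size packet that is application payload."""
    return (mtu - HEADER_BYTES) / mtu
```

Under these assumptions, sending 1,000,000 bytes takes 685 packets at an MTU of 1500 and 112 packets at 9001, so the hosts and network handle far fewer packets for the same data.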