Requirements
AWS Services
The following AWS services must be available and usable in the target AWS account where CCME clusters are to be configured and used.
AWS ParallelCluster v3 services
The services required by AWS ParallelCluster v3 are described in the official AWS ParallelCluster online documentation. Rather than duplicating that list here, we refer the reader to this documentation.
Additional mandatory AWS services
There are a few additional services and tools that are required for CCME:
AWS Simple Notification Service (SNS): service used to send email or SMS notifications about cluster-related events
AWS Budgets: service used to configure budgets and send alerts (through AWS SNS) when costs related to AWS resource consumption reach predefined thresholds
AWS Cost Explorer: service used to monitor near-real-time costs of AWS resource consumption in the account
AWS Cost and Usage Report: service used to generate cost and usage reports and store them automatically in an S3 bucket
AWS Price List: service used to provide up-to-date AWS resource prices at any time to prediction tools running in the backend of the clusters
AWS Directory Service: service used to create and manage user directories to be attached to clusters for user authentication
AWS Key Management Service (KMS): service used to deliver encryption keys for all at-rest and in-transit data in the cloud
AWS Secrets Manager: service used to store in a secure way critical information like passwords for elevated privileges accounts
AWS Certificate Manager: service used to store and manage security certificates in the cloud
AWS Elastic Load Balancing: service used to deliver scalable Load Balancing and proxying capabilities
NICE DCV: AWS software used to deliver remote visualization on Linux and Windows desktops
NICE EnginFrame: AWS software used to deliver HPC-as-a-Service in a web user interface
Additional optional AWS services
Optional services that UCit strongly recommends enabling in the AWS account where the customer's HPC clusters will run:
AWS Systems Manager (SSM): service that can be used to manage EC2 instances at system level, including for example SSH-like access through HTTPS
AWS Simple Queue Service (SQS): service that can be used to build up execution workflows by chaining tasks together, loosely and asynchronously, through open messages
Amazon Relational Database Service (RDS): service that can be used to store scalable and highly available databases in the cloud
AWS CloudTrail: service that can be used to record and store any API call executed within the AWS account
AWS Athena: service that can be used to query and filter CloudTrail logs stored in a S3 bucket
AWS Cloud9: service that can be used to build up customer-specific IDEs in the cloud
AWS WAF & Shield: services that can be used to improve security at the AWS account's user access points
AWS Backup: service that can be used to create cloud-based backup of critical data stored in the cloud
AWS Snow Family: service that can be used to import large amounts of data to the cloud (mainly when live network-based transfers are not possible)
AWS Billing: service that can be used to retrieve detailed billing information on the AWS account
AWS Environment
Key Management Service (KMS)
For CCME security needs, it is MANDATORY to create 9 dedicated customer-managed keys.
It is recommended to define a <KMS_PREFIX> in order to separate the CCME prerequisites and resources from other resources deployed in your AWS account.
CCME Management Host (CMH) EBS
Key type: Symmetric
Key usage: Encrypt and decrypt
Advanced options:
  Key material origin: KMS - recommended
  Regionality: Single-Region key
Alias: <KMS_PREFIX>/cmh_ebs
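As an illustration, such a key could be created with the AWS CLI. The following is a minimal sketch, assuming credentials with KMS administration rights (the key description is an assumption):

# Create a symmetric encrypt/decrypt, single-Region key
aws kms create-key \
  --description "CCME CMH EBS key" \
  --key-spec SYMMETRIC_DEFAULT \
  --key-usage ENCRYPT_DECRYPT
# Attach the alias, using the KeyId returned by the previous command
aws kms create-alias \
  --alias-name "alias/<KMS_PREFIX>/cmh_ebs" \
  --target-key-id <KEY_ID>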
DCV Proxy EBS
Key type: Symmetric
Key usage: Encrypt and decrypt
Advanced options:
  Key material origin: KMS - recommended
  Regionality: Single-Region key
Alias: <KMS_PREFIX>/dcv_proxy_ebs
Key policy: Configure the KMS policy as follows, replacing <ACCOUNT_ID> with your AWS account ID.
{ "Id": "KMS DCV Proxy Policy", "Version": "2012-10-17", "Statement": [ { "Sid": "Enable IAM User Permissions", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<ACCOUNT_ID>:root" }, "Action": "kms:*", "Resource": "*" }, { "Sid": "Allow use of the key", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<ACCOUNT_ID>:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling" }, "Action": [ "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*", "kms:GenerateDataKey*", "kms:DescribeKey" ], "Resource": "*" }, { "Sid": "Allow attachment of persistent resources", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<ACCOUNT_ID>:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling" }, "Action": [ "kms:CreateGrant", "kms:ListGrants", "kms:RevokeGrant" ], "Resource": "*", "Condition": { "Bool": { "kms:GrantIsForAWSResource": "true" } } } ] }
Clusters EBS
Key type: Symmetric
Key usage: Encrypt and decrypt
Advanced options:
  Key material origin: KMS - recommended
  Regionality: Multi-Region key
Alias: <KMS_PREFIX>/clusters_ebs
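This key is the only Multi-Region one; in the AWS CLI sketch above, this corresponds to adding the --multi-region flag:

# Multi-Region variant of the create-key sketch (description is an assumption)
aws kms create-key \
  --description "CCME clusters EBS key" \
  --key-spec SYMMETRIC_DEFAULT \
  --key-usage ENCRYPT_DECRYPT \
  --multi-region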
File Systems (EFS and FSx)
Key type: Symmetric
Key usage: Encrypt and decrypt
Advanced options:
  Key material origin: KMS - recommended
  Regionality: Single-Region key
Alias: <KMS_PREFIX>/file_systems
S3
Key type: Symmetric
Key usage: Encrypt and decrypt
Advanced options:
  Key material origin: KMS - recommended
  Regionality: Single-Region key
Alias: <KMS_PREFIX>/s3
SNS
Key type: Symmetric
Key usage: Encrypt and decrypt
Advanced options:
  Key material origin: KMS - recommended
  Regionality: Single-Region key
Alias: <KMS_PREFIX>/sns
Secrets
Key type: Symmetric
Key usage: Encrypt and decrypt
Advanced options:
  Key material origin: KMS - recommended
  Regionality: Single-Region key
Alias: <KMS_PREFIX>/secrets
CloudWatch
Key type: Symmetric
Key usage: Encrypt and decrypt
Advanced options:
  Key material origin: KMS - recommended
  Regionality: Single-Region key
Alias: <KMS_PREFIX>/cloudWatch
Key policy: Configure the KMS policy as follows, replacing <ACCOUNT_ID> with your AWS account ID and <AWS_REGION> with the AWS deployment region.

{
  "Version": "2012-10-17",
  "Id": "KMS CloudWatch Policy",
  "Statement": [
    {
      "Sid": "Enable IAM User Permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<ACCOUNT_ID>:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "logs.<AWS_REGION>.amazonaws.com"
      },
      "Action": [
        "kms:Encrypt*",
        "kms:Decrypt*",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:Describe*"
      ],
      "Resource": "*"
    }
  ]
}
Lambda
Key type: Symmetric
Key usage: Encrypt and decrypt
Advanced options:
  Key material origin: KMS - recommended
  Regionality: Single-Region key
Alias: <KMS_PREFIX>/lambda
S3 Storage
For CCME storage needs, it is necessary to create 3 dedicated private buckets with the AWS S3 service:
One for the S3 bucket access logs, defined as follows:
  Note: this bucket will store the access logs of the other buckets below.
  Name (example): ucit-ccme-logging-eu-west-1
  Default properties
One for the CCME solution (scripts, templates and configuration files), defined as follows:
  Name (example): ucit-ccme-internals-eu-west-1
  Default properties
One for the long-term storage of data, defined as follows:
  Name (example): ucit-ccme-userdata-eu-west-1
  Default properties
For those buckets:
Block public access: block all
  BlockPublicAcls: True
  BlockPublicPolicy: True
  IgnorePublicAcls: True
  RestrictPublicBuckets: True
Versioning configuration: Enabled
Server access logging: enabled, targeting your S3 logging bucket with a prefix based on your bucket name (EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME/ as example)
Bucket encryption: enabled by default using AES256
These settings can be applied as shown in the sketch below.
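As an illustration, these properties could be applied with the AWS CLI once the buckets are created; a minimal sketch for a single bucket, reusing the example names above:

# Apply the mandatory properties to one of the CCME buckets
bucket="ucit-ccme-internals-eu-west-1"
aws s3api put-public-access-block --bucket "$bucket" \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
aws s3api put-bucket-versioning --bucket "$bucket" \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket "$bucket" \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
# Send access logs to the logging bucket, prefixed with this bucket's name
aws s3api put-bucket-logging --bucket "$bucket" \
  --bucket-logging-status "{\"LoggingEnabled\":{\"TargetBucket\":\"ucit-ccme-logging-eu-west-1\",\"TargetPrefix\":\"$bucket/\"}}"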
S3 bucket policy allowing only TLS 1.2 and above, using EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME as an example bucket name.
Format as YAML:
Version: 2012-10-17
Id: CCMEBucketPolicy
Statement:
  - Sid: enforce-tls-12-requests-only
    Effect: Deny
    Principal:
      AWS: '*'
    Action: '*'
    Resource:
      - 'arn:aws:s3:::EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME/*'
      - 'arn:aws:s3:::EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME'
    Condition:
      NumericLessThan:
        's3:TlsVersion': '1.2'
  - Sid: enforce-tls-requests-only
    Effect: Deny
    Principal:
      AWS: '*'
    Action: '*'
    Resource:
      - 'arn:aws:s3:::EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME/*'
      - 'arn:aws:s3:::EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME'
    Condition:
      Bool:
        'aws:SecureTransport': 'false'
Format as JSON:
{ "Version": "2012-10-17", "Id": "CCMEBucketPolicy", "Statement": [ { "Sid": "enforce-tls-12-requests-only", "Effect": "Deny", "Principal": { "AWS": "*" }, "Action": "*", "Resource": [ "arn:aws:s3:::EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME/*", "arn:aws:s3:::EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME" ], "Condition": { "NumericLessThan": { "s3:TlsVersion": "1.2" } } }, { "Sid": "enforce-tls-requests-only", "Effect": "Deny", "Principal": { "AWS": "*" }, "Action": "*", "Resource": [ "arn:aws:s3:::EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME/*", "arn:aws:s3:::EXAMPLE_OF_CCME_CLUSTER_S3_BUCKET_NAME" ], "Condition": { "Bool": { "aws:SecureTransport": "false" } } } ] }
Virtual Private Cloud
CCME inherits from AWS ParallelCluster the ability to create on the fly, for each cluster, the network environment in which that cluster will run.
For the network needs of your future CCME clusters, it is also possible to create in advance a network environment specifically adapted to your needs with the AWS VPC service. In this case, you can create:
A VPC with the following attributes:
  Name (example): VPC-CCME
  CIDR block large enough to include the subnets (example: 10.0.0.0/16)
  CIDR IPv6: false
  Tenancy: Default
  Options:
    DNS Resolution: yes
    DNS Hostnames: yes
Also check that there is a set of DHCP options created by default with the following values:
  domain-name = <REGION>.compute.internal; domain-name-servers = AmazonProvidedDNS;
where the value of <REGION> matches the AWS Region in which the environment is created, e.g., Ireland region = eu-west-1.
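As an illustration, such a VPC could be created and configured with the AWS CLI; a minimal sketch:

# Create the VPC and capture its ID
vpc_id=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=VPC-CCME}]' \
  --query 'Vpc.VpcId' --output text)
# Enable DNS resolution and DNS hostnames (one attribute per call)
aws ec2 modify-vpc-attribute --vpc-id "$vpc_id" --enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute --vpc-id "$vpc_id" --enable-dns-hostnames '{"Value":true}'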
One VPC Endpoint with the following attributes:
  Name (example): CCME-S3Endpoint
  Type: Gateway
  Service: com.amazonaws.{{ REGION }}.s3
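A sketch of the corresponding Gateway endpoint creation, assuming the eu-west-1 region and <RTB_ID> as the route table to associate:

aws ec2 create-vpc-endpoint \
  --vpc-id "$vpc_id" \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.eu-west-1.s3 \
  --route-table-ids <RTB_ID> \
  --tag-specifications 'ResourceType=vpc-endpoint,Tags=[{Key=Name,Value=CCME-S3Endpoint}]'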
Subnets with the following attributes:
  Two FrontEnd subnets:
    Name (example): CCME-FrontEnd-subnet-az1 and CCME-FrontEnd-subnet-az2
    CIDR compatible with the previously created VPC (example: 10.0.1.0/24 and 10.0.2.0/24)
    Availability Zone: the two subnets must be in different availability zones.
    Note: those subnets must be public if the Application Load Balancer should be accessible through the Internet.
  Two BackEnd subnets:
    Name (example): CCME-BackEnd-subnet-az1 and CCME-BackEnd-subnet-az2
    CIDR compatible with the previously created VPC (example: 10.0.3.0/24 and 10.0.4.0/24)
    Availability Zone: the two subnets must be in different availability zones.
    Note: those subnets should be private for security reasons, but can be public to give the CMH and clusters public IPs.
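A sketch of the creation of one of these subnets (repeat for the other subnets with their respective names, CIDR blocks and availability zones):

aws ec2 create-subnet \
  --vpc-id "$vpc_id" \
  --cidr-block 10.0.1.0/24 \
  --availability-zone eu-west-1a \
  --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=CCME-FrontEnd-subnet-az1}]'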
For the network configuration and Internet access of the ALB, CMH and clusters, there are three possibilities:
Option 1: both FrontEnd and BackEnd subnets are public
One Internet Gateway (IGW) with the following attributes:
  Name (example): CCME-IGW
  Attached to: the previously created VPC
After the creation:
  Update the route table (rt) of the FrontEnd subnets:
    Destination: 0.0.0.0/0
    Target: ID of the Internet Gateway
  Update the route table (rt) of the BackEnd subnets:
    Destination: 0.0.0.0/0
    Target: ID of the Internet Gateway
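For this first option, the Internet Gateway creation and route updates could look like the following sketch, where <RTB_FRONTEND> and <RTB_BACKEND> are the route table IDs of the FrontEnd and BackEnd subnets:

# Create the Internet Gateway and attach it to the VPC
igw_id=$(aws ec2 create-internet-gateway \
  --tag-specifications 'ResourceType=internet-gateway,Tags=[{Key=Name,Value=CCME-IGW}]' \
  --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --internet-gateway-id "$igw_id" --vpc-id "$vpc_id"
# Default route to the Internet Gateway for both route tables
aws ec2 create-route --route-table-id <RTB_FRONTEND> --destination-cidr-block 0.0.0.0/0 --gateway-id "$igw_id"
aws ec2 create-route --route-table-id <RTB_BACKEND> --destination-cidr-block 0.0.0.0/0 --gateway-id "$igw_id"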
Option 2: FrontEnd subnets are public and BackEnd subnets are private
One Internet Gateway (IGW) with the following attributes:
  Name (example): CCME-IGW
  Attached to: the previously created VPC
After the creation:
  Update the route table (rt) of the FrontEnd subnets:
    Destination: 0.0.0.0/0
    Target: ID of the Internet Gateway
One NAT Gateway with the following attributes:
  Name (example): CCME-NGW
  Connectivity type: Public
  Subnet: you need to create that NAT Gateway within one of the previously created FrontEnd subnets
After the creation:
  Update the route table (rt) of the BackEnd subnets:
    Destination: 0.0.0.0/0
    Target: ID of the NAT Gateway
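For this second option, the additional NAT Gateway could be sketched as follows, reusing the Internet Gateway from the previous sketch; <FRONTEND_SUBNET_ID> and <RTB_BACKEND> are placeholders:

# A public NAT Gateway requires an Elastic IP
eip_alloc=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)
nat_id=$(aws ec2 create-nat-gateway \
  --subnet-id <FRONTEND_SUBNET_ID> \
  --allocation-id "$eip_alloc" \
  --connectivity-type public \
  --query 'NatGateway.NatGatewayId' --output text)
# Default route of the BackEnd subnets through the NAT Gateway
aws ec2 create-route --route-table-id <RTB_BACKEND> --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$nat_id"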
Option 3: both FrontEnd and BackEnd subnets are private
One Public subnet with the following attributes:
  Name (example): CCME-Public-subnet-az1
  CIDR compatible with the previously created VPC (example: 10.0.5.0/24)
  Availability Zone: one of the availability zones used for the FrontEnd and BackEnd subnets.
  Note: this subnet must be public if the Application Load Balancer should be accessible through the Internet.
One Internet Gateway (IGW) with the following attributes:
  Name (example): CCME-IGW
  Attached to: the previously created VPC
After the creation:
  Update the route table (rt) of the Public subnet:
    Destination: 0.0.0.0/0
    Target: ID of the Internet Gateway
One NAT Gateway with the following attributes:
  Name (example): CCME-NGW
  Connectivity type: Public
  Subnet: you need to create that NAT Gateway within the previously created Public subnet
After the creation:
  Update the route table (rt) of the FrontEnd subnets:
    Destination: 0.0.0.0/0
    Target: ID of the NAT Gateway
  Update the route table (rt) of the BackEnd subnets:
    Destination: 0.0.0.0/0
    Target: ID of the NAT Gateway
Secrets Manager Service
For the security needs of future access to the directory service (Active Directory or LDAP, in AWS or not), it is necessary to create a "secret" with the Secrets Manager service in order to store the password of a user with read rights on the directory service.
The created secret must be of type String and written as plaintext.
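As an illustration, such a secret could be created with the AWS CLI; the secret name is a hypothetical example, and the key refers to the <KMS_PREFIX>/secrets KMS key defined earlier:

aws secretsmanager create-secret \
  --name "CCME/directory-reader-password" \
  --secret-string "<DIRECTORY_USER_PASSWORD>" \
  --kms-key-id "alias/<KMS_PREFIX>/secrets"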
SSL Certificates
For the security needs of access to future CCME clusters, it is necessary to create an HTTPS (X.509) certificate signed by a certification authority, or to import a pre-existing certificate, into the Certificate Manager service. This certificate will be associated with the Application Load Balancer (ALB).
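A pre-existing certificate could be imported into Certificate Manager with the AWS CLI as follows (the file names are hypothetical):

aws acm import-certificate \
  --certificate fileb://certificate.pem \
  --private-key fileb://private_key.pem \
  --certificate-chain fileb://certificate_chain.pem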
Supported Operating Systems
CCME supports the following OS (Image.Os
parameter of ParallelCluster configuration file):
Amazon Linux 2
x86_64
andaarch64
(alinux2
)Red Hat Entreprise Linux 8/9
x86_64
andaarch64
(rhel8
andrhel9
)Rocky 8/9
x86_64
andaarch64
(rocky8
androcky9
) only fromCustomAmi
Warning
aarch64 does not support GPU instances.
CCME does not yet support DCV on instances with NVIDIA GPUs on Rocky Linux 9 (apart from g4dn instances).
Note
NVIDIA drivers are not installed by default for Rocky 8/9 GPU instances.
You can build a custom ParallelCluster image of Rocky8 or Rocky9 with NVIDIA drivers.
For Rocky8:
  In your pcluster build-image configuration file:
    Set a Rocky8 image as ParentImage
    Set the following settings for DevSettings:
      DevSettings:
        Cookbook:
          ExtraChefAttributes: '{"cluster": {"nvidia": {"enabled": "yes"}}}'
  Use the pcluster build-image command line interface (CLI), as in the sketch below
  Use that newly created AMI as CustomAmi in your cluster configuration files
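A minimal sketch of such a build-image configuration and invocation; the image ID, parent AMI and build instance type are assumptions to adapt to your environment:

# Hypothetical build-image configuration for a Rocky8 image with NVIDIA drivers
cat > rocky8-nvidia.yaml <<'EOF'
Build:
  InstanceType: g4dn.xlarge        # build instance type (assumption)
  ParentImage: <ROCKY8_AMI_ID>     # Rocky8 parent AMI
DevSettings:
  Cookbook:
    ExtraChefAttributes: '{"cluster": {"nvidia": {"enabled": "yes"}}}'
EOF
pcluster build-image --image-id rocky8-nvidia --image-configuration rocky8-nvidia.yaml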
For Rocky9 (only for g4dn instances):
  In your pcluster build-image configuration file, set a Rocky9 image as ParentImage
  Use the pcluster build-image command line interface (CLI)
  Launch an EC2 instance based on the image generated by pcluster
  Run the following script on the newly created instance:

# Documentation: https://github.com/NVIDIA/open-gpu-kernel-modules/tree/535
nvidia_version="535.183.01"
# Download the NVIDIA installer and the matching open GPU kernel modules
wget "https://us.download.nvidia.com/XFree86/Linux-x86_64/${nvidia_version}/NVIDIA-Linux-x86_64-${nvidia_version}.run"
wget "https://github.com/NVIDIA/open-gpu-kernel-modules/archive/refs/heads/535.zip"
unzip 535.zip
# Build and install the open kernel modules
cd open-gpu-kernel-modules-535
make modules -j$(nproc)
sudo make modules_install -j$(nproc)
cd ..
# Install the user-space NVIDIA components, skipping the proprietary kernel modules
sudo sh "NVIDIA-Linux-x86_64-${nvidia_version}.run" -Z -s --no-kernel-modules
  Create an AMI based on your EC2 instance, as in the sketch below
  Use that newly created AMI as CustomAmi in your cluster configuration files
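The AMI can be created from the instance through the EC2 console, or with a CLI call such as the following (the AMI name is an assumption):

aws ec2 create-image \
  --instance-id <INSTANCE_ID> \
  --name "ccme-rocky9-nvidia" \
  --reboot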
CCME uses the default ParallelCluster AMIs when launching a cluster. It is possible to use a custom-made AMI by following the documentation: Building a custom AWS ParallelCluster AMI.
The CCME Management Host supports the following operating systems (management_host_os parameter of the deployment configuration file):
  Amazon Linux 2023 x86_64 and aarch64 (al2023)
  Red Hat Enterprise Linux 8/9 x86_64 (rhel8 and rhel9)
Whitelisting URLs and repositories
In order to build CCME AMIs (including installing AWS ParallelCluster), or even just to run CCME without a specific AMI, the following URLs need to be accessible from the instances (either the one used to build the AMI, or all instances deployed by CCME/AWS ParallelCluster):
URLs to download AWS ParallelCluster
URLs for Red Hat
URLs for Amazon and AWS
URLs for NVIDIA
URLs for DCV and EnginFrame
URLs for EF Portal
URLs for various packages and tools
List of repositories that must be allowed in case you have a local yum repository (mirror):
aws-parallelcluster-platform::nvidia_install
aws-parallelcluster-platform::install
aws-parallelcluster-environment::install
aws-parallelcluster-slurm::install
aws-parallelcluster-slurm::install_jwt
aws-parallelcluster-slurm::install_slurm
aws-fsx
yum-epel::default
Packages and dependencies
Multiple third-party software installers are used by CCME. In the case of a private deployment, the following packages must be placed in:
- CCME/pkgs/deps/
- management/pkgs/deps
Please refer to the dependencies.yaml file regarding the versions of the packages.
For all dependencies, two variables are available:
- arch allows x86_64 or aarch64
- os allows el7 or el8
- Management Host:
aws-cfn-bootstrap-py3-latest.tar.gz
amazon-ssm-agent.rpm
amazon-ssm-agent.rpm.sig
ansible-posix-${ansible_collection.ansible_posix}.tar.gz
community-general-${ansible_collection.community_general}.tar.gz
amazon-aws-${ansible_collection.amazon_aws}.tar.gz
community-aws-${ansible_collection.community_aws}.tar.gz
- Clusters:
Multi-arch:
NICE-GPG-KEY
MariaDB-Server-GPG-KEY
slurm-${slurm}.tar.gz
enginframe-${enginframe.version}-r${enginframe.revision}.jar
ansible-posix-${ansible_collection.ansible_posix}.tar.gz
community-general-${ansible_collection.community_general}.tar.gz
amazon-aws-${ansible_collection.amazon_aws}.tar.gz
community-aws-${ansible_collection.community_aws}.tar.gz
arch = x86_64:
amazon-ssm-agent.rpm
amazon-ssm-agent.rpm.sig
galera-${galera.os.arch}.rpm
MariaDB-client-${mariadb.os.arch}.${arch}.rpm
MariaDB-common-${mariadb.os.arch}.${arch}.rpm
MariaDB-compat-${mariadb.os.arch}.${arch}.rpm
MariaDB-server-${mariadb.os.arch}.${arch}.rpm
nice-dcv-gl-${dcv.version}.${dcv.gl}.${os}.${arch}.rpm
nice-dcv-gltest-${dcv.version}.${dcv.gltest}.${os}.${arch}.rpm
nice-dcv-server-${dcv.version}.${dcv.patch}-1.${os}.${arch}.rpm
nice-dcv-session-manager-agent-${dcvsm.version}.${dcvsm.agent_patch}.${os}.${arch}.rpm
nice-dcv-session-manager-broker-${dcvsm.version}.${dcvsm.broker_patch}.${os}.noarch.rpm
nice-dcv-simple-external-authenticator-${dcv.version}.${dcv.sea}.${os}.${arch}.rpm
nice-dcv-web-viewer-${dcv.version}.${dcv.patch}-1.${os}.${arch}.rpm
nice-xdcv-${dcv.version}.${dcv.xdcv}.${os}.${arch}.rpm
arch = aarch64:
amazon-ssm-agent.rpm
amazon-ssm-agent.rpm.sig
galera-${galera.os.arch}.${arch}.rpm
MariaDB-client-${mariadb.os.arch}.${os}.centos.${arch}.rpm
MariaDB-common-${mariadb.os.arch}.${os}.centos.${arch}.rpm
MariaDB-compat-${mariadb.os.arch}.${os}.centos.${arch}.rpm
MariaDB-server-${mariadb.os.arch}.${os}.centos.${arch}.rpm
nice-dcv-server-${dcv.version}.${dcv.patch}-1.${os}.${arch}.rpm
nice-dcv-session-manager-agent-${dcvsm.version}.${dcvsm.agent_patch}.${os}.${arch}.rpm
nice-dcv-session-manager-broker-${dcvsm.version}.${dcvsm.broker_patch}.${os}.noarch.rpm
nice-dcv-simple-external-authenticator-${dcv.version}.${dcv.sea}.${os}.${arch}.rpm
nice-dcv-web-viewer-${dcv.version}.${dcv.patch}-1.${os}.${arch}.rpm
nice-xdcv-${dcv.version}.${dcv.xdcv}.${os}.${arch}.rpm
s3fs-fuse-${s3fsfuse.arch}.${os}.${arch}.rpm
Costs
This AWS calculator shows an example of costs associated with using CCME under the following assumptions:
All instance prices are on-demand
The selected region is eu-west-1 (Ireland)
CCME Management stack:
  Management Host deployed on a t2.small instance, with 8GB of EBS gp3
  Active Directory deployed by CCME
  10GB of storage in S3
Cluster:
  Headnode deployed on a m5.4xlarge instance, with 100GB of EBS gp3
  Compute:
    1 queue with hpc7a.96xlarge instances. Usage on average over a month: 3 instances, thus 576 cores.
    1 queue with g5.4xlarge instances. Usage on average over a month: 1 instance.
  Storage:
    1 TB of home directories, hosted on EFS
    1.2 TB of scratch, hosted on FSx for Lustre, 1000 MBps/TiB
Warning
The above AWS Calculator is just an example that must be adapted to the deployment you actually plan (size of the headnode, number and type of compute resources, types and sizes of storage…). You must also take into account any discounts you may have on your AWS account, including reserved instances. Spot pricing can also be considered for short workloads, or workloads with checkpoint and restart capabilities.