Cluster

Configuration

Global configuration

In order to launch a cluster, you first need to create a configuration file for it. You can either write your own configuration in /home/$(whoami)/.parallelcluster/config, or edit one of the examples provided in /home/$(whoami)/.parallelcluster/ on the Management Host (for instance /home/$(whoami)/.parallelcluster/simple for a simple cluster). We provide below an example, in which you must replace each YAML variable (e.g., {{ CCME_VARIABLE }}) with the required information. Make sure you use the correct subnets, security groups and policies.

The HeadNode.CustomActions.OnNodeStart.Args section of the ParallelCluster configuration file can contain a set of parameters that will influence the configuration of your cluster. Here is the list of supported parameters:

  • S3 related parameters

    • CCME_S3FS (optional): list of S3 buckets to mount through S3FS (the policies attached to the HeadNode and Computes need to have read/write access to these buckets). Format is CSV: CCME_S3FS=bucket1,bucket2,bucket3 (or simply CCME_S3FS=bucket if there is a single bucket). If this variable is unset or equal to NONE, then no bucket is mounted through S3FS.

    • CCME_JSLOGS_BUCKET (mandatory): name of an S3 bucket on which Slurm accounting logs will be exported (see Slurm accounting logs, the policies attached to the HeadNode and Computes need to have read/write access to this bucket).

  • ALB related parameters

    • CCME_OIDC: a prefix used to locate the CCME/conf/${CCME_OIDC}.oidc.yaml file that contains configurations for OIDC authentication (see OIDC External Authentication).

    • CCME_DNS: the DNS name to use instead of the default ALB DNS name, e.g., my.domain.com. By default, the cluster URLs are alb_url/my_cluster_name/portal/ for the portal and alb_url/my_cluster_name/dcv-instance-id for visualization, where alb_url is the DNS name of the Application Load Balancer. This default can be replaced by a custom domain name: if you configure your DNS so that personal.domain.com points to the Application Load Balancer, set CCME_DNS to personal.domain.com instead of NONE to expose your cluster web access at personal.domain.com/my_cluster/portal/ and personal.domain.com/my_cluster/visualization.

  • User related parameters

    • CCME_USER_HOME: the user home can be mounted from a file system by setting CCME_USER_HOME to the mount point path. You can use the %u parameter to refer to the username (e.g., /file-system/home/%u).

  • Remote visualization related parameters (see Prerequisites and configuration for a complete description of these parameters):

    • CCME_WIN_LAUNCH_TEMPLATE_ID: launch template used to launch Windows EC2 instances. CCME creates a default launch template when deploying the CMH, but you can set up your own here.

    • CCME_WIN_AMI: ID of the AMI used to launch Windows EC2 instances (see Prerequisites and configuration for prerequisites).

    • CCME_WIN_INSTANCE_TYPE: instance type used to launch Windows EC2 instances.

    • CCME_WIN_INACTIVE_SESSION_TIME, CCME_WIN_NO_SESSION_TIME and CCME_WIN_NO_BROKER_COMMUNICATION_TIME: parameters to control the lifecycle of Windows remote visualization sessions.

  • EnginFrame related parameters

    • CCME_EF_ADMIN_GROUP: the name of an OS group; users belonging to this group are automatically promoted to administrators in EnginFrame (this does not grant sudo access).

    • CCME_EF_ADMIN_PASSWORD: ARN of a secret containing the password of the EnginFrame admin account. The expected value is the ARN of a preexisting plaintext secret stored in AWS Secrets Manager (ASM). Leave this variable unset to let CCME generate a password and store it in /shared/CCME/ccme.passwords.efadmin.

  • Remote access related parameters

    • CCME_AWS_SSM: if set to true, the AWS SSM agent will be installed on all nodes to allow remote connections to them through AWS SSM.
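
For reference, here is a minimal sketch of how these parameters are passed in the Args section (bucket names such as data-bucket-1 and data-bucket-2 are hypothetical; a complete example configuration file is provided further below):

HeadNode:
  CustomActions:
    OnNodeStart:
      Script: s3://{{ CCME_SOURCES }}CCME/sbin/pre-install.sh
      Args:
        - CCME_S3FS=data-bucket-1,data-bucket-2  # CSV list of buckets to mount with S3FS, or NONE
        - CCME_JSLOGS_BUCKET=data-bucket-1       # mandatory: bucket receiving Slurm accounting logs
        - CCME_AWS_SSM=true                      # install the AWS SSM agent on all nodes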

Note

Do not set any parameters in the following sections, as they will be inherited from the HeadNode.CustomActions.OnNodeStart.Args parameters:

  • HeadNode.CustomActions.OnNodeConfigured.Args

  • HeadNode.CustomActions.OnNodeUpdated.Args

  • Scheduling.SlurmQueues[].CustomActions.OnNodeStart.Args

  • Scheduling.SlurmQueues[].CustomActions.OnNodeConfigured.Args

Note

The CCME solution is downloaded from S3 onto the HeadNode. The download directory is then exported from the HeadNode and mounted on /opt/CCME on each ComputeNode using NFS.
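
You can verify this share from any ComputeNode with standard Linux commands (the exact output depends on your environment):

$ df -h /opt/CCME
$ mount | grep /opt/CCME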

CCME applies its configurations and installs software on ParallelCluster clusters through a set of Bash and Ansible scripts. The entry points are the Bash scripts specified in the OnNodeStart, OnNodeConfigured and OnNodeUpdated parameters of the HeadNode.CustomActions and Scheduling.SlurmQueues[].CustomActions sections. The script values presented below (and included in the generated example configuration files) must always be present.

Example ParallelCluster configuration file for CCME
Region: '{{ AWS_REGION }}'
CustomS3Bucket: '{{ CCME_CLUSTER_S3BUCKET }}'
Iam:
  Roles:
    LambdaFunctionsRole: '{{ CCME_CLUSTER_LAMBDA_ROLE }}'
  # If the role associated to the cluster includes a custom IAM path prefix,
  # replace "parallelcluster" by the custom IAM path prefix.
  ResourcePrefix: "parallelcluster"
Image:
  Os: {{ "alinux2" or "centos7" }}
Tags:
  - Key: Owner
    Value: '{{ CCME_OWNER }}'
  - Key: Reason
    Value: '{{ CCME_REASON }}'
SharedStorage:
  - Name: shared
    StorageType: Ebs
    MountDir: shared
HeadNode:
  InstanceType: t3.medium
  Networking:
    SubnetId: '{{ CCME_SUBNET }}'
    SecurityGroups:
      - '{{ CCME_PRIVATE_SG }}'
  Ssh:
    KeyName: '{{ AWS_KEYNAME }}'
  LocalStorage:
    RootVolume:
      Size: 50
      Encrypted: true
  CustomActions:
    OnNodeStart:
      Script: s3://{{ CCME_SOURCES }}CCME/sbin/pre-install.sh
      Args:
        - CCME_S3FS={{ CCME_DATA_BUCKET }}
        - CCME_JSLOGS_BUCKET={{ CCME_DATA_BUCKET }}
        - CCME_EF_ADMIN_PASSWORD="arn:aws:secretsmanager:eu-west-1:012345678910:secret:ccme-prefix-efadmin.password-4riso"
        - CCME_OIDC=default
        - CCME_USER_HOME=/file-system/home/%u
        - CCME_DNS="my.domain.com"
        - CCME_EF_ADMIN_GROUP="Administrators"
        - CCME_REPOSITORY_PIP="https://my.pip.domain.com/index/,https://my.pip.domain.com/index-url/"
        # Optional windows fleet
        - CCME_WIN_AMI="ami-i..."
        - CCME_WIN_INSTANCE_TYPE=NONE
        - CCME_WIN_INACTIVE_SESSION_TIME=600
        - CCME_WIN_NO_SESSION_TIME=600
        - CCME_WIN_NO_BROKER_COMMUNICATION_TIME=600
        - CCME_AWS_SSM=true
    OnNodeConfigured:
      Script: s3://{{ CCME_SOURCES }}CCME/sbin/post-install.sh
    OnNodeUpdated:
      Script: s3://{{ CCME_SOURCES }}CCME/sbin/update-install.sh
  Iam:
    S3Access:
      - BucketName: '{{ CCME_BUCKET }}'
      - EnableWriteAccess: true
        BucketName: '{{ CCME_DATA_BUCKET }}'
    AdditionalIamPolicies:
    - Policy: '{{ CCME_CLUSTER_POLICY_ALB }}'
    - Policy: '{{ CCME_CLUSTER_POLICY_DCV }}'
    - Policy: '{{ CCME_CLUSTER_POLICY_EF }}'
    - Policy: '{{ CCME_CLUSTER_POLICY_JOB_COSTS }}'
    - Policy: '{{ CCME_CLUSTER_POLICY_SNS }}'
    - Policy: 'arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore'
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      # If the role associated to the cluster is not authorized to use Route 53,
      # set "DisableManagedDns" to true.
      DisableManagedDns: False
  SlurmQueues:
    - Name: basic-slurm
      CapacityType: ONDEMAND
      ComputeSettings:
        LocalStorage:
          RootVolume:
            Size: 50
            Encrypted: true
      ComputeResources:
        - Name: t2-small
          InstanceType: t2.small
          MinCount: 0
          MaxCount: 2
      CustomActions:
        OnNodeStart:
          Script: s3://{{ CCME_SOURCES }}CCME/sbin/pre-install.sh
        OnNodeConfigured:
          Script: s3://{{ CCME_SOURCES }}CCME/sbin/post-install.sh
      Iam:
        S3Access:
          - BucketName: '{{ CCME_BUCKET }}'
          - EnableWriteAccess: true
            BucketName: '{{ CCME_DATA_BUCKET }}'
        AdditionalIamPolicies:
        - Policy: '{{ CCME_CLUSTER_POLICY_ALB }}'
        - Policy: '{{ CCME_CLUSTER_POLICY_DCV }}'
        - Policy: '{{ CCME_CLUSTER_POLICY_JOB_COSTS }}'
        - Policy: '{{ CCME_CLUSTER_POLICY_SNS }}'
        - Policy: 'arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore'
      Networking:
        SubnetIds:
          - '{{ CCME_SUBNET }}'
        SecurityGroups:
          - '{{ CCME_COMPUTE_SG }}'
    - Name: dcv-basic
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: t3-medium
          InstanceType: t3.medium
          MinCount: 0
          MaxCount: 2
      CustomActions:
        OnNodeStart:
          Script: s3://{{ CCME_SOURCES }}CCME/sbin/pre-install.sh
        OnNodeConfigured:
          Script: s3://{{ CCME_SOURCES }}CCME/sbin/post-install.sh
      Iam:
        S3Access:
          - BucketName: '{{ CCME_BUCKET }}'
          - EnableWriteAccess: true
            BucketName: '{{ CCME_DATA_BUCKET }}'
        AdditionalIamPolicies:
        - Policy: '{{ CCME_CLUSTER_POLICY_ALB }}'
        - Policy: '{{ CCME_CLUSTER_POLICY_DCV }}'
        - Policy: '{{ CCME_CLUSTER_POLICY_JOB_COSTS }}'
        - Policy: '{{ CCME_CLUSTER_POLICY_SNS }}'
        - Policy: 'arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore'
      Networking:
        SubnetIds:
          - '{{ CCME_SUBNET }}'
        SecurityGroups:
          - '{{ CCME_COMPUTE_SG }}'
DirectoryService:
  DomainName: {{ CCME_AD_DIR_NAME }}
  DomainAddr: ldap://{{ CCME_AD_IP_1 }},ldap://{{ CCME_AD_IP_2 }}
  PasswordSecretArn: {{ CCME_AD_PASSWORD }}
  DomainReadOnlyUser: cn={{ CCME_AD_READ_ONLY_USER }},ou=Users,ou={{ CCME_AD_DIR_NAME_DC1 }},dc={{ CCME_AD_DIR_NAME_DC1 }},dc={{ CCME_AD_DIR_NAME_DC2 }}
  LdapTlsReqCert: never # Set "hard" to enable ldaps
  # LdapTlsCaCert: /opt/CCME/conf/{{ CCME_CA_FILE }} # Set it only with ldaps
  # LdapAccessFilter:
  AdditionalSssdConfigs:
    # debug_level: "0x1ff" # Uncomment for logs, can be heavy
    ldap_auth_disable_tls_never_use_in_production: True # Don't set it with ldaps

Note

If in the configuration of your cluster you want to use an external resource, such as an EFS or FSx for NetApp file system that has not been deployed by CCME, you need to ensure that the targeted resource has at least one tag whose name starts with ccme. For security reasons, CCME roles only allow CCME to describe and use services that have such ccme* tags.

We recommend using an explicit tag name such as ccme:allow (the value is not important, but for readability reasons use a value such as true).

Without such a tag, you will get an error message when trying to launch the cluster. For example, when trying to attach an FSx for NetApp ONTAP file system without a ccme* tag, you can get an error like:

"message": "Invalid cluster configuration: User:
arn:aws:sts::123456789012:assumed-role/CRS-myrole-ParallelClusterUserRole-10T744D833QZH/i-0423d0720df91381b
is not authorized to perform: fsx:DescribeVolumes on resource: arn:aws:fsx:eu-west-3:123456789012:volume/*/*
because no identity-based policy allows the fsx:DescribeVolumes action"
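
To add such a tag to an existing file system, you can for example use the AWS CLI (the resource ARN below is a placeholder; adapt the command to your EFS or FSx resource):

aws fsx tag-resource \
  --resource-arn arn:aws:fsx:eu-west-3:123456789012:file-system/fs-0123456789abcdef0 \
  --tags Key=ccme:allow,Value=true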

Custom Scripts

On top of the CCME-specific configurations, you can integrate your own custom scripts into CCME. To deploy a cluster that embeds and executes your own custom scripts, you must place them in the CCME/custom directory and synchronize this directory to the S3 bucket. You can provide your own Ansible playbooks or Bash scripts to add specific configurations to the HeadNode, to the Compute Nodes, or to all nodes. Ansible playbooks and Bash scripts are executed in the following order (see the sketch after this list):

  1. install-*-all.yaml: run Ansible playbook on all nodes (Head and Compute nodes)

  2. install-*-head.yaml: run Ansible playbook on Head Node only

  3. install-*-compute.yaml: run Ansible playbook on Compute Nodes only

  4. install-*-all.sh: run Bash script on all nodes (Head and Compute nodes)

  5. install-*-head.sh: run Bash script on Head Node only

  6. install-*-compute.sh: run Bash script on Compute Nodes only
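
For example, the custom directory could contain the following files (the names are hypothetical; they only need to match the patterns above):

CCME/custom/install-10-all.yaml      (Ansible playbook executed on all nodes)
CCME/custom/install-20-head.sh       (Bash script executed on the Head Node only)
CCME/custom/install-30-compute.sh    (Bash script executed on the Compute Nodes only)

where install-10-all.yaml could be a standard Ansible playbook, assumed here to run locally on each node:

- hosts: localhost
  become: true
  tasks:
    - name: Install an additional package on every node
      ansible.builtin.package:
        name: htop
        state: present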

To update the CCME solution bucket from the Management Host, use the updateCCME.sh command.

$ ../../management/sbin/scripts/updateCCME.sh
updateCCME <S3BUCKET> <OPTIONAL: CCME.CONF> <OPTIONAL: AWS CREDENTIAL PROFILE>
 - S3BUCKET:  Name of the S3 bucket solution to upload CCME
 - CCME.CONF: Path to a ccme.conf file to replace the bucket
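
For example, to push the current CCME sources to a solution bucket (the bucket name below is a placeholder), run from the directory containing the script:

$ ./updateCCME.sh my-ccme-solution-bucket
$ ./updateCCME.sh my-ccme-solution-bucket /path/to/ccme.conf   # with an explicit ccme.conf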

Note

Using the updateCCME.sh command on a Management Host does not require specifying a ccme.conf file: it automatically uses the CCME configuration file from /opt/ccme/CCME/conf/ccme.conf.

Management

Note

The AWS region is a required parameter. It can be provided either with the --region option of the CLI or through the Region option of the ParallelCluster configuration file used with the command.

If the region is specified both on the command line and in the cluster configuration file, the value from the CLI takes priority.
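
For example, with the following command the cluster is created in eu-west-1, even if the configuration file contains Region: eu-west-3 (values here are purely illustrative):

pcluster create-cluster --cluster-name my-cluster --cluster-configuration ~/.parallelcluster/config --region eu-west-1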

Create cluster

To create a cluster, use the following command:

pcluster create-cluster --cluster-name "${cluster_name}" --cluster-configuration ~/.parallelcluster/"${configuration_file}" --region "${aws_region}"

Note

If you are creating your first clusters (or a first cluster in a new environment), it is strongly recommended to create the cluster in debug mode by setting the rollback-on-failure pcluster parameter to false with --rollback-on-failure false, as shown in the command below.

pcluster create-cluster --rollback-on-failure false --cluster-name "${cluster_name}" --cluster-configuration ~/.parallelcluster/"${configuration_file}" --region "${aws_region}"

Delete cluster

pcluster delete-cluster --cluster-name "${cluster_name}" --region "${aws_region}"

List clusters

pcluster list-clusters --region "${aws_region}"
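
To follow the deployment of a specific cluster (for example, while it is being created or deleted), you can also use the describe-cluster subcommand:

pcluster describe-cluster --cluster-name "${cluster_name}" --region "${aws_region}"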

Connect to the clusters

The possibilities to connect to a deployed cluster are:

  • Use the administrator account (sudoer)

    • centos or ec2-user depending on the selected OS

    • The SSH key associated with this user is the one specified (Ssh.KeyName) in the cluster configuration file at deployment

  • Use any user from the Active Directory who is authorized to connect to the cluster

    • With the username + password pair

    • With the username + SSH key: the SSH key is available in the user's home directory (~/.ssh/) after a first username + password authentication
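
For example, to connect as the administrator account using the SSH key declared in the cluster configuration (the key path, user and HeadNode address below are placeholders; use centos or ec2-user depending on the OS):

ssh -i /path/to/{{ AWS_KEYNAME }}.pem ec2-user@<headnode-ip-address>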