Cluster deployment
Deploy Cluster in debug mode
To debug the deployment of a cluster, add --rollback-on-failure false to disable rollback in CloudFormation:
$ pcluster create-cluster --cluster-name ClusterName --cluster-configuration configuration_file.yaml --region $REGION --rollback-on-failure false
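Because rollback is disabled, the failed stack and its resources are kept until you delete them yourself. Once you have finished debugging, remove the cluster with the standard delete command (ClusterName and $REGION match the example above):
$ pcluster delete-cluster --cluster-name ClusterName --region $REGION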
See deployment logs
Installation logs location: /var/log/
You can also see the logs in AWS CloudWatch under your cluster name. For more information, see AWS ParallelCluster in CloudWatch.
CCME logs are also present in the same CloudWatch group: search for ccme. in the log group to find all CCME logs.
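For example, you can list the CCME log streams from the command line with the AWS CLI (the log group name below is a placeholder; use the one created for your cluster):
$ aws logs describe-log-streams --log-group-name /aws/parallelcluster/<cluster_name>-<StackId> --query 'logStreams[].logStreamName' --output text | tr '\t' '\n' | grep ccme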
Accessing CCME logs
All logs are available on the Headnode and compute nodes of the cluster in /var/log, and also in the AWS CloudWatch LogGroup created with the cluster (see Amazon CloudWatch Logs cluster logs).
The CloudWatch LogGroup /aws/parallelcluster/<cluster_name>-<StackId> includes both the AWS ParallelCluster logs and the CCME logs.
The retention periods, in days, are:
For the CloudWatch LogGroup: the retention period is defined by the ccme_logs_retention_in_days CMH parameter; the default value is 14 days.
For the Instances: the retention period is 7 days.
All CCME logs start with the ccme. prefix. For example, to understand why the pre-install script failed to run, look at /var/log/ccme.pre-install.log; for the post-install script, look at /var/log/ccme.post-install.log.
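For instance, on the Headnode you can quickly check the end of the pre-install log (a simple illustration; any pager or log viewer works as well):
$ sudo tail -n 50 /var/log/ccme.pre-install.log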
Prevent compute nodes from being killed when there is a problem
AWS ParallelCluster will kill any compute node that has an issue.
If this happens too often, or at every startup of a node, check the logs in /var/log/ccme*.log to identify the problem.
To prevent nodes from being killed, log in to your AWS console, go to EC2, and identify the instance you want to protect. Then apply the following settings to the instance (Actions / Instance settings); a command-line equivalent is shown after the list:
Change termination protection: Enable
Change stop protection: Enable
Change shutdown behavior: Stop
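The same protections can be set with the AWS CLI. This is a minimal sketch, assuming a recent AWS CLI and i-0123456789abcdef0 as a placeholder for the instance identified above:
$ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --disable-api-termination
$ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --disable-api-stop
$ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --attribute instanceInitiatedShutdownBehavior --value stop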
Accessing Slurm logs
Slurm logs are available on the Headnode of the cluster in /var/log/slurmctld.log, and on the Compute Nodes of the cluster in /var/log/slurmd.log.
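For example, to follow the scheduler log on the Headnode while reproducing an issue (tail is just one convenient option):
$ sudo tail -f /var/log/slurmctld.log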
Accessing EnginFrame through Application Load Balancer
The Application Load Balancer (ALB) has a timeout value of 300 seconds. This timeout can be reached when uploading or downloading a large amount of data.
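If users regularly hit this limit, you can raise the idle timeout on the ALB. A minimal sketch with the AWS CLI, assuming $ALB_ARN holds the ARN of the cluster's load balancer and 1800 seconds as an example value:
$ aws elbv2 modify-load-balancer-attributes --load-balancer-arn $ALB_ARN --attributes Key=idle_timeout.timeout_seconds,Value=1800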
Users see "user (USER1) is not authorized to modify attributes of spooler" errors in EnginFrame portal
Whenever a user tries to submit a job or a VDI session, they see the following error:
user (USER1) is not authorized to modify attributes of spooler
If you see errors similar to the following in the ef.log file, then you may have an issue with the users' identities as seen by EnginFrame.
2024/Jul/01 15:56.24 ERROR TID [62707] USER1 EnginFrameServlet.doService(EnginFrameServlet.java:234): handling request
java.lang.NullPointerException: Neither key nor value can be loaded as null. [mapName: spooler.map, key: spooler:///shared/nice/spoolers/USER2/tmp11000975990629747030.session.ef, value: null]
You can first try to restart EnginFrame, which usually fixes the issue: systemctl restart enginframe.
It sometimes happens that you need to clear the SSSD cache, as described here.
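As a sketch, the whole SSSD cache can be invalidated with the sss_cache tool shipped with SSSD (the -E flag invalidates all cached entries; refer to the procedure linked above for your exact case):
$ sudo sss_cache -E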
You can clean up the persistence data for a specific user by following this procedure, which requires a short period of unavailability of the EnginFrame service (a scripted sketch is shown after the list):
make the user close all his/her sessions
identify all files belonging to [USERNAME] in the repository, sessions and spoolers directories:
ls /opt/nice/enginframe/{repository,sessions,spoolers}/[USERNAME]/*
stop EnginFrame: systemctl stop enginframe
remove all files identified by the previous ls command
start EnginFrame: systemctl start enginframe
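Put together, the steps can be scripted as follows. This is a minimal sketch: USER1 is a placeholder for the actual username, and the user must have closed all of his/her sessions first:
$ systemctl stop enginframe
$ rm -rf /opt/nice/enginframe/{repository,sessions,spoolers}/USER1/*
$ systemctl start enginframe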
Slow instance startup
To speed up instance startup, create a CCME AMI.
If possible, do not resize the EBS ComputeSettings.LocalStorage.RootVolume.Size, because a growpart phase will then occur during boot, and it can take several minutes to finish.
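For reference, this setting lives in the AWS ParallelCluster configuration file. A sketch with placeholder values, where leaving Size commented out keeps the AMI's default size and avoids the growpart step:
Scheduling:
  SlurmQueues:
    - Name: queue1
      ComputeSettings:
        LocalStorage:
          RootVolume:
            # Size: 100  # resizing triggers a growpart phase at boot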