SCIAMA
High Performance Compute Cluster
Scheduled Maintenance and Outages
May
We have been experiencing hardware issues with both head nodes on SCIAMA, which have caused a couple of short outages. I have been working on stabilising the two head nodes to reduce the impact. The new head nodes are being built and configured with a new OS, AlmaLinux 8.
March/April
SCIAMA needs to be upgraded to a new operating system. It currently runs CentOS 7.9, which will be out of support next year. We have new servers to install, which gives us an opportunity to rebuild SCIAMA with a newer OS. To install the new hardware we need to decommission older compute nodes and remove them from the racks in the data centre. This means the number of compute nodes, and therefore cores, available in sciama2.q and sciama3.q will be reduced. This work will begin later this month.
24th January
We have a critical hardware failure on SCIAMA, and the failed hardware needs urgent replacement. This requires us to stop Lustre. This work was rescheduled from last week's attempt. During the outage you will not be able to read data from or write data to /mnt/lustre.
January/February
IS are upgrading the Cisco network switches that SCIAMA uses to connect to the outside world. This upgrade will disrupt connections to SCIAMA's login nodes and JupyterHub server for a few minutes. Completed week commencing 21st February.
5th September
We have a critical hardware failure on SCIAMA, and the failed hardware needs urgent replacement. This requires us to stop Lustre and take the users' home directories offline. We have scheduled this for Monday 5th September. During the outage you will not be able to log on to SCIAMA or read data from or write data to /mnt/lustre, so we suggest you hold off running any further jobs until afterwards.
25th July
SCIAMA is currently experiencing outages. Thank you for your patience while we work to restore the system.
20th June - 25th June
We will be scheduling regular maintenance windows for SCIAMA to carry out any updates to the OS or applications. The first one is scheduled to start on 20th June and last for a week. During this period SCIAMA will be unavailable, and any jobs submitted to SLURM that have not completed will be cancelled!
Important updates to the OS and SLURM will take place during this maintenance window.
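Because unfinished jobs are cancelled when the window opens, it can help to check whether a job's requested walltime fits before the window starts. The following is a minimal sketch of that arithmetic only. The function name `finishes_before`, the year and the 09:00 start time are illustrative assumptions and are not taken from this notice.

```python
from datetime import datetime, timedelta


def finishes_before(job_start: datetime, walltime: timedelta,
                    window_start: datetime) -> bool:
    """Return True if a job starting at job_start with the requested
    walltime would end at or before the maintenance window opens."""
    return job_start + walltime <= window_start


# Assumed values for illustration: the notice gives the date (20th June)
# but not the year or the time of day the window opens.
window_start = datetime(2022, 6, 20, 9, 0)

# A 24-hour job started two days earlier ends in time.
print(finishes_before(datetime(2022, 6, 18, 9, 0), timedelta(hours=24), window_start))   # True

# A 48-hour job started the day before would still be running.
print(finishes_before(datetime(2022, 6, 19, 12, 0), timedelta(hours=48), window_start))  # False
```

This only compares the requested walltime against the window. It does not account for time a job spends waiting in the queue before it starts.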
13th April - 15th April
SCIAMA will need to be rebooted to mount the Lustre storage and to apply updates to the SLURM scheduler and the users' home directories. SCIAMA will not be available during this period.
24th - 25th March 2022
Urgent maintenance is required on one of the SCIAMA racks. Nodes 224-247 will need to be powered down whilst this maintenance takes place.
I have begun the process of draining the nodes. This affects sciama3.q, where only 24 nodes will be available, so please bear this in mind when submitting your jobs.
If your job is still running on Monday it may get cancelled. Apologies for any inconvenience, but the work is necessary.