The Cluster Monitoring Toolkit
This toolkit provides two main tools for monitoring the state and usage of your cluster: Ganglia and OpenXDMoD.
Ganglia is used primarily to monitor load over time, and is useful for at-a-glance visualization of the current state of the cluster.
OpenXDMoD is used to aggregate scheduler data into reports and to investigate usage metrics across departments, projects, or PIs. This can be very useful when applying for additional funding or working to improve the general health of your research computing program.
Ansible Playbook
The main playbook is split into two roles, one for Ganglia and one for OpenXDMoD.
They are selected with the usual -t flag to ansible-playbook:
ansible-playbook -t ganglia,openxdmod clustermon.yml
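Because each role has its own tag, either one can also be run on its own. The sketch below wraps the command above in a small helper. The function name run_clustermon and the DRY_RUN variable are hypothetical, introduced here only for illustration; they are not part of the toolkit.

```shell
# Hypothetical helper (not part of the toolkit): run clustermon.yml for the
# given comma-separated tags. Set DRY_RUN=1 to print the command instead of
# executing it.
run_clustermon() {
    tags="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ansible-playbook -t ${tags} clustermon.yml"
    else
        ansible-playbook -t "${tags}" clustermon.yml
    fi
}

# Examples:
#   run_clustermon ganglia             # only the Ganglia role
#   run_clustermon openxdmod           # only the OpenXDMoD role
#   DRY_RUN=1 run_clustermon ganglia   # show the command without running it
```

Running one role at a time can be convenient when only one of the two tools needs to be installed or updated.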
The Ganglia role installs the Ganglia meta daemon (gmetad) and the
Ganglia monitor daemon (gmond) on the headnode, and adds gmond
to the compute node images. Because the images change, the compute nodes must be
rebooted after this installation. The web app is also installed on the headnode, under
a name-based virtual host. The servername is configured under group_vars/all.
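After the Ganglia role has run, it can be worth confirming that both daemons are up on the headnode. The following is a minimal sketch, assuming a systemd-based headnode and that the services are named gmetad and gmond. Both are assumptions typical of packaged Ganglia and are not confirmed by this toolkit. The function name check_ganglia is hypothetical.

```shell
# Report whether each Ganglia daemon is active on the headnode.
# Assumes systemd and the service names gmetad and gmond.
check_ganglia() {
    for svc in gmetad gmond; do
        if systemctl is-active --quiet "$svc" 2>/dev/null; then
            echo "$svc: active"
        else
            echo "$svc: not active"
        fi
    done
}

check_ganglia
```

The same check for gmond alone can be repeated on a compute node once it has been rebooted.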
The OpenXDMoD role installs the XDMoD web app on the headnode, under a name-based
virtual host. The servername is configured under group_vars/all. This role also
creates the xdmod MySQL user and sets up regular cron jobs for shredding and
ingesting data from the scheduler (Slurm by default). For more details on
XDMoD installation, please see the official XDMoD Configuration Guide.
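The exact cron entries are set by the role and are not reproduced here. As a rough sketch of what such a job does for Slurm, and of how to trigger the same steps by hand (for example, to populate data right after installation), the snippet below assumes the standard XDMoD command-line tools xdmod-slurm-helper and xdmod-ingestor are on the PATH. The resource name mycluster is a placeholder, not a value taken from this toolkit.

```shell
# Placeholder resource name; use the name your cluster is registered under
# in XDMoD.
RESOURCE="mycluster"

# Shred Slurm accounting data for the resource, then ingest it into the
# XDMoD data warehouse. Only runs where the XDMoD tools are installed.
if command -v xdmod-slurm-helper >/dev/null 2>&1; then
    xdmod-slurm-helper -r "$RESOURCE" && xdmod-ingestor
else
    echo "XDMoD tools not found; run this on the headnode" >&2
fi
```

Shredding must complete before ingestion, which is why the two commands are chained with &&.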
Configuration
Ganglia
Ganglia requires little configuration out of the box, but there are many optional customizations available for the web interface. For more information, see the Ganglia documentation.
OpenXDMoD
After the initial installation, you will want to create an admin user for OpenXDMoD. This account applies only to the web interface, where you can manage XDMoD users and generate detailed reports.
Run the xdmod-setup utility and select 5 to create an admin user. You
will then be able to log in to the web interface with this account.