The Nomad client and server agents collect a wide range of runtime metrics related to the performance of the system. Operators can use this data to gain real-time visibility into their cluster and improve performance. Additionally, Nomad operators can set up monitoring and alerting based on these metrics in order to respond to any changes in the cluster state.
On the server side, leaders and followers share some metrics and also expose metrics specific to their roles. Clients expose separate metrics for the host and for allocations/tasks, both of which have to be explicitly enabled. There are also runtime metrics that are common to all servers and clients.
There are three ways to obtain metrics from Nomad:
- Query the /v1/metrics API endpoint to return metrics for the current Nomad process (as of Nomad 0.7). This endpoint supports Prometheus formatted metrics.
- Send the USR1 signal to the Nomad process. This will dump the current telemetry information to STDERR (on Linux).
- Configure Nomad to automatically forward metrics to a third-party provider.
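As an illustration of the first option, the following is a minimal sketch of polling the metrics endpoint and collecting gauge values by name. It assumes the agent's HTTP API is reachable at http://127.0.0.1:4646, that the endpoint path is /v1/metrics, and that the JSON response carries a "Gauges" list whose entries have "Name" and "Value" fields; verify all three against your Nomad version. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: default local agent address; adjust for your cluster.
NOMAD_ADDR = "http://127.0.0.1:4646"


def fetch_metrics(addr=NOMAD_ADDR, timeout=5.0):
    """Fetch the JSON metrics summary from a Nomad agent.

    Assumes the metrics endpoint is served at /v1/metrics.
    """
    with urllib.request.urlopen(f"{addr}/v1/metrics", timeout=timeout) as resp:
        return json.load(resp)


def gauges_by_name(summary):
    """Group gauge values by metric name.

    Assumes each entry in summary["Gauges"] has "Name" and "Value" keys.
    A name can appear more than once (one entry per label set), so the
    values are collected into a list.
    """
    result = {}
    for gauge in summary.get("Gauges") or []:
        result.setdefault(gauge["Name"], []).append(gauge["Value"])
    return result
```

A caller could run `gauges_by_name(fetch_metrics())` on a schedule and hand the result to whatever monitoring system is already in place.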
The recommended practice for alerting is to leverage the alerting capabilities of your monitoring provider. Nomad’s intention is to surface metrics that enable users to configure the necessary alerts using their existing monitoring systems as a scaffold, rather than to natively support alerting. Here are a few common patterns:
- Export metrics from Nomad to Prometheus using the StatsD exporter, define alerting rules in Prometheus, and use Alertmanager for summarization and routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is supported for Datadog.
- Periodically submit test jobs into Nomad to determine if your application deployment pipeline is working end-to-end. This pattern is well-suited to batch processing workloads.
- Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios monitor when a new Nomad job is added. When a job is removed, remove the Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a job-specific alerting system.
- Write a script that looks at the history of each batch job to determine whether or not the job is in an unhealthy state, updating your monitoring system as appropriate. In many cases, it may be acceptable for a given batch job to fail occasionally, as long as it goes back to passing.
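The batch job history check in the last pattern could be sketched as follows. This is only one possible heuristic: it treats a job as unhealthy when its most recent runs have all failed, so an occasional failure followed by a success does not raise an alert. The function name, the boolean history representation, and the default threshold of three are assumptions made for illustration; how the history is obtained from Nomad is left out.

```python
def is_unhealthy(history, max_consecutive_failures=3):
    """Decide whether a batch job looks unhealthy.

    history: list of booleans ordered oldest first, where True means
    the run succeeded and False means it failed (hypothetical format).
    Returns True when the trailing run of failures reaches the threshold.
    """
    streak = 0
    # Walk backwards from the most recent run until a success is found.
    for succeeded in reversed(history):
        if succeeded:
            break
        streak += 1
    return streak >= max_consecutive_failures
```

With this rule, a history of `[True, False, True]` is considered healthy, while `[True, False, False, False]` is flagged.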
»Key Performance Indicators
The sections below cover a number of important metrics.
»Consensus Protocol (Raft)
Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking issues
between the servers or insufficient CPU resources. Users in cloud environments
often bump their servers up to the next instance class with improved networking
and CPU to stabilize leader elections. The nomad.raft.leader.lastContact metric
is a general indicator of Raft latency which can be used to observe how Raft
timing is performing and guide the decision to upgrade to more powerful servers.
nomad.raft.leader.lastContact should not get too close to the leader lease
timeout of 500ms.
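One way to turn this guidance into an alert level is sketched below. The 500ms lease timeout comes from the text above; the warning fraction of 50% is an arbitrary assumption for illustration, as are the function and constant names. The input is assumed to be a lastContact reading already converted to milliseconds.

```python
# Leader lease timeout stated in the text above.
LEADER_LEASE_TIMEOUT_MS = 500.0


def last_contact_status(last_contact_ms, warn_fraction=0.5):
    """Classify a lastContact reading relative to the lease timeout.

    warn_fraction is a hypothetical tuning knob: readings at or above
    warn_fraction * timeout are reported as "warning".
    """
    if last_contact_ms >= LEADER_LEASE_TIMEOUT_MS:
        return "critical"
    if last_contact_ms >= warn_fraction * LEADER_LEASE_TIMEOUT_MS:
        return "warning"
    return "ok"
```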
»Federated Deployments (Serf)
Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in
msg.suspect is a reliable indicator
that membership is unstable.
»Scheduling
The following metrics allow an operator to observe changes in throughput at the various points in the scheduling process (evaluation, scheduling/planning, and placement):
- nomad.broker.total_blocked - The number of blocked evaluations.
- nomad.worker.invoke_scheduler.<type> - The time to run the scheduler of the given type.
- nomad.plan.evaluate - The time to evaluate a scheduler Plan.
- nomad.plan.submit - The time to submit a scheduler Plan.
- nomad.plan.queue_depth - The number of scheduler Plans waiting to be evaluated.
Upticks in any of the above metrics indicate a decrease in scheduler throughput.
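What counts as an uptick depends on the cluster. A minimal sketch of one simple rule is shown below: compare the latest sample of a metric against the mean of the samples just before it. The window size, the multiplier, and the function name are assumptions chosen for illustration, not values recommended by Nomad.

```python
from statistics import mean


def has_uptick(samples, window=5, factor=2.0):
    """Return True if the latest sample exceeds factor x the recent baseline.

    samples: numeric readings of one metric, oldest first.
    The baseline is the mean of the `window` samples preceding the latest.
    """
    if len(samples) < window + 1:
        return False  # not enough history to form a baseline
    baseline = mean(samples[-(window + 1):-1])
    latest = samples[-1]
    if baseline == 0:
        # Any non-zero value is an increase over an all-zero baseline.
        return latest > 0
    return latest > factor * baseline
```

The same function can be applied to each of the metrics listed above, with a separate series of samples per metric.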
»Capacity
The importance of monitoring resource availability is workload specific. Batch processing workloads often operate under the assumption that the cluster should be at or near capacity, with queued jobs running as soon as adequate resources become available. Clusters that are primarily responsible for long-running services with an uptime requirement may want to maintain headroom at 20% or more. The following metrics can be used to assess capacity across the cluster on a per-client basis:
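As a rough illustration of the 20% headroom guideline, the sketch below computes the unallocated share of a single resource on a client and compares it with a target. It assumes you already have allocated and unallocated figures for the resource in the same unit; the function names and the way those figures are obtained are hypothetical.

```python
def headroom_fraction(allocated, unallocated):
    """Fraction of a client's resource that is still unallocated."""
    total = allocated + unallocated
    if total <= 0:
        raise ValueError("client reports no capacity")
    return unallocated / total


def below_headroom(allocated, unallocated, target=0.20):
    """True when free capacity has dropped below the target fraction."""
    return headroom_fraction(allocated, unallocated) < target
```

For a batch-oriented cluster that is expected to run near capacity, the target can be lowered or the check skipped altogether.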
»Task Resource Consumption
The metrics listed here can be used to track resource consumption on a per-task basis. For user-facing services, it is common to alert when CPU usage is at or above the resources reserved for the task.
»Job and Task Status
We do not currently surface metrics for job and task/allocation status, although we will consider adding metrics where it makes sense.
»Runtime Metrics
Runtime metrics apply to all clients and servers. The following metrics are general indicators of load and memory pressure:
It is recommended to alert on upticks in any of the above, server memory usage in particular.