» Telemetry Overview

The Nomad client and server agents collect a wide range of runtime metrics related to the performance of the system. On the server side, leaders and followers have metrics in common as well as metrics that are specific to their roles. Clients have separate metrics for the host and for allocations/tasks, both of which have to be explicitly enabled. There are also runtime metrics that are common to all servers and clients.

By default, the Nomad agent collects telemetry data at a 1 second interval. Nomad supports gauges, counters, and timers.

There are three ways to obtain metrics from Nomad:

  • Query the /metrics API endpoint to return metrics for the current Nomad process (as of Nomad 0.7). This endpoint supports Prometheus formatted metrics.
  • Send the USR1 signal to the Nomad process. This will dump the current telemetry information to STDERR (on Linux).
  • Configure Nomad to automatically forward metrics to a third-party provider.

Nomad 0.7 added support for tagged metrics, improving the integrations with DataDog and Prometheus. Metrics can also be forwarded to Statsite, StatsD, and Circonus.
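
To illustrate the first option, the following is a minimal sketch of reading Prometheus-formatted output once it has been fetched from the metrics endpoint. The sample text, the underscore-separated metric names, and the node_id and datacenter tags are assumptions made for illustration, not output captured from a real agent. The parser is deliberately simplified and does not handle escaped quotes inside label values.

```python
import re

# Simplified patterns for the Prometheus text format:
#   name{label="value",...} value [timestamp]
_SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_prometheus_text(text):
    """Return {metric_name: [(labels_dict, value), ...]} for each sample line."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # Ignore lines this simplified parser cannot read.
        name, label_text, value = match.groups()
        labels = dict(_LABEL_RE.findall(label_text or ""))
        samples.setdefault(name, []).append((labels, float(value)))
    return samples


# Hypothetical sample; real output will differ in names, tags, and values.
SAMPLE_TEXT = """\
# TYPE nomad_runtime_num_goroutines gauge
nomad_runtime_num_goroutines 142
# TYPE nomad_client_allocated_cpu gauge
nomad_client_allocated_cpu{node_id="abc123",datacenter="dc1"} 500
"""

if __name__ == "__main__":
    for name, entries in parse_prometheus_text(SAMPLE_TEXT).items():
        for labels, value in entries:
            print(name, labels, value)
```

The tags attached to each sample are what make per-node or per-datacenter aggregation possible in a tagged-metrics backend.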

» Alerting

The recommended practice for alerting is to leverage the alerting capabilities of your monitoring provider. Nomad’s intention is to surface metrics that enable users to configure the necessary alerts using their existing monitoring systems as a scaffold, rather than to natively support alerting. Here are a few common patterns:

  • Export metrics from Nomad to Prometheus using the StatsD exporter, define alerting rules in Prometheus, and use Alertmanager for summarization and routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is supported for Datadog.

  • Periodically submit test jobs into Nomad to determine if your application deployment pipeline is working end-to-end. This pattern is well-suited to batch processing workloads.

  • Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios monitor when a new Nomad job is added. When a job is removed, remove the Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a job-specific alerting system.

  • Write a script that looks at the history of each batch job to determine whether or not the job is in an unhealthy state, updating your monitoring system as appropriate. In many cases, it may be acceptable for a given batch job to fail occasionally, as long as it goes back to passing.
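
The last pattern can be sketched roughly as follows. This is only an illustration: the history is assumed to be a list of pass/fail results that you have already collected, oldest first, and the threshold of three consecutive failures is an arbitrary choice rather than a Nomad default.

```python
def consecutive_failures(history):
    """Count failed runs at the end of the history (oldest-first list of bools)."""
    streak = 0
    for succeeded in reversed(history):
        if succeeded:
            break
        streak += 1
    return streak


def batch_job_unhealthy(history, max_consecutive_failures=3):
    """Treat a job as unhealthy only when it keeps failing without recovering.

    An occasional failure followed by a passing run resets the streak, so it
    does not trigger an alert.
    """
    return consecutive_failures(history) >= max_consecutive_failures


if __name__ == "__main__":
    flaky_but_recovering = [True, False, True, True, False, True]
    stuck_failing = [True, True, False, False, False]
    print(batch_job_unhealthy(flaky_but_recovering))
    print(batch_job_unhealthy(stuck_failing))
```

The result of batch_job_unhealthy would then be pushed to whichever monitoring system you already use.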

» Key Performance Indicators

The sections below cover a number of important metrics.

» Consensus Protocol (Raft)

Nomad uses the Raft consensus protocol for leader election and state replication. Spurious leader elections can be caused by networking issues between the servers or by insufficient CPU resources. Users in cloud environments often bump their servers up to the next instance class with improved networking and CPU to stabilize leader elections. The nomad.raft.leader.lastContact metric is a general indicator of Raft latency, which can be used to observe how Raft timing is performing and to guide the decision to upgrade to more powerful servers. nomad.raft.leader.lastContact should not get too close to the leader lease timeout of 500ms.
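
One way to turn this guidance into an alert condition is sketched below. The 500ms lease timeout comes from the text above; the warning and critical fractions (50% and 80% of the timeout) are assumptions chosen for illustration and should be tuned to your environment.

```python
LEADER_LEASE_TIMEOUT_MS = 500.0  # Leader lease timeout stated in this guide.


def last_contact_status(last_contact_ms, warn_fraction=0.5, critical_fraction=0.8):
    """Classify a nomad.raft.leader.lastContact reading (in milliseconds).

    The fractions are hypothetical thresholds expressed relative to the
    leader lease timeout.
    """
    ratio = last_contact_ms / LEADER_LEASE_TIMEOUT_MS
    if ratio >= critical_fraction:
        return "critical"
    if ratio >= warn_fraction:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for reading in (40.0, 260.0, 450.0):
        print(reading, last_contact_status(reading))
```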

» Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf library to maintain a single, global gossip pool for all servers in a federated deployment. An uptick in member.flap and/or msg.suspect is a reliable indicator that membership is unstable.

» Scheduling

The following metrics allow an operator to observe changes in throughput at the various points in the scheduling process (evaluation, scheduling/planning, and placement):

  • nomad.broker.total_blocked - The number of blocked evaluations.
  • nomad.worker.invoke_scheduler.<type> - The time to run the scheduler of the given type.
  • nomad.plan.evaluate - The time to evaluate a scheduler Plan.
  • nomad.plan.submit - The time to submit a scheduler Plan.
  • nomad.plan.queue_depth - The number of scheduler Plans waiting to be evaluated.

Upticks in any of the above metrics indicate a decrease in scheduler throughput.
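
What counts as an uptick depends on the cluster. As a minimal sketch, one approach is to compare the average of the most recent samples with the average of the earlier ones. The window size and the factor of 2 below are assumptions for illustration, not recommended values.

```python
from statistics import mean


def has_uptick(samples, recent=3, factor=2.0):
    """Return True when the recent average is at least `factor` times the baseline.

    `samples` is an oldest-first list of readings for one metric, for example
    nomad.plan.queue_depth collected at a fixed interval.
    """
    if len(samples) <= recent:
        return False  # Not enough history to form a baseline.
    baseline = mean(samples[:-recent])
    current = mean(samples[-recent:])
    if baseline == 0:
        return current > 0
    return current >= factor * baseline


if __name__ == "__main__":
    steady = [10, 10, 10, 10, 11, 10, 12]
    rising = [10, 10, 10, 10, 30, 40, 50]
    print(has_uptick(steady))
    print(has_uptick(rising))
```

In practice the same comparison is usually expressed as an alerting rule in the monitoring provider rather than in a standalone script.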

» Capacity

The importance of monitoring resource availability is workload specific. Batch processing workloads often operate under the assumption that the cluster should be at or near capacity, with queued jobs running as soon as adequate resources become available. Clusters that are primarily responsible for long-running services with an uptime requirement may want to maintain headroom of 20% or more. The following metrics can be used to assess capacity across the cluster on a per-client basis:

  • nomad.client.allocated.cpu
  • nomad.client.unallocated.cpu
  • nomad.client.allocated.disk
  • nomad.client.unallocated.disk
  • nomad.client.allocated.iops
  • nomad.client.unallocated.iops
  • nomad.client.allocated.memory
  • nomad.client.unallocated.memory
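
The headroom calculation can be sketched from each allocated/unallocated pair as follows. The input dictionary layout and the node names are hypothetical; the 20% default mirrors the figure mentioned above for long-running services.

```python
RESOURCES = ("cpu", "disk", "iops", "memory")


def headroom(allocated, unallocated):
    """Fraction of a resource that is still unallocated on one client."""
    total = allocated + unallocated
    if total == 0:
        return 0.0
    return unallocated / total


def clients_below_headroom(clients, minimum=0.20):
    """Return (node, resource, fraction) for every resource under `minimum`.

    `clients` maps a node name to a dict with "allocated.<resource>" and
    "unallocated.<resource>" keys (a hypothetical layout for this sketch).
    """
    findings = []
    for node, metrics in clients.items():
        for resource in RESOURCES:
            allocated = metrics.get("allocated." + resource)
            unallocated = metrics.get("unallocated." + resource)
            if allocated is None or unallocated is None:
                continue  # Metric not reported for this client.
            fraction = headroom(allocated, unallocated)
            if fraction < minimum:
                findings.append((node, resource, fraction))
    return findings


if __name__ == "__main__":
    example = {
        "node-a": {"allocated.cpu": 3000, "unallocated.cpu": 1000},
        "node-b": {"allocated.memory": 7680, "unallocated.memory": 512},
    }
    for finding in clients_below_headroom(example):
        print(finding)
```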

» Task Resource Consumption

The metrics listed here can be used to track resource consumption on a per-task basis. For user-facing services, it is common to alert when CPU usage is at or above the resources reserved for the task.

» Job and Task Status

We do not currently surface metrics for job and task/allocation status, although we will consider adding metrics where it makes sense.

» Runtime Metrics

Runtime metrics apply to all clients and servers. The following metrics are general indicators of load and memory pressure:

  • nomad.runtime.num_goroutines
  • nomad.runtime.heap_objects
  • nomad.runtime.alloc_bytes

We recommend alerting on upticks in any of the above, server memory usage in particular.