» Telemetry

The Nomad agent collects various runtime metrics about the performance of different libraries and subsystems. These metrics are aggregated on a ten-second interval and are retained for one minute.

This data can be accessed via an HTTP endpoint or by sending a signal to the Nomad process.

Via HTTP, as of Nomad version 0.7, this data is available at /metrics. See Metrics for more information.
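As a rough illustration of reading this data over HTTP, the sketch below fetches the endpoint and groups gauges by name. It assumes the agent listens on `http://127.0.0.1:4646`, that the endpoint is served under the API's `/v1` prefix as `/v1/metrics`, and that the JSON payload has a `Gauges` list whose entries carry `Name`, `Value` and `Labels`. None of these details are stated on this page, so check them against your agent before relying on them.

```python
import json
import urllib.request


def metrics_url(agent_addr: str = "http://127.0.0.1:4646") -> str:
    """Build the metrics URL. The address and the /v1 prefix are assumptions."""
    return agent_addr.rstrip("/") + "/v1/metrics"


def fetch_metrics(agent_addr: str = "http://127.0.0.1:4646", timeout: float = 5.0) -> dict:
    """Fetch the metrics payload from the agent and decode it as JSON."""
    with urllib.request.urlopen(metrics_url(agent_addr), timeout=timeout) as resp:
        return json.load(resp)


def gauges_by_name(payload: dict) -> dict:
    """Group gauge entries by metric name as lists of (labels, value) pairs.

    The "Gauges", "Name", "Value" and "Labels" keys are the assumed payload shape.
    """
    grouped = {}
    for gauge in payload.get("Gauges") or []:
        labels = gauge.get("Labels") or {}
        grouped.setdefault(gauge["Name"], []).append((labels, gauge["Value"]))
    return grouped
```

Calling `gauges_by_name(fetch_metrics())` against a running agent would then give one entry per gauge name, with one pair per label combination.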

To view this data by sending a signal to the Nomad process, use USR1 on Unix or BREAK on Windows. Once Nomad receives the signal, it dumps the current telemetry information to the agent's stderr.
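On Unix, the signal can be sent from a script as well as from a shell. The helper below is a minimal sketch that covers only the Unix case. The function name is hypothetical, and the caller is assumed to already know the agent's PID.

```python
import os
import signal


def request_telemetry_dump(pid: int) -> None:
    """Send USR1 to the process with the given PID (Unix only).

    The dump itself appears on the target process's stderr, not in this script.
    """
    os.kill(pid, signal.SIGUSR1)
```

Because the dump goes to the agent's stderr, the output has to be read from wherever that stream is captured, for example the service manager's log.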

This telemetry information can be used for debugging or otherwise getting a better view of what Nomad is doing.

Telemetry information can be streamed to both statsite and statsd by providing the appropriate configuration options.

To configure the telemetry output, please see the agent configuration.

Below is sample output of a telemetry dump:

[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
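A dump like the one above is regular enough to parse line by line. The following sketch is based only on the sample shown here. It reads the `[G]`, `[C]` and `[S]` markers as gauge, counter and sample, which is an interpretation rather than something this page states, and the function and field names are illustrative.

```python
import re

# Matches: [timestamp][K] 'metric.name': rest-of-line
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\[(?P<kind>[GCS])\] '(?P<name>[^']+)': (?P<rest>.*)$"
)
# Matches "Key: 1.234" pairs such as "Count: 3" or "Mean: 0.037"
FIELD_RE = re.compile(r"(\w+): (-?\d+(?:\.\d+)?)")

# Assumed meaning of the one-letter markers in the dump
KINDS = {"G": "gauge", "C": "counter", "S": "sample"}


def parse_dump_line(line: str):
    """Parse one telemetry dump line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rest = match.group("rest")
    if match.group("kind") == "G":
        # Gauge lines carry a single bare number
        fields = {"Value": float(rest)}
    else:
        # Counter and sample lines carry "Key: number" pairs
        fields = {key: float(value) for key, value in FIELD_RE.findall(rest)}
    return {
        "timestamp": match.group("timestamp"),
        "kind": KINDS[match.group("kind")],
        "name": match.group("name"),
        "fields": fields,
    }
```

Feeding each line of the agent's stderr through `parse_dump_line` and discarding `None` results skips ordinary log lines that share the stream.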

» Key Metrics

When telemetry is streamed to statsite or statsd, the interval is their flush interval. Otherwise, when metrics are retrieved using the signals described above, the interval can be assumed to be 10 seconds.
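Since counters are reported per interval, converting a count to a per-second rate only requires dividing by the interval length. The helper below is a small sketch of that arithmetic. The function name is illustrative, and the default of 10 seconds applies only to the signal-based dump.

```python
def per_second_rate(count: float, interval_seconds: float = 10.0) -> float:
    """Convert a per-interval count into an average per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return count / interval_seconds
```

For instance, a counter that reports a count of 2 over the 10 second interval corresponds to an average of 0.2 events per second. With statsd or statsite, pass that sink's flush interval instead.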

| Metric | Description | Unit | Type |
| --- | --- | --- | --- |
| nomad.runtime.num_goroutines | Number of goroutines and general load pressure indicator | # of goroutines | Gauge |
| nomad.runtime.alloc_bytes | Memory utilization | # of bytes | Gauge |
| nomad.runtime.heap_objects | Number of objects on the heap. General memory pressure indicator | # of heap objects | Gauge |
| nomad.raft.apply | Number of Raft transactions | Raft transactions / interval | Counter |
| nomad.raft.replication.appendEntries | Raft transaction commit time | ms / Raft Log Append | Timer |
| nomad.raft.leader.lastContact | Time since last contact to leader. General indicator of Raft latency | ms / Leader Contact | Timer |
| nomad.broker.total_ready | Number of evaluations ready to be processed | # of evaluations | Gauge |
| nomad.broker.total_unacked | Evaluations dispatched for processing but incomplete | # of evaluations | Gauge |
| nomad.broker.total_blocked | Evaluations that are blocked until an existing evaluation for the same job completes | # of evaluations | Gauge |
| nomad.plan.queue_depth | Number of scheduler Plans waiting to be evaluated | # of plans | Gauge |
| nomad.plan.submit | Time to submit a scheduler Plan. Higher values cause lower scheduling throughput | ms / Plan Submit | Timer |
| nomad.plan.evaluate | Time to validate a scheduler Plan. Higher values cause lower scheduling throughput. Similar to nomad.plan.submit but does not include RPC time or time in the Plan Queue | ms / Plan Evaluation | Timer |
| nomad.worker.invoke_scheduler.<type> | Time to run the scheduler of the given type | ms / Scheduler Run | Timer |
| nomad.worker.wait_for_index | Time waiting for Raft log replication from leader. High delays result in lower scheduling throughput | ms / Raft Index Wait | Timer |
| nomad.heartbeat.active | Number of active heartbeat timers. Each timer represents a Nomad Client connection | # of heartbeat timers | Gauge |
| nomad.heartbeat.invalidate | The length of time it takes to invalidate a Nomad Client due to failed heartbeats | ms / Heartbeat Invalidation | Timer |
| nomad.rpc.query | Number of RPC queries | RPC Queries / interval | Counter |
| nomad.rpc.request | Number of RPC requests being handled | RPC Requests / interval | Counter |
| nomad.rpc.request_error | Number of RPC requests being handled that result in an error | RPC Errors / interval | Counter |

» Client Metrics

The Nomad client emits metrics related to the resource usage of the node itself and of the allocations and tasks running on it. Operators must explicitly enable publishing of host and allocation metrics by setting publish_allocation_metrics and publish_node_metrics to true.

By default, the collection interval is 1 second, but it can be changed by setting the collection_interval key in the telemetry configuration block.

Please see the agent configuration page for more details.
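To make the shape of that configuration concrete, the sketch below renders a telemetry block as text from the three settings named above. The rendering helper is hypothetical and not part of Nomad, and writing the interval as a duration string such as `"1s"` is an assumption to confirm on the agent configuration page.

```python
def render_telemetry_block(
    publish_allocation_metrics: bool = True,
    publish_node_metrics: bool = True,
    collection_interval: str = "1s",
) -> str:
    """Render a telemetry configuration block as text.

    Only the three keys mentioned in this section are emitted.
    The duration string format for collection_interval is an assumption.
    """

    def as_bool(value: bool) -> str:
        return "true" if value else "false"

    lines = [
        "telemetry {",
        f"  publish_allocation_metrics = {as_bool(publish_allocation_metrics)}",
        f"  publish_node_metrics = {as_bool(publish_node_metrics)}",
        f'  collection_interval = "{collection_interval}"',
        "}",
    ]
    return "\n".join(lines) + "\n"
```

Printing `render_telemetry_block(collection_interval="5s")` gives a block that could be pasted into an agent configuration file after review.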

As of Nomad 0.9, Nomad emits additional labels for parameterized and periodic jobs. The parent job ID is emitted as a new label, parent_id. The labels dispatch_id and periodic_id are also emitted, containing the ID of the specific invocation of the parameterized or periodic job respectively. For example, a dispatch job with the ID myjob/dispatch-1312323423423 will have the following labels:

| Label | Value |
| --- | --- |
| job | myjob/dispatch-1312323423423 |
| parent_id | myjob |
| dispatch_id | 1312323423423 |
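The relationship between the job ID and these labels can be sketched as a small function. It follows the dispatch example above. The `periodic-` prefix for periodic child jobs is an assumption by analogy and is not confirmed by this page, and the function name is illustrative.

```python
def job_labels(job_id: str) -> dict:
    """Derive the job-related labels from a job ID.

    Child IDs are assumed to look like "<parent>/dispatch-<id>" or
    "<parent>/periodic-<id>"; anything else yields only the job label.
    """
    labels = {"job": job_id}
    parent, sep, child = job_id.rpartition("/")
    if not sep:
        return labels
    for prefix, label in (("dispatch-", "dispatch_id"), ("periodic-", "periodic_id")):
        if child.startswith(prefix):
            labels["parent_id"] = parent
            labels[label] = child[len(prefix):]
            break
    return labels
```

A plain job such as `myjob` gets only the `job` label, which matches the table above where the extra labels appear only for child invocations.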

» Host Metrics (post Nomad version 0.7)

Starting in version 0.7, Nomad emits tagged metrics in the format below:

| Metric | Description | Unit | Type | Labels |
| --- | --- | --- | --- | --- |
| nomad.client.allocated.cpu | Total amount of CPU shares the scheduler has allocated to tasks | MHz | Gauge | node_id, datacenter |
| nomad.client.unallocated.cpu | Total amount of CPU shares free for the scheduler to allocate to tasks | MHz | Gauge | node_id, datacenter |
| nomad.client.allocated.memory | Total amount of memory the scheduler has allocated to tasks | Megabytes | Gauge | node_id, datacenter |
| nomad.client.unallocated.memory | Total amount of memory free for the scheduler to allocate to tasks | Megabytes | Gauge | node_id, datacenter |
| nomad.client.allocated.disk | Total amount of disk space the scheduler has allocated to tasks | Megabytes | Gauge | node_id, datacenter |
| nomad.client.unallocated.disk | Total amount of disk space free for the scheduler to allocate to tasks | Megabytes | Gauge | node_id, datacenter |
| nomad.client.allocated.iops | Total amount of IOPS the scheduler has allocated to tasks | IOPS | Gauge | node_id, datacenter |
| nomad.client.unallocated.iops | Total amount of IOPS free for the scheduler to allocate to tasks | IOPS | Gauge | node_id, datacenter |
| nomad.client.allocated.network | Total amount of bandwidth the scheduler has allocated to tasks on the given device | Megabits | Gauge | node_id, datacenter, device |
| nomad.client.unallocated.network | Total amount of bandwidth free for the scheduler to allocate to tasks on the given device | Megabits | Gauge | node_id, datacenter, device |
| nomad.client.host.memory.total | Total amount of physical memory on the node | Bytes | Gauge | node_id, datacenter |
| nomad.client.host.memory.available | Total amount of memory available to processes which includes free and cached memory | Bytes | Gauge | node_id, datacenter |
| nomad.client.host.memory.used | Amount of memory used by processes | Bytes | Gauge | node_id, datacenter |
| nomad.client.host.memory.free | Amount of memory which is free | Bytes | Gauge | node_id, datacenter |
| nomad.client.uptime | Uptime of the host running the Nomad client | Seconds | Gauge | node_id, datacenter |
| nomad.client.host.cpu.total | Total CPU utilization | Percentage | Gauge | node_id, datacenter, cpu |
| nomad.client.host.cpu.user | CPU utilization in the user space | Percentage | Gauge | node_id, datacenter, cpu |
| nomad.client.host.cpu.system | CPU utilization in the system space | Percentage | Gauge | node_id, datacenter, cpu |
| nomad.client.host.cpu.idle | Idle time spent by the CPU | Percentage | Gauge | node_id, datacenter, cpu |
| nomad.client.host.disk.size | Total size of the device | Bytes | Gauge | node_id, datacenter, disk |
| nomad.client.host.disk.used | Amount of space which has been used | Bytes | Gauge | node_id, datacenter, disk |
| nomad.client.host.disk.available | Amount of space which is available | Bytes | Gauge | node_id, datacenter, disk |
| nomad.client.host.disk.used_percent | Percentage of disk space used | Percentage | Gauge | node_id, datacenter, disk |
| nomad.client.host.disk.inodes_percent | Disk space consumed by the inodes | Percent | Gauge | node_id, datacenter, disk |
| nomad.client.allocs.start | Number of allocations starting | Integer | Counter | node_id, job, task_group |
| nomad.client.allocs.running | Number of allocations starting to run | Integer | Counter | node_id, job, task_group |
| nomad.client.allocs.failed | Number of allocations failing | Integer | Counter | node_id, job, task_group |
| nomad.client.allocs.restart | Number of allocations restarting | Integer | Counter | node_id, job, task_group |
| nomad.client.allocs.complete | Number of allocations completing | Integer | Counter | node_id, job, task_group |
| nomad.client.allocs.destroy | Number of allocations being destroyed | Integer | Counter | node_id, job, task_group |
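One way to use the tagged allocated and unallocated gauges together is to compute, per node, what fraction of a resource the scheduler has already handed out. The sketch below assumes each gauge is available as a dict with `Name`, `Value` and `Labels` keys, which is a hypothetical shape chosen for illustration, and the function name is likewise illustrative.

```python
def allocated_fraction(gauges: list, resource: str = "cpu") -> dict:
    """Return {node_id: allocated / (allocated + unallocated)} for one resource.

    Each gauge is assumed to be a dict with "Name", "Value" and "Labels" keys.
    """
    totals = {}
    for gauge in gauges:
        for state in ("allocated", "unallocated"):
            if gauge["Name"] == f"nomad.client.{state}.{resource}":
                node = gauge["Labels"]["node_id"]
                per_node = totals.setdefault(node, {"allocated": 0.0, "unallocated": 0.0})
                per_node[state] += gauge["Value"]
    fractions = {}
    for node, per_node in totals.items():
        capacity = per_node["allocated"] + per_node["unallocated"]
        if capacity > 0:  # skip nodes that report no capacity at all
            fractions[node] = per_node["allocated"] / capacity
    return fractions
```

Passing `resource="memory"` or `resource="disk"` applies the same calculation to the other allocated and unallocated pairs in the table.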

Nomad 0.9 adds an additional "node_class" label from the client's NodeClass attribute. This label is set to the string "none" if empty.

» Host Metrics (deprecated post Nomad 0.7)

The metrics below were emitted by Nomad in versions prior to 0.7. From 0.7 onward they can still be emitted in this format (as well as in the new format detailed above), but any new metrics will only be available in the new format.

| Metric | Description | Unit | Type |
| --- | --- | --- | --- |
| nomad.client.allocated.cpu.<HostID> | Total amount of CPU shares the scheduler has allocated to tasks | MHz | Gauge |
| nomad.client.unallocated.cpu.<HostID> | Total amount of CPU shares free for the scheduler to allocate to tasks | MHz | Gauge |
| nomad.client.allocated.memory.<HostID> | Total amount of memory the scheduler has allocated to tasks | Megabytes | Gauge |
| nomad.client.unallocated.memory.<HostID> | Total amount of memory free for the scheduler to allocate to tasks | Megabytes | Gauge |
| nomad.client.allocated.disk.<HostID> | Total amount of disk space the scheduler has allocated to tasks | Megabytes | Gauge |
| nomad.client.unallocated.disk.<HostID> | Total amount of disk space free for the scheduler to allocate to tasks | Megabytes | Gauge |
| nomad.client.allocated.iops.<HostID> | Total amount of IOPS the scheduler has allocated to tasks | IOPS | Gauge |
| nomad.client.unallocated.iops.<HostID> | Total amount of IOPS free for the scheduler to allocate to tasks | IOPS | Gauge |
| nomad.client.allocated.network.<Device-Name>.<HostID> | Total amount of bandwidth the scheduler has allocated to tasks on the given device | Megabits | Gauge |
| nomad.client.unallocated.network.<Device-Name>.<HostID> | Total amount of bandwidth free for the scheduler to allocate to tasks on the given device | Megabits | Gauge |
| nomad.client.host.memory.<HostID>.total | Total amount of physical memory on the node | Bytes | Gauge |
| nomad.client.host.memory.<HostID>.available | Total amount of memory available to processes which includes free and cached memory | Bytes | Gauge |
| nomad.client.host.memory.<HostID>.used | Amount of memory used by processes | Bytes | Gauge |
| nomad.client.host.memory.<HostID>.free | Amount of memory which is free | Bytes | Gauge |
| nomad.client.uptime.<HostID> | Uptime of the host running the Nomad client | Seconds | Gauge |
| nomad.client.host.cpu.<HostID>.<CPU-Core>.total | Total CPU utilization | Percentage | Gauge |
| nomad.client.host.cpu.<HostID>.<CPU-Core>.user | CPU utilization in the user space | Percentage | Gauge |
| nomad.client.host.cpu.<HostID>.<CPU-Core>.system | CPU utilization in the system space | Percentage | Gauge |
| nomad.client.host.cpu.<HostID>.<CPU-Core>.idle | Idle time spent by the CPU | Percentage | Gauge |
| nomad.client.host.disk.<HostID>.<Device-Name>.size | Total size of the device | Bytes | Gauge |
| nomad.client.host.disk.<HostID>.<Device-Name>.used | Amount of space which has been used | Bytes | Gauge |
| nomad.client.host.disk.<HostID>.<Device-Name>.available | Amount of space which is available | Bytes | Gauge |
| nomad.client.host.disk.<HostID>.<Device-Name>.used_percent | Percentage of disk space used | Percentage | Gauge |
| nomad.client.host.disk.<HostID>.<Device-Name>.inodes_percent | Disk space consumed by the inodes | Percent | Gauge |
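When dashboards need to read both formats during a migration, the legacy names can be mapped onto the tagged names and labels from the previous section. The sketch below does this with regular expressions. It assumes that host IDs, device names and CPU core names contain no dots, and it cannot recover the datacenter label, which is not encoded in the legacy name. The function name is illustrative.

```python
import re

SEG = r"[^.]+"  # one name segment, assumed to contain no dots

# (legacy pattern, tagged-name template, label keys taken from the match)
PATTERNS = [
    (re.compile(rf"^nomad\.client\.(?P<state>allocated|unallocated)\.network\.(?P<device>{SEG})\.(?P<node_id>{SEG})$"),
     "nomad.client.{state}.network", ("node_id", "device")),
    (re.compile(rf"^nomad\.client\.(?P<state>allocated|unallocated)\.(?P<resource>cpu|memory|disk|iops)\.(?P<node_id>{SEG})$"),
     "nomad.client.{state}.{resource}", ("node_id",)),
    (re.compile(rf"^nomad\.client\.uptime\.(?P<node_id>{SEG})$"),
     "nomad.client.uptime", ("node_id",)),
    (re.compile(rf"^nomad\.client\.host\.memory\.(?P<node_id>{SEG})\.(?P<field>{SEG})$"),
     "nomad.client.host.memory.{field}", ("node_id",)),
    (re.compile(rf"^nomad\.client\.host\.cpu\.(?P<node_id>{SEG})\.(?P<cpu>{SEG})\.(?P<field>{SEG})$"),
     "nomad.client.host.cpu.{field}", ("node_id", "cpu")),
    (re.compile(rf"^nomad\.client\.host\.disk\.(?P<node_id>{SEG})\.(?P<disk>{SEG})\.(?P<field>{SEG})$"),
     "nomad.client.host.disk.{field}", ("node_id", "disk")),
]


def to_tagged(legacy_name: str):
    """Map a legacy metric name to (tagged_name, labels), or None if unrecognized."""
    for pattern, template, label_keys in PATTERNS:
        match = pattern.match(legacy_name)
        if match:
            groups = match.groupdict()
            return template.format(**groups), {key: groups[key] for key in label_keys}
    return None
```

If identifiers in a given cluster do contain dots, the split becomes ambiguous and this approach would need a list of known host IDs instead.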

» Allocation Metrics

| Metric | Description | Unit | Type |
| --- | --- | --- | --- |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.rss | Amount of RSS memory consumed by the task | Bytes | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.cache | Amount of memory cached by the task | Bytes | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.swap | Amount of memory swapped by the task | Bytes | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.max_usage | Maximum amount of memory ever used by the task | Bytes | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_usage | Amount of memory used by the kernel for this task | Bytes | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_max_usage | Maximum amount of memory ever used by the kernel for this task | Bytes | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_percent | Total CPU resources consumed by the task across all cores | Percentage | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.system | Total CPU resources consumed by the task in the system space | Percentage | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.user | Total CPU resources consumed by the task in the user space | Percentage | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.throttled_time | Total time that the task was throttled | Nanoseconds | Gauge |
| nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_ticks | CPU ticks consumed by the process in the last collection interval | Integer | Gauge |
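Because the job, task group, allocation ID and task are embedded in the metric name, a consumer typically has to split the name to recover them. The sketch below does that under the assumption that none of those four components contains a dot. The function and key names are illustrative.

```python
def parse_alloc_metric(name: str):
    """Split an allocation metric name into its components.

    Expects nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.<category>.<field>
    where category is "memory" or "cpu". Returns None for anything else.
    """
    parts = name.split(".")
    if len(parts) != 9 or parts[:3] != ["nomad", "client", "allocs"]:
        return None
    job, task_group, alloc_id, task, category, field = parts[3:]
    if category not in ("memory", "cpu"):
        return None
    return {
        "job": job,
        "task_group": task_group,
        "alloc_id": alloc_id,
        "task": task,
        "category": category,
        "field": field,
    }
```

Names with a different number of segments are rejected instead of being guessed at, since a dot inside a job or task name would shift every component.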

» Job Metrics

Job metrics are emitted by the Nomad leader server.

| Metric | Description | Unit | Type | Labels |
| --- | --- | --- | --- | --- |
| nomad.job_summary.queued | Number of queued allocations for a job | Integer | Gauge | job, task_group |
| nomad.job_summary.complete | Number of complete allocations for a job | Integer | Gauge | job, task_group |
| nomad.job_summary.failed | Number of failed allocations for a job | Integer | Gauge | job, task_group |
| nomad.job_summary.running | Number of running allocations for a job | Integer | Gauge | job, task_group |
| nomad.job_summary.starting | Number of starting allocations for a job | Integer | Gauge | job, task_group |
| nomad.job_summary.lost | Number of lost allocations for a job | Integer | Gauge | job, task_group |
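Since these gauges are labeled per task group, a per-job view requires summing across task groups. The sketch below shows one way to do that. It assumes each gauge is a dict with `Name`, `Value` and `Labels` keys, a hypothetical shape used only for illustration.

```python
from collections import defaultdict

PREFIX = "nomad.job_summary."


def summarize_jobs(gauges: list) -> dict:
    """Sum job summary gauges across task groups: {job: {status: total}}.

    Each gauge is assumed to be a dict with "Name", "Value" and "Labels" keys.
    """
    summary = defaultdict(lambda: defaultdict(float))
    for gauge in gauges:
        name = gauge["Name"]
        if not name.startswith(PREFIX):
            continue  # not a job summary metric
        status = name[len(PREFIX):]  # queued, running, failed, ...
        job = gauge["Labels"]["job"]
        summary[job][status] += gauge["Value"]
    return {job: dict(statuses) for job, statuses in summary.items()}
```

Because the metrics are emitted by the leader server, the input should come from the leader. Reading another server may yield no job summary gauges at all.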

» Metric Types

| Type | Description | Quantiles |
| --- | --- | --- |
| Gauge | Gauge types report an absolute number at the end of the aggregation interval | false |
| Counter | Counts are incremented and flushed at the end of the aggregation interval and then are reset to zero | true |
| Timer | Timers measure the time to complete a task and will include quantiles, means, standard deviation, etc per interval. | true |
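The difference between the three types is easiest to see in how each behaves when an aggregation interval ends. The classes below are a simplified model of those semantics, not Nomad's implementation. The class and method names are hypothetical, quantiles are omitted from the timer for brevity, and having the gauge keep its last value across flushes is a choice made for this sketch.

```python
import statistics


class Gauge:
    """Reports the most recently set absolute value; flushing does not reset it."""

    def __init__(self):
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def flush(self) -> float:
        return self.value


class Counter:
    """Accumulates increments during an interval and resets to zero on flush."""

    def __init__(self):
        self.count = 0
        self.sum = 0.0

    def incr(self, amount: float = 1.0) -> None:
        self.count += 1
        self.sum += amount

    def flush(self) -> dict:
        result = {"Count": self.count, "Sum": self.sum}
        self.count, self.sum = 0, 0.0
        return result


class Timer:
    """Collects durations during an interval and summarizes them on flush."""

    def __init__(self):
        self.samples = []

    def observe(self, milliseconds: float) -> None:
        self.samples.append(milliseconds)

    def flush(self) -> dict:
        samples, self.samples = self.samples, []
        if not samples:
            return {"Count": 0}
        return {
            "Count": len(samples),
            "Min": min(samples),
            "Max": max(samples),
            "Mean": statistics.fmean(samples),
            "Sum": sum(samples),
        }
```

In this model, flushing a counter twice in a row yields zeros the second time, while a gauge returns the same value until it is set again.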