»APM Plugins

APMs are used to store metrics about an application's performance and current state. The APM (Application Performance Management) plugin is responsible for querying the APM and returning a value that is used to determine whether scaling should occur.

»Prometheus APM Plugin

Use Prometheus metrics to scale your Nomad job task groups or cluster. The query performed on Prometheus should return a single value. You can use the scalar function in your query to achieve this.

»Agent Configuration Options

apm "prometheus" {
  driver = "prometheus"

  config = {
    address = "http://prometheus.my.endpoint.io:9090"
  }
}
  • address (string: "http://127.0.0.1:9090") - The address of the Prometheus endpoint used to perform queries.

»Policy Configuration Options

check {
  source = "prometheus"
  query  = "scalar(avg((haproxy_server_current_sessions{backend=\"http_back\"}) and (haproxy_server_up{backend=\"http_back\"} == 1)))"
  ...
}

»Datadog APM Plugin

The Datadog APM allows using time series data to make scaling decisions.

»Agent Configuration Options

apm "datadog" {
  driver = "datadog"

  config = {
    dd_api_key = "<api key>"
    dd_app_key = "<app key>"
  }
}
  • dd_api_key (string: "") - The Datadog API key to use for authentication.
  • dd_app_key (string: "") - The Datadog application key to use for authentication.

The Datadog plugin can also read its configuration options via environment variables. The accepted keys are DD_API_KEY and DD_APP_KEY. The agent configuration parameters take precedence over the environment variables.
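
As a minimal sketch of the environment variable approach, the agent block below leaves the key parameters out of the configuration file. It assumes DD_API_KEY and DD_APP_KEY are already exported in the environment of the agent process; the block shape mirrors the other agent examples on this page and is an illustration, not a verified configuration.

```hcl
# Assumption: DD_API_KEY and DD_APP_KEY are set in the agent's environment,
# so neither key is written into this file.
apm "datadog" {
  driver = "datadog"
}
```

Because agent configuration parameters take precedence, adding dd_api_key or dd_app_key to a config block here would override the corresponding environment variable.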

»Policy Configuration Options

check {
  source = "datadog"
  query  = "FROM=2m;TO=0m;QUERY=avg:proxy.backend.response.time{proxy-service:web-app}"
  ...
}

The query consists of three sections, each separated using a ; delimiter. More information on the arguments can be found on the Datadog site.

  • FROM - A time offset which indicates the start of the queried time period.

  • TO - A time offset which indicates the end of the queried time period.

  • QUERY - The query string to execute.
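
To illustrate how the three sections combine, the sketch below widens the queried period to the last five minutes while reusing the metric from the example above. The five minute window is an arbitrary value chosen for illustration.

```hcl
check {
  source = "datadog"

  # FROM=5m starts the period five minutes ago, TO=0m ends it now,
  # and QUERY carries the query string to execute.
  query = "FROM=5m;TO=0m;QUERY=avg:proxy.backend.response.time{proxy-service:web-app}"

  # Remaining check options are omitted, as in the example above.
}
```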

»Nomad APM Plugin

The Nomad APM plugin allows querying the Nomad API for metric data. This provides an immediate starting point without additional applications, but comes at the price of efficiency. When using this APM, monitor Nomad carefully to ensure it is not put under excessive load.

»Agent Configuration Options

apm "nomad-apm" {
  driver = "nomad-apm"
}

When using a Nomad cluster with ACLs enabled, the following ACL policy will provide the appropriate permissions for obtaining task group metrics:

namespace "default" {
  policy       = "read"
  capabilities = ["read-job"]
}

In order to obtain cluster level metrics, the following ACL policy will be required:

node {
  policy = "read"
}

namespace "default" {
  policy       = "read"
  capabilities = ["read-job"]
}

»Policy Configuration Options - Task Groups

The Nomad APM allows querying Nomad to understand the current resource usage of a task group.

check {
  source = "nomad-apm"
  query  = "avg_cpu"
  ...
}

Querying Nomad task group metrics is done using the operation_metric syntax, where valid operations are:

  • avg - returns the average of the metric value across allocations in the task group.

  • min - returns the lowest metric value among the allocations in the task group.

  • max - returns the highest metric value among the allocations in the task group.

  • sum - returns the sum of all the metric values for the allocations in the task group.

The metric value can be:

  • cpu - CPU usage as reported by the nomad.client.allocs.cpu.total_percent metric.

  • memory - Memory usage as reported by the nomad.client.allocs.memory.usage metric.
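
As a sketch of how an operation and a metric combine under the operation_metric syntax, the check below pairs the max operation with the memory metric. The choice of max_memory is illustrative only.

```hcl
check {
  source = "nomad-apm"

  # "max" is the operation and "memory" is the metric, joined by an underscore.
  # This asks for the highest memory usage among the task group's allocations.
  query = "max_memory"

  # Remaining check options are omitted, as in the example above.
}
```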

»Policy Configuration Options - Client Nodes

The Nomad APM allows querying Nomad to understand the current allocated resource as a percentage of the total available.

check {
  source = "nomad-apm"
  query  = "percentage-allocated_cpu"
  ...
}

Querying Nomad client node metrics is done using the operation_metric syntax, where the valid operation is:

  • percentage-allocated - returns the allocated resource as a percentage of the total available.

The metric value can be:

  • cpu - allocated CPU as reported by calculating total allocatable against the total allocated by the scheduler.

  • memory - allocated memory as reported by calculating total allocatable against the total allocated by the scheduler.
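
Following the same pattern as the percentage-allocated_cpu query above, a client node check on memory could be sketched as below. The query string is built from the syntax described in this section and is shown as an illustration.

```hcl
check {
  source = "nomad-apm"

  # The operation "percentage-allocated" is joined to the metric "memory"
  # with an underscore, matching the percentage-allocated_cpu form.
  query = "percentage-allocated_memory"

  # Remaining check options are omitted, as in the example above.
}
```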