» Preemption for Service and Batch Jobs

Prior to Nomad 0.9, job priority in Nomad was used to process scheduling requests in priority order. Preemption, implemented in Nomad 0.9 allows Nomad to evict running allocations to place allocations of a higher priority. Allocations of a job that are blocked temporarily go into "pending" status until the cluster has additional capacity to run them. This is useful when operators need to run relatively higher priority tasks sooner even under resource contention across the cluster.

While Nomad 0.9 introduced preemption for system jobs, Nomad 0.9.3 Enterprise additionally allows preemption for service and batch jobs. This functionality can easily be enabled by sending a payload with the appropriate options specified to the scheduler configuration API endpoint.

» Reference Material

» Estimated Time to Complete

20 minutes

» Prerequisites

To perform the tasks described in this guide, you need to have a Nomad environment with Consul installed. You can use this repo to easily provision a sandbox environment. This guide will assume a cluster with one server node and three client nodes. To simulate resource contention, the nodes in this environment will each have 1 GB RAM (For AWS, you can choose the t2.micro instance type). Remember that service and batch job preemption require Nomad 0.9.3 Enterprise.

» Steps

» Step 1: Create a Job with Low Priority

Start by creating a job with relatively lower priority into your Nomad cluster. One of the allocations from this job will be preempted in a subsequent deployment when there is a resource contention in the cluster. Copy the following job into a file and name it webserver.nomad.

job "webserver" {
  datacenters = ["dc1"]
  type        = "service"
  priority    = 40

  group "webserver" {
    count = 3

    task "apache" {
      driver = "docker"

      config {
        image = "httpd:latest"

        port_map {
          http = 80
        }
      }

      resources {
        network {
          mbits = 10
          port  "http"{}
        }

        memory = 600
      }

      service {
        name = "apache-webserver"
        port = "http"

        check {
          name     = "alive"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Note that the count is 3 and that each allocation is specifying 600 MB of memory. Remember that each node only has 1 GB of RAM.

» Step 2: Run the Low Priority Job

Register webserver.nomad:

$ nomad run webserver.nomad 
==> Monitoring evaluation "1596bfc8"
    Evaluation triggered by job "webserver"
    Allocation "725d3b49" created: node "16653ac1", group "webserver"
    Allocation "e2f9cb3d" created: node "f765c6e8", group "webserver"
    Allocation "e9d8df1b" created: node "b0700ec0", group "webserver"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "1596bfc8" finished with status "complete"

You should be able to check the status of the webserver job at this point and see that an allocation has been placed on each client node in the cluster:

$ nomad status webserver
ID            = webserver
Name          = webserver
Submit Date   = 2019-06-19T04:20:32Z
Type          = service
Priority      = 40
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
725d3b49  16653ac1  webserver   0        run      running  1m18s ago  59s ago
e2f9cb3d  f765c6e8  webserver   0        run      running  1m18s ago  1m2s ago
e9d8df1b  b0700ec0  webserver   0        run      running  1m18s ago  59s ago

» Step 3: Create a Job with High Priority

Create another job with a priority greater than the job you just deployed. Copy the following into a file named redis.nomad:

job "redis" {
  datacenters = ["dc1"]
  type        = "service"
  priority    = 80

  group "cache1" {
    count = 1

    task "redis" {
      driver = "docker"

      config {
        image = "redis:latest"

        port_map {
          db = 6379
        }
      }

      resources {
        network {
          port "db" {}
        }

        memory = 700
      }

      service {
        name = "redis-cache"
        port = "db"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Note that this job has a priority of 80 (greater than the priority of the job from Step 1) and requires 700 MB of memory. This allocation will create a resource contention in the cluster since each node only has 1 GB of memory with a 600 MB allocation already placed on it.

» Step 4: Try to Run redis.nomad

Remember that preemption for service and batch jobs are disabled by default. This means that the redis job will be queued due to resource contention in the cluster. You can verify the resource contention before actually registering your job by running the plan command:

$ nomad plan redis.nomad 
+ Job: "redis"
+ Task Group: "cache1" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache1" (failed to place 1 allocation):
    * Resources exhausted on 3 nodes
    * Dimension "memory" exhausted on 3 nodes

Run the job to see that the allocation will be queued:

$ nomad run redis.nomad 
==> Monitoring evaluation "1e54e283"
    Evaluation triggered by job "redis"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "1e54e283" finished with status "complete" but failed to place all allocations:
    Task Group "cache1" (failed to place 1 allocation):
      * Resources exhausted on 3 nodes
      * Dimension "memory" exhausted on 3 nodes
    Evaluation "1512251a" waiting for additional capacity to place remainder

You may also verify the allocation has been queued by now checking the status of the job:

$ nomad status redis
ID            = redis
Name          = redis
Submit Date   = 2019-06-19T03:33:17Z
Type          = service
Priority      = 80
...
Placement Failure
Task Group "cache1":
  * Resources exhausted on 3 nodes
  * Dimension "memory" exhausted on 3 nodes

Allocations
No allocations placed

You may remove this job now. In the next steps, we will enable service job preemption and re-deploy:

$ nomad stop -purge redis
==> Monitoring evaluation "153db6c0"
    Evaluation triggered by job "redis"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "153db6c0" finished with status "complete"

» Step 5: Enable Service Job Preemption

Verify the scheduler configuration with the following command:

$ curl -s localhost:4646/v1/operator/scheduler/configuration | jq
{
  "SchedulerConfig": {
    "PreemptionConfig": {
      "SystemSchedulerEnabled": true,
      "BatchSchedulerEnabled": false,
      "ServiceSchedulerEnabled": false
    },
    "CreateIndex": 5,
    "ModifyIndex": 506
  },
  "Index": 506,
  "LastContact": 0,
  "KnownLeader": true
}

Note that BatchSchedulerEnabled and ServiceSchedulerEnabled are both set to false by default. Since we are preempting service jobs in this guide, we need to set ServiceSchedulerEnabled to true. We will do this by directly interacting with the API.

Create the following JSON payload and place it in a file named scheduler.json:

{
  "PreemptionConfig": {
    "SystemSchedulerEnabled": true,
    "BatchSchedulerEnabled": false,
    "ServiceSchedulerEnabled": true
  }
}

Note that ServiceSchedulerEnabled has been set to true.

Run the following command to update the scheduler configuration:

$ curl -XPOST localhost:4646/v1/operator/scheduler/configuration -d @scheduler.json

You should now be able to check the scheduler configuration again and see that preemption has been enabled for service jobs (output below is abbreviated):

$ curl -s localhost:4646/v1/operator/scheduler/configuration | jq
{
  "SchedulerConfig": {
    "PreemptionConfig": {
      "SystemSchedulerEnabled": true,
      "BatchSchedulerEnabled": false,
      "ServiceSchedulerEnabled": true
    },
...
}

» Step 6: Try Running redis.nomad Again

Now that you have enabled preemption on service jobs, deploying your redis job should evict one of the lower priority webserver allocations and place it into a queue. You can run nomad plan to see a preview of what will happen:

$ nomad plan redis.nomad 
+ Job: "redis"
+ Task Group: "cache1" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Preemptions:

Alloc ID                              Job ID     Task Group
725d3b49-d5cf-6ba2-be3d-cb441c10a8b3  webserver  webserver
...

Note that Nomad is indicating one of the webserver allocations will be evicted.

Now run the redis job:

$ nomad run redis.nomad 
==> Monitoring evaluation "7ada9d9f"
    Evaluation triggered by job "redis"
    Allocation "8bfcdda3" created: node "16653ac1", group "cache1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7ada9d9f" finished with status "complete"

You can check the status of the webserver job and verify one of the allocations has been evicted:

$ nomad status webserver
ID            = webserver
Name          = webserver
Submit Date   = 2019-06-19T04:20:32Z
Type          = service
Priority      = 40
...
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
webserver   1       0         2        0       1         0

Placement Failure
Task Group "webserver":
  * Resources exhausted on 3 nodes
  * Dimension "memory" exhausted on 3 nodes

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
725d3b49  16653ac1  webserver   0        evict    complete  4m10s ago  33s ago
e2f9cb3d  f765c6e8  webserver   0        run      running   4m10s ago  3m54s ago
e9d8df1b  b0700ec0  webserver   0        run      running   4m10s ago  3m51s ago

» Step 7: Stop the Redis Job

Stop the redis job and verify that evicted/queued webserver allocation starts running again:

$ nomad stop redis
==> Monitoring evaluation "670922e9"
    Evaluation triggered by job "redis"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "670922e9" finished with status "complete"

You should now be able to see from the webserver status that the third allocation that was previously preempted is running again:

$ nomad status webserver
ID            = webserver
Name          = webserver
Submit Date   = 2019-06-19T04:20:32Z
Type          = service
Priority      = 40
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
webserver   0       0         3        0       1         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
f623eb81  16653ac1  webserver   0        run      running   13s ago    7s ago
725d3b49  16653ac1  webserver   0        evict    complete  6m44s ago  3m7s ago
e2f9cb3d  f765c6e8  webserver   0        run      running   6m44s ago  6m28s ago
e9d8df1b  b0700ec0  webserver   0        run      running   6m44s ago  6m25s ago

» Next Steps

The process you learned in this guide can also be applied to batch jobs as well. Read more about preemption in Nomad Enterprise here.