» Using HDFS

HDFS is a distributed, replicated, and scalable file system written for the Hadoop framework. Spark was designed to read from and write to HDFS, since it is common for Spark applications to perform data-intensive processing over large datasets. HDFS can be deployed as its own Nomad job.

» Running HDFS on Nomad

A sample HDFS job file can be found here. It has two task groups, one for the HDFS NameNode and one for the DataNodes. Both task groups use a Docker image that includes Hadoop:

  group "NameNode" {

    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }

    task "NameNode" {

      driver = "docker"

      config {
        image = "rcgenova/hadoop-2.7.3"
        command = "bash"
        args = [ "-c",
          "hdfs namenode -format && exec hdfs namenode -D fs.defaultFS=hdfs://${NOMAD_ADDR_ipc}/ -D dfs.permissions.enabled=false" ]
        network_mode = "host"
        port_map {
          ipc = 8020
          ui = 50070
        }
      }

      resources {
        cpu    = 1000
        memory = 1024
        network {
          port "ipc" {
            static = 8020
          }
          port "ui" {
            static = 50070
          }
        }
      }

      service {
        name = "hdfs"
        port = "ipc"
      }
    }
  }

The NameNode task registers itself in Consul as hdfs. This lets the DataNodes locate the NameNode through Consul DNS at hdfs.service.consul rather than a hard-coded address:

  group "DataNode" {

    count = 3

    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }

    task "DataNode" {

      driver = "docker"

      config {
        network_mode = "host"
        image = "rcgenova/hadoop-2.7.3"
        args = [ "hdfs", "datanode"
          , "-D", "fs.defaultFS=hdfs://hdfs.service.consul/"
          , "-D", "dfs.permissions.enabled=false"
        ]
        port_map {
          data = 50010
          ipc = 50020
          ui = 50075
        }
      }

      resources {
        cpu    = 1000
        memory = 1024
        network {
          port "data" {
            static = 50010
          }
          port "ipc" {
            static = 50020
          }
          port "ui" {
            static = 50075
          }
        }
      }

    }
  }

Another viable option for the DataNode task group is to use a dedicated system job. This deploys a DataNode to every client node in the cluster, which may or may not be desirable depending on your use case.
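As a sketch of that alternative (reusing the image and arguments from the DataNode group above; the job name and datacenter are assumptions), the DataNodes could be moved into a separate job with `type = "system"`:

  # Hypothetical system-job variant: Nomad places one DataNode
  # allocation on every eligible client node.
  job "hdfs-datanodes" {
    datacenters = ["dc1"]
    type        = "system"

    group "DataNode" {
      task "DataNode" {
        driver = "docker"

        config {
          network_mode = "host"
          image = "rcgenova/hadoop-2.7.3"
          args = [ "hdfs", "datanode"
            , "-D", "fs.defaultFS=hdfs://hdfs.service.consul/"
            , "-D", "dfs.permissions.enabled=false"
          ]
        }
      }
    }
  }

With the system scheduler there is no count or distinct_hosts constraint to manage; the scheduler itself guarantees at most one DataNode per node.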

The HDFS job can be deployed using the nomad run command:

$ nomad run hdfs.nomad
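After submission, deployment progress can be checked with nomad status (assuming the job stanza in hdfs.nomad is named hdfs):

$ nomad status hdfs

The output summarizes each task group's desired and placed allocations, so you can confirm the NameNode and all DataNodes are running before submitting Spark applications.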

» Production Deployment Considerations

A production deployment will typically have redundant NameNodes in an active/passive configuration (which requires ZooKeeper). See HDFS High Availability.

» Next Steps

Learn how to monitor the output of your Spark applications.