Ambari provisioned Hadoop cluster on Docker
Lajos Papp, 17 June 2014

We are getting close to releasing and open sourcing our Docker-based Hadoop provisioning project. The slides were presented recently at the Hadoop Summit, and there is interest from the community in learning the technical details.

The project – called Cloudbreak – will provide a REST API to provision a Hadoop cluster – anywhere. The cluster can be hosted on AWS EC2, Azure, physical servers or even your laptop – we are adding more providers – but it is always based on the same concept: Apache Ambari-managed Docker containers.

This blog entry is the first in a series, where we describe the Docker layer step-by-step:

  • Single-node Docker-based Hadoop “cluster” locally
  • Multi-node Docker-based Hadoop cluster
  • Multi-node Docker-based Hadoop cluster on EC2
  • Cloudbreak

Get Docker

The only required software is Docker, so if you don’t have it yet, jump to the installation section of the official documentation.

The basics you need to work with Docker containers are described in the user guide.
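
As a quick sanity check (just a sketch, assuming a standard installation), you can verify that the Docker client can talk to the daemon before going further:

docker version   # prints client and daemon version information
docker info      # shows how many containers and images the daemon knows about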

Single-node Cluster

All setups are based on Docker images; only the glue code is different. Let’s start with the simplest setup:

  • start the first Docker container in the background; it runs ambari-server and ambari-agent.
  • start the second Docker container, which:
    • waits for the agent to connect to the server
    • starts an ambari-shell, which instructs ambari-server through its REST API to:
      • define an Ambari Blueprint by posting a JSON to <AMBARI_URL>/api/v1/blueprints
      • create a Hadoop cluster by posting a JSON to <AMBARI_URL>/api/v1/clusters, using the blueprint created in the previous step
docker run -d -p 8080 -h amb0.mycorp.kom --name ambari-singlenode sequenceiq/ambari:1.6.0 --tag ambari-server=true
docker run -e BLUEPRINT=single-node-hdfs-yarn --link ambari-singlenode:ambariserver -t --rm --entrypoint /bin/sh sequenceiq/ambari:1.6.0 -c /tmp/install-cluster.sh

Or, if you want a twitter-sized one-liner to get started with Hadoop in less than a minute:

curl -LOs j.mp/ambari-singlenode && . ambari-singlenode

Pulling the sequenceiq/ambari image for the first time will take a couple of minutes (for me it took around 4 minutes). While the download is running, let’s explain all those parameters.
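
If you want the docker run commands above to start instantly, you can also pull the image ahead of time; this is the same image name and tag used throughout this post:

docker pull sequenceiq/ambari:1.6.0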

First container: ambari-server and ambari-agent

Let’s break down the parameters of the first container:

docker run -d -p 8080 -h amb0.mycorp.kom --name ambari-singlenode sequenceiq/ambari:1.6.0 --tag ambari-server=true
  • -d : Detached mode, container runs in the background
  • -p 8080 : Publish ambari web and REST API port
  • -h amb0.mycorp.kom : hostname
  • --name ambari-singlenode : assign a name to the container
  • sequenceiq/ambari:1.6.0 : the name of the image
  • --tag ambari-server=true : the command; note that it is appended to the entrypoint

The default entrypoint of the image is start-serf-agent.sh (see the Dockerfile), so the --tag ambari-server=true command is actually an argument of the Serf agent.
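
If you want to verify how the entrypoint and the command are combined, you can inspect the running container (a quick sketch; the exact output format depends on your Docker version):

docker inspect -f '{{.Config.Entrypoint}}' ambari-singlenode   # the entrypoint baked into the image
docker inspect -f '{{.Config.Cmd}}' ambari-singlenode          # the command we passed, appended after it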

Serf

What is Serf? The definition goes like this:

Serf is a decentralized solution for cluster membership, failure detection, and orchestration. Lightweight and highly available.

Right now it might not seem to make much sense to talk about membership and clusters, but remember: we want the exact same process and tools for the dev environment and for production.

The only Serf feature we use at this point is that you can define shell-script based event handlers for each of the membership events:

  • member-join
  • member-failed
  • member-leave
  • member-xxx

The member-join event-handler script will check the Serf tags, defined by --tag name=value, and will start:

  • the ambari-server Java process, if the ambari-server tag is true
  • the ambari-agent Python process, if the ambari-agent tag is true

You might have noted that only the ambari-server tag is defined. The reason is that ambari-agent is set to true by default.
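
To make this concrete, here is a minimal sketch of what such a member-join handler could look like. This is not the actual script shipped in the image; it only relies on the fact that Serf exposes the agent’s own tags to event handlers as SERF_TAG_* environment variables:

#!/bin/bash
# hypothetical member-join handler: start the daemons selected by the Serf tags
if env | grep -iq '^SERF_TAG_AMBARI.SERVER=true'; then
  ambari-server start    # the Java server process
fi
if env | grep -iq '^SERF_TAG_AMBARI.AGENT=true'; then
  ambari-agent start     # the Python agent process
fi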

Second container: ambari-shell

docker run -e BLUEPRINT=single-node-hdfs-yarn --link ambari-singlenode:ambariserver -t --rm --entrypoint /bin/sh sequenceiq/ambari:1.6.0 -c /tmp/install-cluster.sh
  • -e BLUEPRINT=single-node-hdfs-yarn : the template to use for the cluster (single-node-hdfs-yarn / multi-node-hdfs-yarn / lambda-architecture); see the blueprint JSON on GitHub
  • --link ambari-singlenode:ambariserver : makes all exposed ports and the private IP of ambari-singlenode available as AMBARISERVER_xxx env variables
  • -t : pseudo terminal, to see the progress
  • --rm : remove the container once it has finished
  • --entrypoint /bin/sh : the default entrypoint runs the shell in interactive mode; we override it to run the script specified as /tmp/install-cluster.sh (sketched below)
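
To give a feel for what install-cluster.sh does, here is a rough sketch of the two REST calls described earlier, made with curl and the environment variables injected by --link. The blueprint body is heavily trimmed and the HDP stack version and cluster name are only assumptions; the real JSON files are in the GitHub repository:

# address and port of the linked ambari-singlenode container
AMBARI="$AMBARISERVER_PORT_8080_TCP_ADDR:$AMBARISERVER_PORT_8080_TCP_PORT"

# 1. register the blueprint (trimmed, illustrative payload)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d '{"host_groups":[{"name":"host_group_1","components":[{"name":"NAMENODE"},{"name":"DATANODE"},{"name":"RESOURCEMANAGER"},{"name":"NODEMANAGER"}],"cardinality":"1"}],"Blueprints":{"blueprint_name":"single-node-hdfs-yarn","stack_name":"HDP","stack_version":"2.1"}}' \
  http://$AMBARI/api/v1/blueprints/single-node-hdfs-yarn

# 2. create the cluster from the blueprint
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d '{"blueprint":"single-node-hdfs-yarn","host_groups":[{"name":"host_group_1","hosts":[{"fqdn":"amb0.mycorp.kom"}]}]}' \
  http://$AMBARI/api/v1/clusters/single-node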

Install completed

Once Ambari Shell has completed the installation, you are ready to use the cluster. To find out the IP of the Ambari server, run:

docker inspect -f "{{.NetworkSettings.IPAddress}}" ambari-singlenode

To start with, you can browse the Ambari web UI on port 8080. The default username/password is admin/admin.

Or, if you can’t reach the private IP of the container directly (Windows users), use the port exposed to the host:

docker port ambari-singlenode 8080
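
The command prints a host address and port (the port is chosen by Docker, so yours will differ); you can then point your browser, or a quick curl smoke test, at that address. The port below is only an example value:

docker port ambari-singlenode 8080                            # e.g. 0.0.0.0:49153
curl -u admin:admin http://localhost:49153/api/v1/clusters    # should list the cluster you just built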

Next steps

In the upcoming blog posts we will build a multi-node Hadoop cluster with the same toolset, so stay tuned…
