Apache Tez cluster on Docker Janos Matyas 19 September 2014

This week the Apache Tez community announced the release of the 0.5 version of the project. At SequenceIQ first time we came across Tez was in 2013 – after Hortonworks launched the Stinger Initiative. Though we were not using Hive (that might change soon) we have quickly realized the other capabilities of Tez – the expressive data flow API, data movement patterns, dynamic graph reconfiguration, etc – to name a few.

We quickly became fans of Tez – and have started to run internal PoC projects, rewrite ML algorithms and legacy MR2 code to run/leverage Tez. The new release comes with a stable developer API and a proven stability track, and this has triggered a major re-architecture/refactoring project at SequenceIQ. While I don’t want to enter into deep details, we are building a Platform as a Service API – with the first stages of the project already released, open sourced and in public beta:

Cloudbreak – our Docker based cloud agnostic Hadoop as a Service API (AWS, Azure, Google Cloud, DigitalOcean); Periscope – an SLA policy based autoscaling API for Hadoop YARN

One of the unreleased component is a project called Banzai Pipeline – a big data pipeline API (with 50+ pre-built data and job pipes), running on MR2, Tez and Spark.

With all these said, we have put together a Tez Ready Docker based Hadoop cluster to share our excitement and allow you to quickly start and get familiar with the nice features of the Tez API. The cluster is built on our widely used Apache Ambari Docker container, with some additional features. The containers are service discovery aware. You don’t need to setup anything beforehand, configure IP addresses or DNS names – the only thing you will need to do is just specify the number of nodes desired in your cluster, and you are ready to go. If you are interested on the underlying architecture (using Docker, Serf and dnsmasq) you can check my slides/presentation from the Hadoop Summit.

I’d like to highlight one important feature of Tez – us being crazy about automation/DevOps – the simplicity and the capability of running multiple versions of Tez on the same YARN cluster. We are contributors to many Apache projects (Hadoop, YARN, Ambari, etc) and since we have started to use Tez we consider to contribute there as well (at the end of the day will be a core part of our platform). Adding new features, changing code or fixing bugs always introduce undesired features – nevertheless, the Tez binaries built by different colleagues can be tested at scale, using the same cluster without affecting each others work. Check Gopal V’s good introduction about Tez and DevOps.

Apache Tez cluster on Docker

The container’s code is available on our GitHub repository.

Pull the image from the Docker Repository

We suggest to always pull the container from the official Docker repository – as this is always maintained and supported by us.

1
docker pull sequenceiq/ambari:1.7.0-ea-tez

Building the image

Alternatively you can always build your own container based on our Dockerfile.

1
docker build --rm -t sequenceiq/ambari:1.7.0-ea-tez ambari-server/

Running the cluster

We have put together a few shell functions to simplify your work, so before you start make sure you get the following ambari-functions file.

1
curl -Lo .amb j.mp/docker-ambari-tez && . .amb

Create your Apache Tez cluster

You are almost there. The only thing you will need to do is to specify the number of nodes you need in your cluster. We will launch the containers, they will dynamically join the cluster and apply the Tez specific configurations.

1
amb-deploy-cluster 4

Once the cluster is started you can enter in the container and submit your custom Tez application or use one of the stock Tez examples.

Check back next week, as we are releasing real world examples running on three different big data fabrics: Tez, MR2 and Spark.

Should you have any questions let us know through our social channels using LinkedIn, Twitter or Facebook.

Comments

Recent Posts