Two weeks ago we open sourced Cloudbreak, our cloud agnostic and Docker based Hadoop as a Service API. The first public beta version supports Amazon’s AWS and Microsoft’s Azure, and we are already wrapping up a few new cloud provider integrations.
While there is some documentation about running Docker containers on Amazon, there is no detailed description of running Docker on the Azure cloud. With this blog post we would like to shed some light on it – recently there have been lots of announcements from Microsoft about Docker support (Azure CLI, Kubernetes, libswarm), but they are either not finished yet or not ready to build a robust platform on top of. We are eagerly waiting for the Kubernetes integration.
In the meantime, if you are interested in running a cluster of Docker containers, or in doing something more complex, then read on.
Just to briefly recap – with Cloudbreak we are launching on demand Hadoop clusters (check our blog for further technical details) in Docker containers. These containers are shipped to different cloud VMs, and dynamically find and join each other – they form a fully functional Hadoop cluster without the need to do anything manually on the host, or to apply any manual pre-configuration.
So how are we doing this?
Docker ready base VM image
First of all you need a base image with Docker installed, so we have built and made available an Ubuntu 14.04 image that includes it. Apart from Docker, to build a fully dynamic and service discovery aware Docker cluster we also needed jq and bridge-utils.
Once this base image is created you will need to make it public and re-usable. In order to do that the image has to be published in VM Depot. When you want to create a VM based on an image from VM Depot, you first need to copy the image into your own storage account – note that the first time you do this it can be a slow process (20-25 minutes to copy the 30 GB image).
Now you have an image from which you can launch your own VMs, and the Docker containers inside them. While there are a few options to do that, we needed a unified way: Cloudbreak is a cloud agnostic solution, and we do not want to create init scripts for each and every cloud environment we use. Amazon’s AWS has a feature called userdata – an option for passing data to the instance that can be used to perform common automated configuration tasks and even run scripts after the instance starts. You can pass two types of user data to Amazon AWS: shell scripts and cloud-init directives. To keep the launch process unified everywhere, we use cloud-init on Azure as well.
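To make the idea of a single, provider independent bootstrap payload more concrete, here is a minimal Python sketch. It is not the Cloudbreak init script: the bootstrap commands and the image name `example/hadoop-node` are hypothetical placeholders. It only shows how one shell style user-data script could be rendered once and then base64 encoded, which is the form many provisioning APIs expect for this kind of payload.

```python
import base64


def render_user_data(commands):
    """Render shell commands as a user-data script.

    A user-data shell script is recognised by its leading "#!" line.
    """
    lines = ["#!/bin/bash", "set -e"] + list(commands)
    return "\n".join(lines) + "\n"


def encode_user_data(script):
    """Base64 encode the script for transport in a provisioning request."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


# Hypothetical bootstrap steps, shared by every cloud provider.
BOOTSTRAP = [
    "service docker start",
    "docker pull example/hadoop-node",
]

script = render_user_data(BOOTSTRAP)
encoded = encode_user_data(script)
```

Because the same `script` string is produced regardless of the target cloud, only the thin layer that submits `encoded` to the provider needs to differ.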
You can start Docker with different networking setups: using a bridged network or using the host network. You can check the init scripts in our GitHub repository.
Note: for cloud based clusters we are giving up on the bridged network – mostly due to Azure’s networking limitations – and will use the net=host solution in the next release. The bridged network will still be supported, though we use it mostly with bare metal or multi container/host setups.
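As a rough illustration of the two networking modes, the following Python sketch builds the argument list for a `docker run` call with either `--net=host` or `--net=bridge`. The helper name `docker_run_args`, the container name and the image name are assumptions made for this example, and the sketch only assembles the command rather than executing it.

```python
def docker_run_args(image, name, network="host", extra=()):
    """Build a `docker run` command line for the given network mode.

    network: "host" shares the VM's network stack with the container,
             "bridge" attaches the container to the Docker bridge.
    """
    if network not in ("host", "bridge"):
        raise ValueError("unsupported network mode: %s" % network)
    args = ["docker", "run", "-d", "--name", name, "--net=" + network]
    args.extend(extra)   # any additional flags, e.g. environment variables
    args.append(image)   # the image name always goes after the options
    return args


# Hypothetical image and container names.
host_cmd = docker_run_args("example/hadoop-node", "node1")
bridge_cmd = docker_run_args("example/hadoop-node", "node1", network="bridge")
```

The resulting list can be passed to `subprocess.run` on a machine where Docker is installed.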
Compared with Amazon’s AWS or Google’s Compute Engine, Azure has an uncommon network setup and offers limited flexibility. To overcome this and still have a dynamic Hadoop cluster, different scenarios and use cases require different Docker networking. That is quite a large, undocumented topic which we will cover in our next blog posts – in particular the issues, differences and solutions for using Docker on different cloud providers. While we have briefly talked about Serf in the Cloudbreak documentation, we will go into deep technical detail in one of our next posts as well. Should you be interested in these, make sure you follow us on LinkedIn, Twitter or Facebook for updates.
SequenceIQ’s Azure REST API – open sourced
At SequenceIQ we always automate everything – and in order to launch VM instances, configure networks, start containers, etc. we needed a REST client which we can use from our Java and Scala codebase. Since the Microsoft API is XML based – yo, it’s 2014 – we have created and open sourced a Groovy based Azure REST API, wrapping the XML calls into a nice, clean and easy to use REST API. Feel free to use it – it’s open sourced under an Apache 2 license. Note that Cloudbreak does not store your Azure user credentials (with the default Azure CLI that would have been possible); the only thing we need from you is your subscription id. The process is documented here: http://sequenceiq.com/cloudbreak/#accounts.
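The wrapper itself is written in Groovy; the Python sketch below only illustrates the general idea of turning an XML response into a JSON friendly structure. The sample document, its namespace and its element names are hypothetical and are not taken from the real Azure API or from our client.

```python
import json
import xml.etree.ElementTree as ET


def strip_ns(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.split("}", 1)[-1]


def element_to_dict(elem):
    """Recursively convert an XML element into nested dicts, lists and strings."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    result = {}
    for child in children:
        key = strip_ns(child.tag)
        value = element_to_dict(child)
        if key in result:
            # Repeated sibling tags are collected into a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return result


# Hypothetical response body, for illustration only.
SAMPLE_XML = """<Deployment xmlns="http://example.invalid/hypothetical">
  <Name>cluster-1</Name>
  <Roles>
    <Role><RoleName>node-1</RoleName></Role>
    <Role><RoleName>node-2</RoleName></Role>
  </Roles>
</Deployment>"""

root = ET.fromstring(SAMPLE_XML)
payload = {strip_ns(root.tag): element_to_dict(root)}
as_json = json.dumps(payload, sort_keys=True)
```

This sketch ignores XML attributes and mixed content, which a production wrapper would have to handle.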
Metadata service for Azure
Another nice feature we have created for Azure VMs is a metadata service. While such a service exists on Amazon’s AWS, it is missing from Microsoft Azure – note that Cloudbreak is a cloud agnostic solution, and we always strive to use an identical approach on all cloud providers. Instance metadata is data about your instance that you can use to configure or manage the running instances, and it is available via a REST call. We have developed such a service for Azure – AzureMetadataSetup. We collect the metadata and make it available under a unique hash for each cluster by calling the following resource:
[Code listing not preserved]
This service is used in a few cases – for example to discover the network settings of a cluster, since the hosts use different network options than the Docker containers.
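To show the shape of such a lookup, here is a small in-memory Python sketch of metadata stored per cluster under a unique hash. It is not the AzureMetadataSetup implementation: the class `MetadataStore`, the field names and the way the hash is derived (SHA-256 over a salt and the cluster name) are all assumptions chosen for illustration.

```python
import hashlib


class MetadataStore:
    """Keeps instance metadata grouped by a per-cluster hash."""

    def __init__(self):
        self._clusters = {}

    @staticmethod
    def cluster_hash(cluster_name, salt):
        """Derive a hash for a cluster (hypothetical scheme)."""
        raw = (salt + ":" + cluster_name).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def register(self, cluster_hash, instance_id, private_ip, docker_subnet):
        """Record the host address and the container subnet of one instance."""
        instances = self._clusters.setdefault(cluster_hash, {})
        instances[instance_id] = {
            "privateIp": private_ip,
            "dockerSubnet": docker_subnet,
        }

    def lookup(self, cluster_hash):
        """Return a copy of all instance metadata for a cluster."""
        return dict(self._clusters.get(cluster_hash, {}))


store = MetadataStore()
key = MetadataStore.cluster_hash("cluster-1", salt="example-salt")
store.register(key, "node-1", "10.0.0.4", "172.18.1.0/24")
store.register(key, "node-2", "10.0.0.5", "172.18.2.0/24")
cluster_metadata = store.lookup(key)
```

A container that knows its cluster hash could then ask for `cluster_metadata` and learn both the host addresses and the container subnets of its peers.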
As usual for us – being committed to 100% open source – we are open sourcing everything, so you can get the details in our GitHub repository. Should you have any questions, feel free to engage with us on our blog or follow us on LinkedIn, Twitter or Facebook.