Multinode cluster with Ambari 1.7.0 - in Docker
Janos Matyas

Two days ago the latest version of Ambari (1.7.0) was released, and now it is time for us to release our automated process for deploying Hadoop clusters with Ambari in Docker containers.

The release contains lots of new features (follow this link) – we will highlight a few we consider important for us:

  • Ambari Views – a systematic way to plug in UI capabilities that surface custom visualization, management and monitoring features in Ambari Web.
  • Extended/new stack definitions – support for Hortonworks HDP and Apache Bigtop stacks.
  • Apache Slider integration – eases the deployment of existing applications into a YARN cluster.

As usual, we have dockerized the whole of Ambari 1.7.0, so you can take the container and provision a Hadoop cluster of arbitrary size.

Get the Docker container

If you don’t have Docker yet, browse our previous posts: we have several covering how-tos, examples and best practices for Docker in general, and in particular how to run the full Hadoop stack on Docker.

docker pull sequenceiq/ambari:1.7.0

Running Hadoop 2.6.0 in Docker containers
Janos Matyas

Yesterday the Hadoop community released version 2.6.0 of Hadoop – the 4th major release of this year – which contains quite a few new and interesting features:

  • Rolling upgrades – the holy grail for enterprises that need to switch Hadoop versions
  • Long running services in YARN
  • Heterogeneous storage in HDFS

These were the most popular features, but there were quite a few other extremely important ones as well, at least for us and our Periscope project. As you might be aware, we are working on an SLA policy based autoscaling API for Apache YARN, and we have closely followed, been involved in or contributed to the JIRAs below:

  • YARN-2248 – CS changes for moving apps between queues
  • YARN-1051 – YARN Admission Control/Planner

These tasks/subtasks (besides a few others) all come with the new major release and open up brand new opportunities to make Hadoop YARN a more dynamic environment. Considering these and the Apache Slider project, it’s pretty clear that exciting times are coming.

We have combined all of the above with Docker and our open source projects, Cloudbreak and Periscope, and we are leveraging these innovations, so stay tuned and get the code or our Docker containers to start with.

In the meantime (as usual) we have released our Hadoop 2.6.0 container to give you a quick start with Hadoop.

DIY – Build the image

In case you’d like to try directly from the Dockerfile you can build the image as:

docker build  -t sequenceiq/hadoop-docker:2.6.0 .

Periscope: time based autoscaling
Krisztian Horvath

Periscope allows you to configure SLA policies for your cluster and scale up or down on demand. You can set alarms and notifications for different metrics such as pending containers, lost nodes or memory usage. Recently we got a request to scale based on time intervals. What does this mean? It means that you can tell Periscope to shrink your cluster down to an arbitrary number of nodes after work hours or at weekends, and grow it back by the time people start work. We thought it would make a really useful feature, so we quickly implemented it and made it available. You can learn more about the Periscope API here.

Cost efficiency

In this example we’ll configure Periscope to downscale at 7PM and upscale at 8AM from Monday to Friday.
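To make the schedule concrete, here is a minimal Python sketch of the decision such a time based policy encodes. It is not the Periscope API: the function name and constants are hypothetical, and the node counts (100 and 10) are borrowed from the cost figures in this post.

```python
from datetime import datetime

# Hypothetical constants for illustration only; not Periscope configuration keys.
WORK_START_HOUR = 8    # upscale at 8AM
WORK_END_HOUR = 19     # downscale at 7PM
FULL_NODES = 100       # cluster size during work hours
SMALL_NODES = 10       # cluster size outside work hours


def target_node_count(now: datetime) -> int:
    """Return the desired cluster size for the given moment."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    in_work_hours = WORK_START_HOUR <= now.hour < WORK_END_HOUR
    return FULL_NODES if is_weekday and in_work_hours else SMALL_NODES


if __name__ == "__main__":
    print(target_node_count(datetime.now()))
```

With this rule the cluster stays small from Friday 7PM until Monday 8AM, which is what a downscale and upscale pair restricted to weekdays implies.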

Just to make things easier let’s assume that our cluster is homogeneous. On AWS a c3.xlarge instance costs $0.210 per hour. Now let’s do the math:

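  • Always on, 100 nodes: 24 x 0.21 x 100 = $504 per day
  • Scaled, 100 nodes for 11 hours and 10 nodes for 13 hours: (11 x 0.21 x 100) + (13 x 0.21 x 10) = $258.30 per day

Scaling from 100 nodes to 10 and back saves $245.70 per day, or about $7,371 in a 30-day month – and the extra savings from staying scaled down over the weekends are not even counted.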
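The arithmetic can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the post (a homogeneous cluster, $0.21 per node-hour, 100 nodes by day and 10 by night); the 30-day month and the variable names are our own choices for illustration.

```python
from decimal import Decimal

PRICE_PER_HOUR = Decimal("0.21")  # c3.xlarge price per hour quoted in the post
FULL_NODES = 100                  # cluster size between 8AM and 7PM
SMALL_NODES = 10                  # cluster size between 7PM and 8AM
HOURS_UP = 11                     # 8AM to 7PM
HOURS_DOWN = 13                   # 7PM to 8AM


def daily_cost(hours_up: int, hours_down: int) -> Decimal:
    """Cost of one day: full size for hours_up, reduced size for hours_down."""
    node_hours = hours_up * FULL_NODES + hours_down * SMALL_NODES
    return PRICE_PER_HOUR * node_hours


always_on = daily_cost(24, 0)
scaled = daily_cost(HOURS_UP, HOURS_DOWN)
daily_saving = always_on - scaled
monthly_saving = daily_saving * 30  # assumes a 30-day month, nightly downscale only

print(f"always on:      ${always_on:.2f} per day")
print(f"scaled:         ${scaled:.2f} per day")
print(f"daily saving:   ${daily_saving:.2f}")
print(f"monthly saving: ${monthly_saving:.2f}")
```

Decimal is used instead of float so that the cents come out exactly.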

YARN containers as Docker containers in Docker
Janos Matyas

The new Hadoop 2.6 release is almost here, with an impressive set of new features and collaboration from the community – including SequenceIQ. It’s no secret that we use Docker quite a lot and have containerized the full Hadoop ecosystem, and the 2.6 release will contain a new feature that is relevant here: YARN-1964.

YARN containers as .. Docker containers

The introduction of YARN revolutionized Hadoop – extending it with a new resource management layer, opening it up to different workloads, etc. – but that is all history and we know it. With the emergence and wide adoption of Docker we are living in interesting times again. Hadoop 2.6 will introduce support (though in alpha) for running YARN containers as Docker containers.

As a reminder, a container is the resource allocation that results from the ResourceManager successfully granting a specific ResourceRequest. A Container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host, and isolates it from other containers. Sounds familiar? Among a few other things, this is exactly what Docker does – with the additional benefit of packaging and shipping applications the easy way.

Building the data lake in the cloud - Part2
Marton Sereg

A few weeks ago we had a post about building a data lake in the cloud using a cloud based object store as the primary file system. In this post we’d like to move forward and show you how to create an always-on, persistent data lake with Cloudbreak, and how to create ephemeral clusters that can be scaled up and down based on configured SLA policies using Periscope.

Just as a quick reminder: both are open source projects under the Apache 2 license, and the documentation and code are available at the links below.

Name | Description | Documentation | GitHub
Cloudbreak | Cloud agnostic Hadoop as a Service | http://blog.sequenceiq.com/blog/2014/07/18/announcing-cloudbreak/ | https://github.com/sequenceiq/cloudbreak
Periscope | SLA policy based autoscaling for Hadoop clusters | http://blog.sequenceiq.com/blog/2014/08/27/announcing-periscope/ | https://github.com/sequenceiq/periscope

Sample architecture

For the sample use case we will create a data lake on both AWS and Google Cloud, and use the most popular data warehouse software with an SQL interface: Apache Hive.

Extreme OLAP Engine running in Docker
Krisztian Horvath

Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides an SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

At SequenceIQ we are always interested in the latest emerging technologies, and try to offer them to our customers and the open source community. A few weeks ago eBay Inc. released Kylin as an open source product and made it available to the community under an Apache 2 license. Since we share their approach to open source software, we have partnered with them to Dockerize Kylin, making it extremely easy to deploy Kylin locally or in the cloud using our Hadoop as a Service API – Cloudbreak.

While pretty good documentation is available for Kylin, we’d like to give you a really short introduction and overview.

New YARN features: Label based scheduling
Krisztian Horvath

The release of Hadoop 2.6.0 is upon us, so it’s time to highlight a few upcoming features, especially those which we are building on or planning to use in our Hadoop as a Service API – Cloudbreak – and our SLA policy based autoscaling API – Periscope.

Recently we explained how the CapacityScheduler and the FairScheduler work. The upcoming release is about to add a few really interesting features to them which you should be aware of, as they might change the way we think about resource scheduling. The first one we are going to discuss is label based scheduling, although it’s not fully finished yet. You can track its progress here: YARN-796.

Motivation

Hadoop clusters are usually not fully homogeneous, which means that different nodes can have different parameters. For example, some nodes have more memory than others, while others have better CPUs or better network bandwidth. At the moment YARN doesn’t have the ability to segregate nodes in a cluster based on their architectural parameters, so applications which are aware of their resource usage cannot choose which nodes they want to run their containers on. Labels are about to solve this problem. Administrators will be able to mark nodes with different labels such as cpu, memory, network, rackA or rackB, so applications can specify where they’d like to run.

Cloud

Things are different in cloud environments, as the composition of Hadoop clusters is more homogeneous: by the nature of the cloud it’s easier and more convenient to request nodes with exactly the same capabilities. Cloudbreak, our Hadoop as a Service API, will address this problem by letting users specify their needs. Take one example: on AWS users can launch spot price instances, which EC2 can take away at any time. By labeling them as spot we can avoid spinning up ApplicationMasters on those nodes, and thus operate safely and re-launch containers on different nodes if an instance is taken away. Furthermore, Periscope with its autoscaling capabilities will be able to scale out with nodes that are marked with cpu.
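The idea of label based placement can be illustrated with a small Python sketch. This is a conceptual model only and not the YARN API: the Node class and both helper functions are hypothetical, and the labels (spot, cpu) are the ones mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster node with a set of administrator-assigned labels (hypothetical model)."""
    name: str
    labels: set = field(default_factory=set)


def eligible_for_app_master(nodes):
    """Nodes where an ApplicationMaster may start: anything not labeled 'spot'."""
    return [n for n in nodes if "spot" not in n.labels]


def nodes_with_label(nodes, label):
    """Nodes carrying a given label, e.g. 'cpu' for CPU-heavy containers."""
    return [n for n in nodes if label in n.labels]


if __name__ == "__main__":
    cluster = [Node("node1", {"spot"}), Node("node2", {"cpu"}), Node("node3")]
    print([n.name for n in eligible_for_app_master(cluster)])
    print([n.name for n in nodes_with_label(cluster, "cpu")])
```

In a real cluster the scheduler would make this decision; the sketch only shows the kind of filtering that labels make possible.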

Securing Cloudbreak with OAuth2 - part 2
Marton Sereg

A few weeks ago we published a blog post about securing our Cloudbreak infrastructure with OAuth2. We discussed how we set up and configured a new UAA OAuth2 identity server with Docker, but we didn’t detail how to use this identity server in client applications. That’s exactly what we’ll do now: we’ll show some code examples of how to obtain tokens from different clients and how to check these tokens in resource servers.

We’re using almost every type of the OAuth2 flows in our infrastructure: Cloudbreak and Periscope act as resource servers while Uluwatu and Cloudbreak shell for example are clients for these APIs.

Obtaining an access token

The main goal of an OAuth2 flow is to obtain an access token for the resource owner that can later be used to access a resource server. There are multiple common flows depending on the client type; we’ll show examples for three of them: implicit, authorization code and client credentials. If you’re not familiar with the roles and terms used in the OAuth2 flows, I suggest checking out some “Getting started” resources before going forward with this post.

Implicit flow

This is not the most common OAuth2 flow, but it is the simplest one, because only one request needs to be made to the identity server and the token arrives directly in the response. Two different types of this flow are supported by UAA: one for browser-based applications and one for scenarios where there is no browser interaction (e.g. CLIs). What these scenarios have in common is that a client secret would be useless, because it couldn’t be kept secret.

We are using the implicit flow with credentials in the Cloudbreak Shell. When using the shell you must provide your SequenceIQ credentials as environment variables and the shell uses those to obtain an access token. Cloudbreak shell is written in Java but let’s see a basic curl example instead – it does exactly the same as the Java code. (If you’re still eager you can check out the code here)

curl -iX POST -H "accept: application/x-www-form-urlencoded"  \
 -d 'credentials={"username":"admin","password":"periscope"}' \
 "http://localhost:8080/oauth/authorize?response_type=token&client_id=cli&scope.0=openid&redirect_uri=http://cli"
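In the implicit flow the OAuth2 specification places the access token in the fragment of the redirect URI. Assuming the request above is answered with a redirect whose Location header has that shape, a client could extract the token as in the Python sketch below. The function name and the sample Location value are made up for illustration; this is not the Cloudbreak shell code.

```python
from urllib.parse import parse_qs, urlsplit


def token_from_redirect(location: str) -> str:
    """Extract the access token from the fragment of an implicit-flow redirect URI."""
    fragment = urlsplit(location).fragment
    params = parse_qs(fragment)
    tokens = params.get("access_token")
    if not tokens:
        raise ValueError("no access_token in redirect fragment")
    return tokens[0]


if __name__ == "__main__":
    # Hypothetical Location header value; a real token is much longer.
    sample = "http://cli#access_token=abc123&token_type=bearer&expires_in=43199"
    print(token_from_redirect(sample))
```

The extracted value would then typically be sent to the resource server in an Authorization: Bearer header.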

YARN Timeline Service
Laszlo Puskas

As you may know from our earlier blog posts, we are continuously monitoring our YARN clusters and trying to find out what happens inside them, be it MapReduce jobs, TEZ DAGs, etc. We’ve analyzed our clusters from various aspects so far; now it’s time to take a look at the information provided by the built-in YARN timeline service.

This post is about how to set up a YARN cluster so that the Timeline Server is available and how to configure applications running in the cluster to report information to it. As an example we’ve chosen to run a simple TEZ example. (MapReduce2 also reports to the timeline service)

As a playground we will use a multinode cluster set up on the local machine; alternatively one could do the same on a cluster provisioned with Cloudbreak. Cluster nodes run in Docker containers, and YARN / TEZ provisioning and configuration is done with Apache Ambari.

Building a multinode cluster

To build a multinode cluster we use a set of commodity functions that you can install by running the following in a terminal:

curl -Lo .amb j.mp/docker-ambari && . .amb

(The commodity functions use our docker-ambari image: sequenceiq/ambari:1.6.0)
