Hortonworks acquires SequenceIQ Janos Matyas

Today we are extremely excited to announce that SequenceIQ has joined forces with Hortonworks to accelerate our work to simplify the provisioning of Hadoop clusters across any environment. The SequenceIQ technologies will be integrated with the Hortonworks Data Platform and contributed to the Apache open source community later this year.

Our journey started in late February 2014, when we got together in a co-working office space and started to work on a few different projects, mainly focusing on a big data pipeline that abstracted the underlying runtimes of MR2, Tez and Spark. Along the way we were provisioning large clusters across different environments using the full Hadoop stack. As we all have a strong DevOps mindset, we always automate every recurring step – and we found Docker a perfect fit for the task.

Around spring 2014, things started to speed up and our innovative vision of running the complete Hadoop ecosystem in containers started to gain traction. The Docker containers for Apache Hadoop, Ambari, and Spark quickly became the most popular/downloaded containers on the Docker Hub (Apache Hadoop over 42000, Apache Ambari over 8200, and Apache Spark over 6600 downloads).

This led to a project we call Cloudbreak – an infrastructure-agnostic and secure Hadoop as a Service API for multi-tenant clusters. It was around the first beta release (July 2014) that we started to collaborate with Hortonworks on the project – and the Docker container-based Hadoop provisioning was presented at the Hadoop Summit in San Jose. The reception from the open source community was terrific – and what was even more amazing was that large enterprises started to PoC and deploy Hadoop clusters with Cloudbreak.

As we were focusing mostly on cloud and container-based environments, another idea arose – elasticity. As a startup we were extremely cost aware, and provisioning large (few-hundred-node) clusters on demand daily in all major cloud environments (Amazon AWS, Microsoft Azure, Google Cloud Platform and OpenStack) carried a significant financial cost. To address this, we started work on a project we call Periscope – to bring SLA policy-based autoscaling to Hadoop and provide QoS for your running applications. Like Cloudbreak, Periscope is built on top of Apache Ambari and Apache YARN – and leverages the latest cutting-edge features of these projects.

At this point we’d like to thank Google for seeing the value in our technology and supporting us with $100,000 in Google Cloud Platform credits, and our investors at Euroventures, who understood the value of elastic cloud technologies such as Cloudbreak and Periscope in managing monthly cloud provider costs.

The Power of Open Source Community

When we started the company, it was very clear that everything we do would be released under the Apache Software License V2. Nevertheless, these projects (Cloudbreak and Periscope) would not have been possible without access to open source technologies such as Apache Ambari. The Apache Ambari community helped us a lot, and our efforts were made easier by having access to the source code and being able to help define and take part in the future of the project. We became active contributors and ultimately committers on many Apache Software Foundation projects in the Hadoop ecosystem.

We are excited to be joining the Hortonworks team as we continue our work within the Apache open source community to deliver on our founding vision of simplifying and speeding Hadoop adoption.

Thanks to the team and everyone who has helped us on the journey so far. We believe that by bringing this simplified approach to provisioning Hadoop to open source, we can significantly accelerate enterprise adoption of Hadoop, even beyond the phenomenal traction we are already seeing.

Stay tuned for more news in the near term.


Periscope - Ambari 2.0 - scale based on any metric Krisztian Horvath

It’s been a while since we discussed Periscope’s scaling capabilities, but it’s time to revisit them, as we’re introducing a more generalized way to monitor and scale your cluster. In the first public beta release we relied on 5 different YARN metrics, obtained straight from the ResourceManager, to allow users to experiment with it and plan their capacity needs ahead. The feedback was really promising. Some people started extending the portfolio with new metrics, and others asked us to add certain types that suit their use cases best. In the meantime, the Ambari community started to work on redesigning the alert system, which the new version of Periscope is going to leverage.
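For context, those first-beta metrics can be inspected directly through the ResourceManager’s REST API; the snippet below is just an illustration (the hostname and port are placeholders for your own ResourceManager address):

# Query the YARN ResourceManager cluster metrics endpoint.
# The response includes fields such as appsPending, allocatedMB, availableMB and lostNodes.
curl -s http://resourcemanager-host:8088/ws/v1/cluster/metrics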

Ambari 2.0 alerts

The next version of Ambari (to be released soon) will be able to monitor any type of metric that the full Hadoop ecosystem provides. It’s really powerful, since you’ll not only be able to define simple metric alerts, but also aggregated, service-level, host-level and script-based ones. Let’s jump in and see what it looks like to define an alert that triggers if the defined root queue’s available memory falls below a certain threshold (essentially the available memory in the cluster):

{
  "AlertDefinition": {
    "cluster_name": "cluster-name",
    "component_name": "RESOURCEMANAGER",
    "description": "This alarm triggers if the free memory falls below a certain threshold. The threshold values are in percent.",
    "enabled": true,
    "ignore_host": false,
    "interval": 1,
    "label": "Allocated memory",
    "name": "allocated_memory",
    "scope": "ANY",
    "service_name": "YARN",
    "source": {
      "jmx": {
        "property_list": [
          "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root/AvailableMB",
          "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root/AllocatedMB"
        ],
        "value": "{0}/({0} + {1}) * 100"
      },
      "reporting": {
        "ok": {
          "text": "Memory available: {0} MB, allocated: {1} MB"
        },
        "warning": {
          "text": "Memory available: {0} MB, allocated: {1} MB",
          "value": 50
        },
        "critical": {
          "text": "Memory available: {0} MB, allocated: {1} MB",
          "value": 35
        },
        "units": "%"
      },
      "type": "METRIC",
      "uri": {
        "http": "",
        "https": "",
        "https_property": "",
        "https_property_value": "HTTPS_ONLY",
        "default_port": 0,
        "high_availability": {
          "alias_key": "",
          "http_pattern": "}}",
          "https_pattern": "}}"
        }
      }
    }
  }
}
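Once the definition is put together, it can be registered with Ambari through the alert definitions REST endpoint, roughly like this (the host, credentials and cluster name below are placeholders for your own environment):

# Register the alert definition with Ambari (adjust the host, credentials and cluster name).
curl -u admin:admin -H "X-Requested-By: ambari" \
  -X POST -d @allocated_memory_alert.json \
  http://ambari-host:8080/api/v1/clusters/cluster-name/alert_definitions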
Read on →

OpenStack integration with Cloudbreak Attila Kanto

Cloudbreak can provision HDP clusters on different public cloud providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure. Since the last release, Cloudbreak supports provisioning Hadoop on OpenStack as well. OpenStack is probably the most popular open-source cloud computing platform for private clouds. This blog post explains in a nutshell how the OpenStack integration with Cloudbreak was done in order to provision Hadoop – but if you are just interested in playing with OpenStack, it is also worth a read, as the Set Up Your Own Private Cloud section explains how to install DevStack (OpenStack suitable for development purposes) with just a few commands.
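As a quick preview of that section, a DevStack install usually boils down to cloning the repository, writing a minimal local.conf and running the stack script – a sketch along these lines (the passwords are placeholders; run it on a disposable VM):

# Minimal DevStack setup sketch - run as a non-root user on a dedicated VM.
git clone https://github.com/openstack-dev/devstack
cd devstack
cat > local.conf <<'EOF'
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=secret
RABBIT_PASSWORD=secret
SERVICE_PASSWORD=secret
EOF
./stack.sh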

Public and Private Clouds

The overly simplified definition of the two deployment models:

  • a public cloud consists of services that are usually purchased on demand and provided off-site over the Internet by a cloud provider
  • a private cloud is one in which the services and infrastructure are purchased, maintained and managed within the company

Read on →


Cluster extensions with Cloudbreak recipes Marton Sereg

With the help of Cloudbreak it is very easy to provision Hadoop clusters in the cloud from an Apache Ambari blueprint. That’s cool, but it is often necessary to make some additional changes on the nodes, like putting a JAR file on the Hadoop classpath or running some custom scripts. To help with these kinds of situations we are introducing the concept of Cloudbreak recipes. Recipes are basically script extensions to a cluster that run on a set of nodes before or after the Ambari cluster installation.
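As a purely illustrative example (this is not the actual recipe format, just the kind of script a recipe would wrap), a pre-install script dropping an extra JAR onto the Hadoop classpath might look like this – the artifact URL and target directory are placeholders:

#!/bin/bash
# Hypothetical pre-install script: fetch a custom JAR and place it on the Hadoop classpath.
# Both the artifact URL and the target directory are illustrative placeholders.
curl -Lo /usr/hdp/current/hadoop-client/lib/custom-serde.jar \
  http://repo.example.com/artifacts/custom-serde.jar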

How does it work?

Since the latest release, Cloudbreak uses Consul for cluster membership instead of Serf, so we can make use of Consul’s other features, namely events and the key-value store. How these Consul features work won’t be detailed here; a whole post about Consul-based clusters is coming soon. Recipes use one more thing: the small plugn project.

The main concept is the following: before the cluster install is started, a recipe-pre-install Consul event is sent to the cluster, which triggers the recipe-pre-install hook of the enabled plugins, thereby executing the plugins’ recipe-pre-install scripts. After the cluster installation is finished, the same happens with the recipe-post-install event and hook. The key-value store is used to signal plugin success or failure: after the plugins finish execution on a node, a new Consul key is added in the format /events/<event-id>/<node-hostname> that contains the exit status. Cloudbreak can then check the key-value store to see whether the recipe finished successfully.
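To make the flow a bit more tangible, the same interactions can be reproduced by hand against Consul’s HTTP API – the host, event id and hostname below are placeholders:

# Fire the pre-install event on the cluster through the Consul event API.
curl -X PUT http://localhost:8500/v1/event/fire/recipe-pre-install

# Once a plugin finishes on a node, its exit status ends up under /events/<event-id>/<node-hostname>;
# reading the key back shows whether the recipe succeeded (0) or failed.
curl http://localhost:8500/v1/kv/events/<event-id>/<node-hostname>?raw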

Read on →

Install Apache Spark with Cloudbreak Oliver Szabo

In the previous weeks many of you asked us how to run our Apache Spark Docker container on a multi-node cluster, or how to install Spark and use it with Cloudbreak. Cloudbreak uses Ambari (1.7) blueprints to provision multi-node HDP clusters (on different cloud providers: AWS, Google Cloud, Azure and OpenStack – with Rackspace and HP Helion coming soon).

In this post we’d like to help you install Spark with Cloudbreak in a quick and easy way.

First of all, you will have to create a cluster with Cloudbreak on your favorite cloud provider – Google Cloud, AWS, Azure or OpenStack (check this post) – using a simple multi-node-hdfs-yarn blueprint. After your cluster is ready, you can install Apache Spark with the following steps:

Install from the cloud instance

First, you need to log in to one of your cloud instances. Then use the one-liner below:

curl -Lo .docker-spark-install j.mp/spark-hdp-docker-install && . .docker-spark-install

After the file is downloaded it will be sourced; then you can use the following command:

install-spark ambari-agent install
Read on →

Cloudbreak - new release available Janos Matyas

We are happy to announce that the Release Candidate version of Cloudbreak is just around the corner. This major release is the last one in the public beta program and contains quite a few new features and architectural changes.

All these major new features will be covered in the coming weeks on our blog, but in the meantime let us quickly skim through them.

Accounts

We have introduced the concept of accounts: after registering and signing in for the first time, a user has the option to invite other people into the account. As the administrator of the account, you can activate, deactivate and grant admin rights to all the invited users.

Users can share resources (such as cloud credentials, templates, blueprints and clusters) within the account by making them public in the account, but can create their own private resources as well. As you might already be aware, we use OAuth2 to make all of this possible.

Usage explorer

We have built a unified (across all cloud providers) usage explorer tool, where you can drill down into the details of your usage history (or your account’s, if you have admin rights). You can filter by date, user, cloud provider, region, etc., and generate a consolidated table/chart overview.

Heterogeneous clusters

This is a feature many have asked for – and we are happy to deliver it. Up until now, all the nodes in your YARN clusters were built on the same cloud instance type. While this was an easy and convenient way to build a cluster back in the MR1 era (as far as we are aware, all Hadoop as a Service providers still do it this way), times have changed, and with the emergence of YARN different workloads are running within a cluster.

For example, while Spark jobs require a high-memory instance, legacy MR2 code might require a high-CPU instance, whereas an HBase RegionServer prefers one with high I/O throughput.

At SequenceIQ we quickly realized this, and the new release allows you to apply different stack templates to these YARN services/components. We do the heavy lifting for you in the background – the only thing you have to do is associate stack templates with Ambari hostgroups.

This is a major step forward when you are running different workloads on your YARN cluster – not just saving on costs, but at the same time increasing your cluster’s throughput.

Read on →

Apache Spark 1.2.0 on Docker Janos Matyas

In this post we’d like to help you get started with the latest Spark release (1.2.0) in minutes, using Docker. Though we released and pushed the container to the official Docker repository over the holidays, we still owed you the post. Here are the details…

Docker and Spark are two technologies that are very hyped these days. At SequenceIQ we use both quite a lot, so we put together a Docker container and are sharing it with the community.

The container’s code is available in our GitHub repository.

Pull the image from Docker Repository

We suggest always pulling the container from the official Docker repository, as it is maintained and supported by us.

docker pull sequenceiq/spark:1.2.0
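Once the image is pulled, a quick way to try it is to start an interactive container and launch the Spark shell against the bundled YARN setup – a sketch based on the image’s README, so adjust the memory settings to your needs:

# Start the container with an interactive shell.
docker run -i -t -h sandbox sequenceiq/spark:1.2.0 bash

# Inside the container, start the Spark shell against YARN.
spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1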
Read on →

Docker containers as Apache YARN containers Attila Kanto

The Hadoop 2.6 release contains a new feature that allows Docker containers to be launched directly as YARN containers. Basically, this solution lets developers package their applications and all of their dependencies into a Docker container in order to provide a consistent environment for execution, and it also provides isolation from other applications or software installed on the host.

Configuration

To launch YARN containers as Docker containers, the DockerContainerExecutor and the Docker client need to be set up in the yarn-site.xml configuration file:

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
</property>

<property>
  <name>yarn.nodemanager.docker-container-executor.exec-name</name>
  <value>/usr/local/bin/docker</value>
</property>

As the documentation states, the DockerContainerExecutor requires the Docker daemon to be running on the NodeManagers and the Docker client to be available as well. At SequenceIQ we have already packaged the whole Hadoop ecosystem into Docker containers, so we already have a Docker daemon and Docker client; the only problem is that they are outside of our Hadoop container, and therefore the NodeManager or any other process running inside the container does not have access to them. In one of our earlier posts we considered running the Docker daemon inside Docker, but instead of running Docker in Docker it is much simpler to just reuse the Docker daemon and Docker client that were used for launching the SequenceIQ containers.
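For completeness, once the executor is configured, a job selects the Docker image its tasks should run in by setting the image name in the task and ApplicationMaster environments; the Hadoop 2.6 documentation shows an example roughly like the following (the examples jar path and image name here are placeholders):

# Submit a MapReduce example job whose tasks run inside the given Docker image
# (per the DockerContainerExecutor documentation; adjust the jar path and image name).
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar teragen \
  -Dmapreduce.map.env="yarn.nodemanager.docker-container-executor.image-name=sequenceiq/hadoop-docker:2.6.0" \
  -Dyarn.app.mapreduce.am.env="yarn.nodemanager.docker-container-executor.image-name=sequenceiq/hadoop-docker:2.6.0" \
  1000 teragen_out_dir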

Read on →

New Cloudbreak release - support for HDP 2.2 Krisztian Horvath

The last two weeks were pretty busy for us – we Dockerized the new release of Ambari (1.7.0), integrated Periscope with Cloudbreak, and now we are announcing a new Cloudbreak release that uses Ambari 1.7.0 and has full support for the Hortonworks HDP 2.2 and Apache Bigtop stacks. But first – since this has been asked many times – see a short movie of Cloudbreak and Periscope in action.

On-demand Hadoop cluster with autoscaling

Read on →

Cloudbreak welcomes Periscope Richard Doktorics

Today we have pushed out a new release of Cloudbreak – our Docker container-based and cloud-agnostic Hadoop as a Service solution – containing a few major changes. While there are many significant changes (both functional and architectural), in this blog post we’d like to describe one of the most anticipated ones: the autoscaling of Hadoop clusters.

Just to quickly recap, Cloudbreak allows you to provision clusters – full stacks – in all major cloud providers using a unified API, UI or CLI/shell. Currently we support provisioning of clusters in AWS, Google Cloud, Azure and OpenStack (in private beta) – new cloud providers can be added quite easily (as everything runs in Docker) using our SDK.

Periscope allows you to configure SLA policies for your Hadoop cluster and scale up or down on demand. You can set alarms and notifications for different metrics such as pending containers, lost nodes or memory usage, and define SLA scaling policies based on these alarms.

Today’s release makes the integration between the two projects available (they work independently as well) and allows subscribers to enable autoscaling for their already deployed or newly created Hadoop clusters.

We would like to guide you through the UI and help you set up an autoscaling Hadoop cluster.

Read on →
