Author Archives: senthil

Blog Feb14

Hadoop weekly news February 2014


February 3, 2014

BeyeNetwork has a post on an often overlooked differentiator between SQL-on-Hadoop systems and proprietary database systems. In proprietary systems, the query engine, storage format, and file system are typically tightly coupled. Conversely, many of the SQL-on-Hadoop systems use the Hive Metastore both to find data on HDFS and to discover its storage format. Since the storage layer and storage formats are decoupled from the query engines, it’s easy to switch between engines such as Impala, Hive, and Presto.
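As a rough illustration of that decoupling, here is a minimal Python sketch. It is not the Hive Metastore API: the `CATALOG` dictionary, the `web_logs` table, its path, and the helper functions are all hypothetical stand-ins. The point is only that several engines can plan a scan from one shared catalog entry.

```python
# Hypothetical in-memory stand-in for a shared catalog such as the Hive Metastore.
# The table name, location, and format below are illustrative assumptions.
CATALOG = {
    "web_logs": {"location": "hdfs:///warehouse/web_logs", "format": "parquet"},
}


def resolve_table(name):
    """Look up where a table lives and how it is stored."""
    entry = CATALOG[name]
    return entry["location"], entry["format"]


def plan_scan(engine, table):
    """Each engine builds its own scan plan from the same catalog entry."""
    location, fmt = resolve_table(table)
    return f"{engine}: scan {location} as {fmt}"


# Three different engines, one shared description of the data.
plans = [plan_scan(engine, "web_logs") for engine in ("hive", "impala", "presto")]
for plan in plans:
    print(plan)
```

Because the engines share only the catalog entry, swapping one engine for another in this sketch changes nothing about where or how the data is stored.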


MapR supports 5 different SQL-on-Hadoop technologies
February 3, 2014

MapR supports 5 different SQL-on-Hadoop technologies as part of their distribution. Given the large number of possible solutions, MapR has posted a detailed comparison to help users choose the best technology for their problem. The page covers Hive, Drill, Impala, Presto, and Shark across categories like SQL completeness and UDF support.

Read more

Cloudera Releases
February 3, 2014

Cloudera has released Cloudera Manager 4.8.1. In addition to resolving several issues, the release adds the ability to distribute Apache Spark (incubating) via parcels on a CDH 4 cluster.

Read more

Deploying a Hadoop Cluster on Amazon EC2 with HDP2
February 3, 2014

In this post, we’ll walk through the process of deploying an Apache Hadoop 2 cluster on the EC2 cloud service offered by Amazon Web Services (AWS), using Hortonworks Data Platform (HDP).
Both EC2 and HDP offer many knobs and buttons to cater to your specific performance, security, cost, data size, data protection, and other requirements. I will not discuss most of these options in this blog, as the goal is to walk through one particular deployment path to get started.

Elasticsearch 0.90.11 and 1.0.0.RC2 released
February 3, 2014

We are happy to announce the release of Elasticsearch 0.90.11 and Elasticsearch 1.0.0.RC2, both of which are based on Lucene 4.6.1.

Read more

Spark is Now Generally Available for Cloudera Enterprise
February 3, 2014

Cloudera is announcing the general availability of support for Spark, bringing interactive machine learning and stream processing to enterprise data hubs.

Read more

Blog Jan14

Altiscale is Now Available to Everyone

January 29th, 2014

This morning we announced General Availability of our Altiscale Data Cloud Hadoop-as-a-Service (HaaS) offering to any business interested in the transformative power of Hadoop. In the private beta period leading up to the launch, we’ve come to appreciate more than ever the important role a Hadoop operations team can play in helping you use Hadoop more effectively.

Read more


Keen IO’s High-Performance Analytics API Powered by Apache Cassandra
January 20th, 2014

Apache Cassandra at Keen IO

Cassandra is the event data store in the most recent version of our architecture. Right now, Cassandra is running in production and supporting nearly all of our biggest customers. The real-time ingestion side of the API collects JSON and stores it in Cassandra. We don’t store the data as raw JSON, but convert it to a much more efficient format that leverages Cassandra’s columnar orientation. We serve ad-hoc queries directly from this optimized format and are able to perform very low latency queries (seconds) over very large datasets (gigabytes).
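The post does not describe Keen IO's actual storage format, so the following is only a sketch of the general idea of turning a nested JSON event into flat column name/value pairs, which is one shape that maps naturally onto a wide-column store. The `flatten_event` helper, the dotted-path naming, and the sample event are all assumptions made for illustration.

```python
import json


def flatten_event(event, prefix=""):
    """Flatten a nested dict into {dotted.path: value} pairs."""
    columns = {}
    for key, value in event.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column path.
            columns.update(flatten_event(value, path))
        else:
            columns[path] = value
    return columns


# Hypothetical incoming event, as the ingestion API might receive it.
raw = '{"type": "purchase", "user": {"id": 42, "plan": "pro"}, "amount": 9.99}'
columns = flatten_event(json.loads(raw))
print(columns)
```

Each resulting key could serve as a column name within a row, so a query that needs only `user.plan` would not have to parse the whole JSON document.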

Read more

HBase 0.96: HBase on Windows, and Improvements to MTTR

January 23rd, 2014

I recently sat down with Devaraj Das and Carter Shanklin to discuss the dramatic improvements delivered in Apache HBase version 0.96 included in HDP 2.0.

Now HBase runs on Windows and (whether on Linux or Windows) it recovers from failures much more quickly, with dramatic improvements in mean time to recovery (MTTR).

Devaraj is one of the original architects of Apache Hadoop and Carter is the Hortonworks product manager focused on HBase. Together, they explain their collaboration with Microsoft to bring HBase to HDP 2.0 for Windows.

Read more

Google Cloud Storage Connector For Hadoop Launches As Data Analytics Becomes A Priority For Cloud Providers
Jan 14, 2014
Google Cloud Storage has long had the ability to run Hadoop so developers can do advanced analytics on its distributed computing platform. Today, Google is attempting to simplify this process with a new connector that the company says makes it easier to run Hadoop on the Google Cloud Platform.
Read more

Twitter Summingbird 0.3.2
26 January 2014
Twitter Summingbird 0.3.2 was released. Summingbird is a framework for supporting hybrid streaming and batch computation (e.g. online with Apache Storm and offline with MapReduce). This release includes bug fixes and some new features.
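To give a feel for what hybrid batch and streaming computation means, here is a small Python sketch. It is not Summingbird's API (Summingbird is a Scala library); the `count_words` helper and the sample inputs are hypothetical. It shows one way the two modes can cooperate: if the aggregation is an associative merge, such as adding word counts, a batch result and a streaming result can be combined into a single answer.

```python
from collections import Counter


def count_words(lines):
    """Count word occurrences; the same logic serves batch and streaming input."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Hypothetical inputs: older data processed offline, recent data processed online.
batch_counts = count_words(["hadoop spark", "hadoop hive"])
stream_counts = count_words(["spark storm"])

# Counter addition is associative, so partial results merge in any grouping.
merged = batch_counts + stream_counts
print(merged)
```

The same merge step applies however the input is split between the offline and online paths, which is what lets one job definition run in both modes in this sketch.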
Read more

Introducing WSJ’s ‘Billion-Dollar Startup Club’ Interactive
January 23, 2014
For private tech startups, notching a billion-dollar valuation used to signify entry into an exclusive club. Now it’s more like a party.
Some three dozen companies make up the Billion-Dollar Startup Club, a new interactive online chart of the most valuable venture capital-backed companies. The chart was created by The Wall Street Journal in conjunction with Dow Jones VentureSource. The Journal will update the list as companies move up, down, on and off.

Read more