Automated Install of HDP 2.1 for Hadoop on Windows
April 24th, 2014
Hortonworks Data Platform 2.1 for Windows is the 100% open source data management platform based on Apache Hadoop and available for the Microsoft Windows Server platform. I have built a helper tool that automates the process of deploying a multi-node Hadoop cluster – utilizing the MSI available in HDP 2.1 for Windows.
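The core of such a helper is building an unattended `msiexec` command for the HDP MSI on each node. Below is a hypothetical sketch of that step; the MSI path and the property names (`HDP_LAYOUT`, `HDP_DIR`, `DESTROY_DATA`) are assumptions on my part, so check the HDP 2.1 for Windows install documentation for the exact ones.

```python
# Hypothetical sketch: build a silent-install msiexec command line for one
# cluster node. Property names below are assumptions, not verified docs.
import subprocess

def hdp_install_cmd(msi_path, layout_file, install_dir):
    """Build the unattended-install command for the HDP MSI."""
    return [
        "msiexec", "/qn", "/lv", "hdp-install.log",
        "/i", msi_path,
        "HDP_LAYOUT={}".format(layout_file),   # clusterproperties.txt
        "HDP_DIR={}".format(install_dir),
        "DESTROY_DATA=yes",
    ]

cmd = hdp_install_cmd(r"C:\hdp\hdp-2.1.msi",
                      r"C:\hdp\clusterproperties.txt",
                      r"C:\hdp\hadoop")
# On each Windows node the helper would then run:
#     subprocess.check_call(cmd)
```

The same command list can then be pushed to every node (e.g. over WinRM) to bring up the whole cluster from one cluster-properties file.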
A New Python Client for Impala
April 30, 2014
The new Python client for Impala will bring smiles to Pythonistas!
As a data scientist, I love using the Python data stack. I also love using Impala to work with very large data sets. But things that take me out of my Python workflow are generally considered hassles; so it’s annoying that my main options for working with Impala are to write shell scripts, use the Impala shell, and/or transfer query results by reading/writing local files to disk.
To remedy this, we have written the (unpredictably named) impyla Python package (not officially supported). Impyla communicates with Impala using the same standard Impala protocol as the ODBC/JDBC drivers. This RPC library is then wrapped by the commonly used database API specified in PEP 249 (“DB API v2.0”). Below you’ll find a quick tour of its functionality. Note this is still a 0.x.y release: the PEP 249 client is beta, while the sklearn and udf submodules are pre-alpha.
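Because impyla follows PEP 249, code written against it looks like code written against any other DB API v2.0 driver. The sketch below uses the stdlib `sqlite3` module, another PEP 249 implementation, as a stand-in since no Impala cluster is at hand; with impyla, the only line that changes is the connect call (something like `impala.dbapi.connect(host='impala-host', port=21050)`, where the host and port are placeholders).

```python
# PEP 249 ("DB API v2.0") usage pattern, demonstrated with sqlite3 as a
# stand-in for impyla's connection. The cursor/execute/fetch flow is the
# same across any conforming driver.
import sqlite3

conn = sqlite3.connect(":memory:")   # with impyla: impala.dbapi.connect(...)
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
cur.execute("SELECT SUM(x) FROM t")
total = cur.fetchone()[0]
conn.close()
```

Staying inside PEP 249 is exactly what keeps you in the Python workflow: query results arrive as Python tuples, ready for the rest of the data stack, with no shell scripts or local files in between.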
How Impala Brings Real-Time, Big Data Analytics to Digital Reasoning’s Users
March 28, 2014
At the beginning of each release cycle, engineers at Digital Reasoning are given time to explore the latest in Big Data technologies, examining how the frequently changing landscape might be best adapted to serve our mission. As we sat down in the early stages of planning for Synthesys 3.8, one of the biggest issues we faced involved reconciling the tradeoff between flexibility and performance. How can users quickly and easily retrieve knowledge from Synthesys without being tied to one strict data model?
Introduction to Apache Falcon: Data Governance for Hadoop
March 25th, 2014
Hortonworks recently released the next version of HDP, which includes Apache Falcon. Apache Falcon is a data governance engine that defines, schedules, and monitors data management policies. Falcon allows Hadoop administrators to centrally define their data pipelines, and then Falcon uses those definitions to auto-generate workflows in Apache Oozie. Falcon originated at InMobi.
I hope it is going to make my current work easier.
Apache Storm and Hadoop
March 24th, 2014
In February 2014, the Apache Storm community released Storm version 0.9.1. Storm is a distributed, fault-tolerant, and high-performance real-time computation system that provides strong guarantees on the processing of data. Hortonworks is already supporting customers using this important project today.
Many organizations have already used Storm, including our partner Yahoo! This version of Apache Storm (version 0.9.1) is:
• Highly scalable. Like Hadoop, Storm scales linearly
• Fault-tolerant. Automatically reassigns tasks if a node fails
Even though Storm provides a lot of functionality and integration in HDP, I am keeping my fingers crossed, given the similar offerings from other open source projects.
Hadoop GroupMapping – LDAP Integration
March 21st, 2014
LDAP provides a central source for maintaining users and groups within an enterprise. There are two ways to use LDAP groups within Hadoop. The first is to use OS level configuration to read LDAP groups. The second is to explicitly configure Hadoop to use LDAP-based group mapping.
Here is an overview of steps to configure Hadoop explicitly to use groups stored in LDAP.
• Create Hadoop service accounts in LDAP
• Shut down HDFS NameNode & YARN ResourceManager
• Modify core-site.xml to point to LDAP for group mapping
• Restart HDFS NameNode & YARN ResourceManager
• Verify LDAP based group mapping
Prerequisites: access to LDAP and its connection details are available.…
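The core-site.xml change in the steps above points Hadoop at its built-in `LdapGroupsMapping` implementation. A minimal sketch follows; the LDAP URL, bind user, password, and search base are placeholders for your own directory's values.

```xml
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=hadoop-bind,ou=ServiceAccounts,dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.password</name>
  <value>bind-password-here</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value>
</property>
```

After restarting the NameNode and ResourceManager, `hdfs groups <username>` is a quick way to verify that groups now come from LDAP.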
Index-Level Security Comes to Cloudera Search
Many companies are already working on security in Hadoop. Cloudera has now extended its authorization component, Sentry, to cover its search offering, Cloudera Search.
How-to: Use Parquet with Impala, Hive, Pig, and MapReduce
March 21, 2014
An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:
GitHub Gets with Apache Cassandra for Fault-tolerant, Consumer Facing Reports
March 27, 2014
A blog post from a GitHub employee
My company, GitHub, helps software developers work better together; we host source code repositories and provide tools for builders to collaborate on their projects. I work on the analytics team at GitHub. My role includes the care and feeding of our data analysis pipeline and the user-facing features we build on top of it.
The K’s of Data Mining – Great Things Come in Pairs
March 27, 2014
Below is the link for a blog post written by Dr. Kirk Borne, a transdisciplinary data scientist and astrophysicist. Big Data is all about ‘V’s; here he explains data mining and concepts starting with the letter ‘K’.
The Forrester Wave for Hadoop market
The report for Q1 2014 can be found below. I would rather not comment on their positioning; I leave it to you to draw your own conclusions.
NoSQL Family Tree
Monday, 24 March 2014
Even if it includes just a handful of NoSQL databases, it’s still a nice visualization.
Announcing HDP 2.1 Tech Preview Component: Apache Spark
May 1st, 2014
Hadoop 2 and its YARN-based architecture have increased interest in running new engines on Hadoop, and one such workload is in-memory computing for machine learning and data science use cases. Apache Spark has emerged as an attractive option for this type of processing, and today we announce availability of our HDP 2.1 Tech Preview Component of Apache Spark. This is a key addition to the platform and brings another workload supported by YARN on HDP.
There has been a marked increase in interest among data scientists and enthusiasts for Apache Spark as they explore new ways to perform their unique yet complex tasks in Hadoop. Our customers are investigating this technology because Spark lets key resources effectively and simply implement iterative algorithms for advanced analytics, such as clustering and classification of datasets. It provides three key value points to developers:
• in-memory compute for iterative workloads,
• a simplified programming model in Scala,
• and machine learning libraries to simplify programming.
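The first value point, in-memory compute for iterative workloads, comes down to reusing the same dataset across many passes. The sketch below shows that pattern in plain Python (a toy gradient descent fit of a slope) so it runs anywhere; the comments note where Spark's RDD API would take over, with `points` held in cluster memory via `.cache()` and the per-pass work distributed as `rdd.map(...).reduce(...)`.

```python
# Plain-Python analogue of an iterative Spark workload. In Spark, `points`
# would be an RDD kept in memory with .cache(), and each pass would be a
# distributed map/reduce instead of the list comprehension and sum() below.

points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # (x, y) samples
w = 0.0  # slope of the model y = w * x

for _ in range(200):  # iterative workload: same data reused every pass
    # "map" step: each point's gradient contribution to the squared error
    grads = [2 * (w * x - y) * x for x, y in points]
    # "reduce" step: combine contributions, then take a gradient step
    w -= 0.01 * sum(grads) / len(points)

# w has converged close to the least-squares slope (about 2.02)
```

Without caching, each of the 200 passes would re-read the input from disk, which is exactly the cost MapReduce pays and Spark avoids for this class of algorithm.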