New in CDH 5.1: HDFS Read Caching

by Colin McCabe and Andrew Wang
August 11, 2014

Applications using HDFS, such as Impala, will be able to read data up to 59x faster thanks to this new feature.

Server memory capacity and bandwidth have increased dramatically over the last few years. Beefier servers make in-memory computation quite attractive, since a lot of interesting data sets can fit into cluster memory, and memory is orders of magnitude faster than disk.


User Defined Functions in Cassandra 3.0
August 12, 2014
By Planet Cassandra
Release 3.0 of Apache Cassandra will bring a cool new feature called User Defined Functions (UDF). Yes, users can write code that is executed inside Cassandra daemons.
UDFs are implemented by stateless code. By stateless I mean that a UDF implementation has just its input arguments to rely on. There is nothing like a shared state or execution context.
The programming language is your own choice. Java source support is built in, and scripting languages that have JSR-223 support (javax.script) can be used as well.
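To illustrate what "stateless" means here, the following is a minimal Python sketch. It is not Cassandra UDF syntax, and the function names are hypothetical. It contrasts a function that relies only on its input arguments with one that leans on shared state, which is the kind of dependency a UDF cannot have.

```python
def fahrenheit(celsius: float) -> float:
    """Stateless: the result depends only on the input argument."""
    return celsius * 9.0 / 5.0 + 32.0


# Counter-example (hypothetical): this version depends on shared state,
# so two calls with the same argument can return different results.
_call_count = 0


def fahrenheit_with_state(celsius: float) -> float:
    """Stateful: reads and mutates a module-level counter."""
    global _call_count
    _call_count += 1
    return celsius * 9.0 / 5.0 + 32.0 + _call_count
```

Calling `fahrenheit` twice with the same argument always gives the same answer, while `fahrenheit_with_state` does not.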


New features in the DataStax C# Driver 2.1
August 12, 2014

By Planet Cassandra
Following the announcement of the Python driver 2.1, we’re pleased to announce today the release of version 2.1 of the DataStax C# driver, which brings support for the new features in Cassandra 2.1 while maintaining compatibility with previous versions of Cassandra (1.2+).

Spark for Data Science: A Case Study
May 1st, 2014

I’m a pretty heavy Unix user and I tend to prefer doing things the Unix Way™, which is to say, composing many small command-line-oriented utilities. With composability comes power and with specialization comes simplicity. However, if two utilities are used together all the time, it sometimes makes sense to have either:
• A utility that specializes in a very common use-case
• One utility to provide basic functionality from another utility
For example, one thing that I find myself doing a lot of is searching a directory recursively for files that contain an expression:
Despite the fact that you can do this, specialized utilities such as ack have emerged to simplify this style of querying.…
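The command from the original post is not reproduced in this excerpt. As a rough illustration of the task itself (recursively finding files whose contents match an expression), here is a minimal Python sketch. The function name `files_matching` is hypothetical, and the sketch does not try to mirror the behavior or options of grep or ack.

```python
import re
from pathlib import Path


def files_matching(root, expression):
    """Return the files under `root` whose text matches the regex `expression`."""
    pattern = re.compile(expression)
    matches = []
    # rglob("*") walks the directory tree recursively.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Undecodable bytes are ignored so binary files do not abort the scan.
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file: skip it
        if pattern.search(text):
            matches.append(path)
    return matches
```

For example, `files_matching(".", r"TODO")` would list every readable file under the current directory that contains the string TODO.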
Node of the Rings: Fellowship of the Clusters (Or: Understanding How Cassandra Stores Data)
May 1st, 2014
“Node of the Rings: Fellowship of the Clusters (Or: Understanding How Cassandra Stores Data)” was created by Michael Kjellman, as part of Hakka Labs’ Cassandra Week.

Implementing the Kiji Data Model to Build Real-Time, Big Data Apps on Cassandra
May 1, 2014
In this talk, Clint Kelly (Technical Staff, WibiData) discusses:

- The Kiji architecture and data model
- Implementing the Kiji data model in Cassandra using the Java driver and CQL3
- Integrating Cassandra with Hadoop 2.x
- Building a flexible middleware platform that supports Cassandra and HBase (including projects that use both simultaneously)
- Exposing unique features of Cassandra (e.g., variable consistency) to Kiji users

The Kiji Project is a modular, open-source framework that enables developers to efficiently build real-time Big Data applications. Kiji is built upon popular open-source technologies such as Cassandra, HBase, Hadoop, and Scalding, and contains components that implement functionality critical for Big Data applications.

This talk was given at Cassandra Day Silicon Valley. Be sure to check out our other


If You Need ACID, C* is Not the Best NoSQL Option for You
May 1st, 2014
Need ACID?
Some problems beg to be solved with a relational database and schema. If a significant amount of your application logic requires transactions with ACID (Atomicity, Consistency, Isolation, Durability) guarantees, which are common to most relational databases, C* might not be a good fit.
As Cassandra continues to mature, new features continue to be implemented that help provide solutions for application logic that needs either ACID transactions, better solutions to limitations defined by the CAP theorem, or features designed to reduce the learning curve of C*. For example, Cassandra’s distributed architecture uses the concept of eventual consistency, which is an implementation model that guarantees that “eventually” all reads from various nodes in the cluster will return the most recent and correct result. However, this consistency model might be an issue for some application logic. Cassandra 2.0 introduced “Lightweight Transactions”, which make C* a better fit where you require that a given transaction have ACID properties (as in MySQL or most RDBMSs).
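Lightweight transactions are conditional writes: a write is applied only if a stated condition on the existing data holds. The following Python sketch shows that compare-and-set idea within a single process only. The class `TinyStore` and its methods are hypothetical, and the sketch says nothing about how Cassandra coordinates the check across nodes.

```python
import threading


class TinyStore:
    """Hypothetical in-memory store illustrating conditional (compare-and-set) writes."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert_if_not_exists(self, key, value):
        """Insert only when the key is absent; return True if the write was applied."""
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = value
            return True

    def update_if(self, key, expected, new_value):
        """Update only when the current value equals `expected`; return True if applied."""
        with self._lock:
            if self._rows.get(key) != expected:
                return False
            self._rows[key] = new_value
            return True
```

The check and the write happen under one lock, so no other writer can slip in between them. That atomic check-then-write is the property a conditional write is meant to provide.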

Betting the Farm on MongoDB
May 1, 2014
This is a guest post by Jon Dokulil, VP of Engineering at Hudl. Hudl’s CTO, Brian Kaiser, will be speaking at MongoDB World about migrating from SQL Server to MongoDB.

Hudl helps coaches win. We give sports teams from peewee to the pros online tools to make working with and analyzing video easy. Today we store well over 600 million video clips in MongoDB spread across seven shards. Our clips dataset has grown to over 350GB of data with over 70GB of indexes. From our first year of a dozen beta high schools we’ve grown to service the video needs of over 50,000 sports teams worldwide.


this week in elasticsearch
April 30, 2014
Welcome to This Week in Elasticsearch. In this roundup, we try to inform you about the latest and greatest changes in Elasticsearch. We cover what happened in the GitHub repositories, as well as many Elasticsearch events happening worldwide, and give you a small peek into the future of the project.

elasticsearch core

Field data: Improved circuit breaker error messages to include name of the field that caused a circuit break (#5718, master and 1.x)
Field data: Code cleanup, removed unused or almost unused methods (#5874, master and 1.x)
Field data: Use segment ordinals as global ordinals when possible (#5873)
Field data: Made ordinals start from 0 (#5871, master and 1.x)
Field data: Improved global ordinals on low cardinality fields (#5854, master and 1.x)
Field data: Provided better error message if field has no field data type (#5979, master and 1.x)
Lucene: Enabled turning on IndexWriter’s InfoStream (#5891, master and 1.x)
Lucene: Upgraded to Lucene 4.8 (#5932, master and 1.x)
Document versioning: Versioned get operations tests for version equality in all version types (#5929, master and 1.x)

Community Makes Push Toward Beta at Productive Apache Drill Hackathon

April 29, 2014
MapR recently hosted the first Apache Drill hackathon, with nearly forty people in attendance who helped push Drill toward its first beta release. It was great to see people from companies such as Visa, Cisco, LinkedIn and Hortonworks come together to harden and enhance the Apache Drill project.

The hackathon participants worked on many different aspects of Apache Drill. Over the next few weeks, these features will be incorporated into mainline. Here’s a preview of what we worked on, coming soon to a master near you


Bringing the Best of Apache Hive 0.13 to CDH Users
April 28, 2014

More than 300 bug fixes and stable features in Apache Hive 0.13 have already been backported into CDH 5.0.0.
Last week, the Hive community voted to release Hive 0.13. We’re excited about the continued efforts and progress in the project and the latest release — congratulations to all contributors involved!
Furthermore, thanks to continual feedback from customers about their needs, we were able to test and make more than 300 Hive 0.13 fixes and stable features generally available via CDH 5.0.0, which we released last month. Thus, Cloudera customers can confidently take advantage of them in production right now, including:

MapR Integrates the Complete Apache Spark Stack
April 10, 2014
This was originally posted on the Databricks blog.

With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in production to extract value from big data. One of the most common topics raised by our customers in recent months is Apache Spark. Some customers just want to learn more about the advantages of this technology and the use cases that it addresses, while others are already running it in production with the MapR Distribution. These customers range from the world’s largest cable telcos and retailers to Silicon Valley startups such as Quantifind, which recently talked about its use of Spark on MapR in an interview with Stefan Groschupf, CEO of Datameer.
