Basic Read and Write using Apache Spark & Cassandra

Spark provides a comprehensive framework that supports multiple analytics processing options, including:

a) fast interactive queries

b) streaming analytics

c) graph analytics and

d) machine learning.

Cassandra is a distributed database based on Google's BigTable and Amazon's Dynamo. Like other Big Data databases, it allows for flexible data structures. One of Cassandra's coolest features is that it scales in a predictable way: every node in a Cassandra cluster has the same role and works in the same way. No single node becomes a bottleneck for overall cluster performance, and there is no single point of failure. And that is pretty wonderful.

Spark and Cassandra together make a powerful combination for many kinds of processing. This quick start is all about Spark-Cassandra connectivity.


Prerequisites for Apache Spark-Cassandra Connectivity

apache-cassandra-2.2.3

spark-1.5.1-bin-hadoop2.6

jdk1.7.0_45

cassandra-driver-core-2.1.5.jar

spark-cassandra-connector_2.10-1.5.0-M1.jar


1. Environment settings in .bashrc

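The original screenshot of the environment settings is not reproduced here. A minimal sketch, assuming the prerequisite packages were unpacked under the user's home directory (the exact paths are assumptions):

# paths assume everything was unpacked under $HOME; adjust to your layout
export JAVA_HOME=$HOME/jdk1.7.0_45
export CASSANDRA_HOME=$HOME/apache-cassandra-2.2.3
export SPARK_HOME=$HOME/spark-1.5.1-bin-hadoop2.6
export PATH=$PATH:$JAVA_HOME/bin:$CASSANDRA_HOME/bin:$SPARK_HOME/bin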

(Note: These can be set in .bash_profile too.)

2. Apache Spark Installation

(Kindly refer to our chennaihug.org guide for Spark installation)

http://chennaihug.org/knowledgebase/spark-master-and-slaves-single-node-installation/

(Note: Use the versions mentioned above.)

3. Apache Cassandra Standalone Quick Start

(Kindly refer to our chennaihug.org guide for Cassandra installation)

http://chennaihug.org/knowledgebase/cassandra-single-node-installation/

(Note: Use the versions mentioned above.)

4. Steps for Configuration

a. Copy all the jars from apache-cassandra-2.2.3/lib to spark-1.5.1-bin-hadoop2.6/lib, along with cassandra-driver-core-2.1.5.jar (downloaded separately).
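A minimal sketch of this copy step, assuming both directories and the downloaded driver jar sit under the user's home directory (paths are assumptions):

# copy Cassandra's bundled jars into Spark's lib directory
cp ~/apache-cassandra-2.2.3/lib/*.jar ~/spark-1.5.1-bin-hadoop2.6/lib/
# add the separately downloaded Java driver
cp ~/cassandra-driver-core-2.1.5.jar ~/spark-1.5.1-bin-hadoop2.6/lib/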

b. Go to spark-1.5.1-bin-hadoop2.6/conf/

Rename spark-env.sh.template to spark-env.sh and add the entries below.

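The screenshot of spark-env.sh is not reproduced here. A plausible sketch, assuming the jars from step (a) were copied into the Spark lib directory (the exact entries in the original may differ):

# put the Cassandra jars on Spark's classpath; path is an assumption
export SPARK_CLASSPATH=$HOME/spark-1.5.1-bin-hadoop2.6/lib/*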

c. Start Cassandra and Spark, then check the running daemons with the jps command.

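A sketch of the start-up commands, assuming the installation directories from the prerequisites:

# start Cassandra (daemonizes into the background by default)
~/apache-cassandra-2.2.3/bin/cassandra
# start the Spark master and worker
~/spark-1.5.1-bin-hadoop2.6/sbin/start-all.sh
# list running JVM daemons; expect CassandraDaemon, Master and Worker
jps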

5. Creation of Keyspace and Table in Cassandra

Create the keyspace and table needed for this quick start in Cassandra.

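The CQL from the original screenshot is not reproduced here. A sketch that matches the keyspace, table, and column names used in section 6 below (the column types are inferred from the insert statement in 6.3, so treat them as assumptions):

CREATE KEYSPACE patient WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE patient;
CREATE TABLE patientdata (sno int PRIMARY KEY, name text, drug text, gender text, amt text);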

Here, a patient dataset is taken as input to the table.

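The dataset screenshot is not reproduced here. A few hypothetical rows in the CSV layout the table expects (all values below are made up for illustration; only the column layout is confirmed by the insert in 6.3):

sno,name,drug,gender,amt
1,priya,dolo,female,150
2,kumar,avil,male,200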

Use the COPY command to load the data into Cassandra.

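A sketch of the COPY command, assuming the dataset is saved as patientdata.csv with a header row (the file name is hypothetical):

COPY patient.patientdata (sno,name,drug,gender,amt) FROM 'patientdata.csv' WITH HEADER = true;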

6. Enter the spark-shell

Move the downloaded “spark-cassandra-connector_2.10-1.5.0-M1.jar” into the spark-1.5.1-bin-hadoop2.6 directory, then launch the shell with the connector on the classpath:

bin/spark-shell --jars spark-cassandra-connector_2.10-1.5.0-M1.jar

Run the following:

6.1 Configuring a new sc

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

// stop the default context created by spark-shell
sc.stop

// point the connector at the local Cassandra instance and create a new context
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("local[2]", "test", conf)

6.2 Accessing Cassandra

import com.datastax.spark.connector._

// read the Cassandra table as an RDD and print its first row
val rdd = sc.cassandraTable("patient", "patientdata")
println(rdd.first)

6.3 Inserting data in Cassandra

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._

// open a session against the cluster configured on the SparkContext and run an insert
val c = CassandraConnector(sc.getConf)
c.withSessionDo(session => session.execute("insert into patient.patientdata (sno,name,drug,gender,amt) values (11,'john','avil','male','100')"))

Reference images

a. Creating the Spark context for Cassandra (screenshot)

b. Data insertion into Cassandra (screenshot)

c. Checking cqlsh for the newly inserted records (screenshot)
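As a check in cqlsh, a query along these lines confirms the row written in 6.3 (a sketch; the exact query in the original screenshot may differ):

select * from patient.patientdata where sno = 11;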

———————————–

Article written by DataDotz Team

DataDotz is a Chennai-based Big Data team primarily focused on consulting and training in technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), Search, and Cloud Computing.

Note: DataDotz also provides classroom-based Apache Kafka training in Chennai. The course includes Cassandra, MongoDB, Scala, and Apache Spark training. For more details about Apache Spark training in Chennai, please visit http://datadotz.com/training/