Hey Rohit,

A single SparkContext can be used to read and write data across different
storage systems, including HDFS and Cassandra. For instance you could do this:

rdd1 = sc.textFile(XXX)  // Some text file in HDFS
rdd1.saveAsHadoopFile(.., classOf[ColumnFamilyOutputFormat], ...)  // Save
into Cassandra (see the Cassandra example)

This is a common pattern when using Spark for ETL between different storage
systems.
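
To flesh that out a bit, here is a rough, untested Scala sketch of the whole
pattern against Spark 0.7.x and the Cassandra 1.2-era Hadoop classes. The
hostnames, keyspace/column family names, input path, and the way the row key
is derived are all placeholders for illustration, and it needs a running
Cassandra node to actually do anything:

```scala
import java.nio.ByteBuffer
import org.apache.cassandra.hadoop.{ColumnFamilyOutputFormat, ConfigHelper}
import org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
import org.apache.cassandra.utils.ByteBufferUtil
import org.apache.hadoop.mapreduce.Job
import spark.SparkContext
import spark.SparkContext._  // brings in saveAsNewAPIHadoopFile on pair RDDs

val sc = new SparkContext("local", "HdfsToCassandra")

// Configure the Cassandra output side (host, port, keyspace are placeholders).
val job = new Job()
ConfigHelper.setOutputInitialAddress(job.getConfiguration, "localhost")
ConfigHelper.setOutputRpcPort(job.getConfiguration, "9160")
ConfigHelper.setOutputColumnFamily(job.getConfiguration, "MyKeyspace", "MyColumnFamily")
ConfigHelper.setOutputPartitioner(job.getConfiguration,
  "org.apache.cassandra.dht.Murmur3Partitioner")

// Read from HDFS, then transform each line into (rowKey, mutations),
// the key/value shape ColumnFamilyOutputFormat expects.
val rdd = sc.textFile("hdfs://namenode:9000/path/to/input")
val mutations = rdd.map { line =>
  val col = new Column()
  col.setName(ByteBufferUtil.bytes("value"))
  col.setValue(ByteBufferUtil.bytes(line))
  col.setTimestamp(System.currentTimeMillis)
  val cosc = new ColumnOrSuperColumn()
  cosc.setColumn(col)
  val m = new Mutation()
  m.setColumn_or_supercolumn(cosc)
  // Row key derivation is application-specific; this is just a stand-in.
  val key: ByteBuffer = ByteBufferUtil.bytes(line.hashCode.toString)
  (key, java.util.Collections.singletonList(m))
}

mutations.saveAsNewAPIHadoopFile(
  "unused",  // output path is ignored by ColumnFamilyOutputFormat
  classOf[ByteBuffer], classOf[java.util.List[Mutation]],
  classOf[ColumnFamilyOutputFormat], job.getConfiguration)
```

The transform in the middle is where your aggregation logic would go; only
the final shape (key plus a list of Mutations) matters to the output format.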

- Patrick


On Sat, Oct 26, 2013 at 7:31 PM, Gary Malouf <[email protected]> wrote:

> Hi Rohit,
>
> We are big users of the Spark Shell - it is used by our analytics team for
> the same purposes that Hive used to be.  I guess the SparkContext provided
> at startup would have to be configured for either HDFS or Cassandra - I take
> it we would then manually create a second context?
>
> Thanks,
>
> Gary
>
>
> On Sat, Oct 26, 2013 at 1:07 PM, Rohit Rai <[email protected]> wrote:
>
>> Hello Gary,
>>
>> This is very easy to do. You can read your data from HDFS using
>> FileInputFormat, transform it into the desired rows, and write to
>> Cassandra using ColumnFamilyOutputFormat.
>>
>> Our library, Calliope (Apache licensed,
>> http://tuplejump.github.io/calliope/), can make the task of writing to C*
>> easier.
>>
>>
>> In case you don't want to convert the data to rows and would rather keep
>> it as files in Cassandra, our lightweight, Cassandra-backed, HDFS-compatible
>> filesystem SnackFS can help. SnackFS will be part of the next Calliope
>> release later this month, but we can give you early access if you would
>> like to try it out.
>>
>> Feel free to mail me directly in case you need any assistance.
>>
>>
>> Regards,
>> Rohit
>> founder @ tuplejump
>>
>>
>>
>>
>> On Sat, Oct 26, 2013 at 5:45 AM, Gary Malouf <[email protected]> wrote:
>>
>>> We have a use case in which much of our raw data is stored in HDFS
>>> today.  We'd like to write our Spark jobs such that they read/aggregate
>>> data from HDFS and can output to our Cassandra cluster.
>>>
>>> Is there any way of doing this in Spark 0.7.3?
>>>
>>
>>
>>
>> --
>>
>> ____________________________
>> www.tuplejump.com
>> *The Data Engineering Platform*
>>
>
>