Err - "Hi Gary"!
On Sat, Oct 26, 2013 at 10:14 PM, Patrick Wendell <[email protected]> wrote:

> Hey Rohit,
>
> A single SparkContext can be used to read and write files of different
> formats, including HDFS or Cassandra. For instance you could do this:
>
> rdd1 = sc.textFile(XXX) // Some text file in HDFS
> rdd1.saveAsHadoopFile(.., classOf[ColumnFamilyOutputFormat], ...) // Save into a Cassandra column family (see the Cassandra example)
>
> This is a common pattern when using Spark for ETL between different
> storage systems.
>
> - Patrick
>
> On Sat, Oct 26, 2013 at 7:31 PM, Gary Malouf <[email protected]> wrote:
>
>> Hi Rohit,
>>
>> We are big users of the Spark shell - it is used by our analytics team
>> for the same purposes that Hive used to be. The SparkContext provided at
>> startup would, I take it, have to point at either HDFS or Cassandra - would
>> we then manually create a second context?
>>
>> Thanks,
>>
>> Gary
>>
>> On Sat, Oct 26, 2013 at 1:07 PM, Rohit Rai <[email protected]> wrote:
>>
>>> Hello Gary,
>>>
>>> This is very easy to do. You can read your data from HDFS using
>>> FileInputFormat, transform it into the desired rows, and write them to
>>> Cassandra using ColumnFamilyOutputFormat.
>>>
>>> Our library Calliope (Apache licensed),
>>> http://tuplejump.github.io/calliope/, can make the task of writing to C*
>>> easier.
>>>
>>> In case you don't want to convert the data to rows and would rather keep
>>> it as files in Cassandra, our lightweight, Cassandra-backed, HDFS-compatible
>>> filesystem SnackFS can help you. SnackFS will be part of the next Calliope
>>> release later this month, but we can provide you access if you would like
>>> to try it out.
>>>
>>> Feel free to mail me directly in case you need any assistance.
>>>
>>> Regards,
>>> Rohit
>>> founder @ tuplejump
>>>
>>> On Sat, Oct 26, 2013 at 5:45 AM, Gary Malouf <[email protected]> wrote:
>>>
>>>> We have a use case in which much of our raw data is stored in HDFS
>>>> today. We'd like to write our Spark jobs such that they read/aggregate
>>>> data from HDFS and can output to our Cassandra cluster.
>>>>
>>>> Is there any way of doing this in Spark 0.7.3?
>>>
>>> --
>>> ____________________________
>>> www.tuplejump.com
>>> *The Data Engineering Platform*
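For reference, Patrick's snippet can be fleshed out roughly as below. This is a sketch only, assuming a Spark 0.7/0.8-era deployment with Cassandra's Hadoop integration (ColumnFamilyOutputFormat is a new-API OutputFormat, so saveAsNewAPIHadoopFile is used); the host, port, keyspace, column-family, and file path are illustrative placeholders, not values from the thread.

```scala
import java.nio.ByteBuffer

import org.apache.cassandra.hadoop.{ColumnFamilyOutputFormat, ConfigHelper}
import org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
import org.apache.hadoop.conf.Configuration
import spark.SparkContext
import spark.SparkContext._

object HdfsToCassandra {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "HdfsToCassandra")

    // Read raw text from HDFS (path is illustrative).
    val lines = sc.textFile("hdfs://namenode:8020/data/events.txt")

    // Point the Hadoop output side at the Cassandra cluster
    // (host, port, keyspace, and column family are illustrative).
    val conf = new Configuration()
    ConfigHelper.setOutputInitialAddress(conf, "cassandra-host")
    ConfigHelper.setOutputRpcPort(conf, "9160")
    ConfigHelper.setOutputColumnFamily(conf, "my_keyspace", "events")
    ConfigHelper.setOutputPartitioner(conf, "Murmur3Partitioner")

    // Transform each line into (row key, list of mutations),
    // assuming tab-separated "key<TAB>value" input.
    val mutations = lines.map { line =>
      val Array(key, value) = line.split("\t", 2)
      val col = new Column()
      col.setName(ByteBuffer.wrap("body".getBytes("UTF-8")))
      col.setValue(ByteBuffer.wrap(value.getBytes("UTF-8")))
      col.setTimestamp(System.currentTimeMillis)
      val cosc = new ColumnOrSuperColumn().setColumn(col)
      val mutation = new Mutation().setColumn_or_supercolumn(cosc)
      (ByteBuffer.wrap(key.getBytes("UTF-8")),
        java.util.Collections.singletonList(mutation))
    }

    // Write the rows into Cassandra; the "path" argument is the keyspace.
    mutations.saveAsNewAPIHadoopFile(
      "my_keyspace",
      classOf[ByteBuffer],
      classOf[java.util.List[Mutation]],
      classOf[ColumnFamilyOutputFormat],
      conf)
  }
}
```

As the thread notes, a single SparkContext handles both ends: the read targets HDFS via the input path's scheme, while the write is routed to Cassandra entirely by the output format and its Hadoop Configuration, so no second context is needed.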
