i wrote a proof of concept to automatically store any RDD of tuples or case
classes in columar format using arrays (and strongly typed, so you get the
benefit of primitive arrays). see:
https://github.com/tresata/spark-columnar

On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust <[email protected]>
wrote:

> Shark's in-memory code was ported to Spark SQL and is used by default when
> you run .cache on a SchemaRDD or CACHE TABLE.
>
> I'd also look at parquet which is more efficient and handles nested data
> better.
>
> On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I'd like to build/use column oriented RDDs in some of my Spark code. A
>> normal Spark RDD is stored as row oriented object if I understand
>> correctly.
>>
>> I'd like to leverage some of the advantages of a columnar memory format.
>> Shark (used to) and SparkSQL uses a columnar storage format using primitive
>> arrays for each column.
>>
>> I'd be interested to know more about this approach and how I could build
>> my own custom columnar-oriented RDD which I can use outside of Spark SQL.
>>
>> Could anyone give me some pointers on where to look to do something like
>> this, either from scratch or using whats there in the SparkSQL libs or
>> elsewhere. I know Evan Chan in a presentation made mention of building a
>> custom RDD of column-oriented blocks of data.
>>
>> Cheers,
>> ~N
>>
>
>

Reply via email to