Hey,
I do not have any statistics. I just wanted to show it can be done but left
it at that. The memory usage should be predictable: the benefit comes from
using arrays for primitive types. Accessing the data row-wise means
re-assembling the rows from the columnar data, which I have not tried to
profile or optimize at all yet, but there is bound to be some overhead
compared to a row-oriented RDD.
Also, the format relies on compile-time types (which is what allows the
use of arrays for primitive types).
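
Roughly, the idea is something like this (a hand-rolled sketch for one specific
case class, not the actual spark-columnar code, which derives it from the
compile-time type; the Point class and field names are just for illustration):

    case class Point(x: Double, label: Int)

    // Columnar block for one partition: one primitive array per field, no boxing.
    class PointBlock(val xs: Array[Double], val labels: Array[Int]) extends Serializable {
      def size: Int = xs.length
      // Row-wise access means re-assembling each Point from the columns,
      // which is where the read overhead versus a row-oriented RDD comes from.
      def iterator: Iterator[Point] =
        Iterator.range(0, size).map(i => Point(xs(i), labels(i)))
    }

    object PointBlock {
      def fromRows(rows: Iterator[Point]): PointBlock = {
        val buf = rows.toArray
        new PointBlock(buf.map(_.x), buf.map(_.label))
      }
    }

    // rdd.mapPartitions(it => Iterator(PointBlock.fromRows(it))).cache() caches
    // columnar blocks; .flatMap(_.iterator) gives back a row-wise RDD[Point] view.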

On Sun, Mar 1, 2015 at 6:33 AM, Night Wolf <nightwolf...@gmail.com> wrote:

> Thanks for the comments guys.
>
> Parquet is awesome. My question with using Parquet for on-disk storage -
> how should I load it into memory as a Spark RDD, cache it, and keep it
> in a columnar format?
>
> I know I can use Spark SQL with Parquet, which is awesome. But as soon as I
> step out of SQL we have problems, as it kinda gets converted back to a
> row-oriented format.
>
> @Koert - that looks really exciting. Do you have any statistics on memory
> and scan performance?
>
>
> On Saturday, February 14, 2015, Koert Kuipers <ko...@tresata.com> wrote:
>
>> I wrote a proof of concept to automatically store any RDD of tuples or
>> case classes in columnar format using arrays (and strongly typed, so you get
>> the benefit of primitive arrays). See:
>> https://github.com/tresata/spark-columnar
>>
>> On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust <mich...@databricks.com
>> > wrote:
>>
>>> Shark's in-memory code was ported to Spark SQL and is used by default
>>> when you run .cache on a SchemaRDD or CACHE TABLE.
>>>
>>> I'd also look at Parquet, which is more efficient and handles nested data
>>> better.
>>>
>>> On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf <nightwolf...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to build/use column-oriented RDDs in some of my Spark code. A
>>>> normal Spark RDD is stored as row-oriented objects, if I understand
>>>> correctly.
>>>>
>>>> I'd like to leverage some of the advantages of a columnar memory
>>>> format. Shark (used to) and Spark SQL use a columnar storage format with
>>>> primitive arrays for each column.
>>>>
>>>> I'd be interested to know more about this approach and how I could
>>>> build my own custom column-oriented RDD that I can use outside of Spark
>>>> SQL.
>>>>
>>>> Could anyone give me some pointers on where to look to do something
>>>> like this, either from scratch or using what's there in the Spark SQL libs or
>>>> elsewhere? I know Evan Chan mentioned in a presentation building a
>>>> custom RDD of column-oriented blocks of data.
>>>>
>>>> Cheers,
>>>> ~N
>>>>
>>>
>>>
>>
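
Regarding the question above about loading Parquet and keeping it columnar in
memory: as far as I know, the in-memory columnar format is what you get while
the data stays behind Spark SQL (a cached SchemaRDD / CACHE TABLE); once you
drop down to the plain RDD API, the rows get materialized again, which is the
conversion mentioned above. A rough sketch with the Spark 1.2-era APIs (the
path and table name are placeholders, and sc is the usual SparkContext):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load a Parquet file as a SchemaRDD and cache it in Spark SQL's
    // in-memory columnar format.
    val data = sqlContext.parquetFile("hdfs:///path/to/table.parquet")
    data.registerTempTable("events")
    sqlContext.cacheTable("events")   // or: sqlContext.sql("CACHE TABLE events")

    // Staying in SQL scans the cached columnar batches directly:
    val count = sqlContext.sql("SELECT COUNT(*) FROM events").collect()

    // Stepping out to the RDD API still works (a SchemaRDD is an RDD[Row]),
    // but each Row is re-assembled from the cached columns, i.e. the data
    // is handed back to you row-oriented again:
    val firstCols = sqlContext.sql("SELECT * FROM events").map(_.getInt(0))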
