Re: Columnar-Oriented RDDs

2015-03-01 Thread Night Wolf
Thanks for the comments, guys. Parquet is awesome. My question with using Parquet for on-disk storage is: how should I load that into memory as a Spark RDD, cache it, and keep it in a columnar format? I know I can use Spark SQL with Parquet, which is awesome. But as soon as I step out of SQL we…

Re: Columnar-Oriented RDDs

2015-03-01 Thread Koert Kuipers
Hey, I do not have any statistics. I just wanted to show it can be done, but left it at that. The memory usage should be predictable: the benefit comes from using arrays for primitive types. Accessing the data row-wise means re-assembling the rows from the columnar data, which I have not tried to…
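Koert's point about primitive arrays and row re-assembly can be sketched in plain Scala, with no Spark dependency (the names `Person` and `ColumnarPeople` are illustrative assumptions here, not the spark-columnar API):

```scala
// Illustrative sketch of columnar storage for a case class.
// Names (Person, ColumnarPeople) are hypothetical, not the spark-columnar API.
case class Person(name: String, age: Int, score: Double)

// Column-wise layout: one array per field. Int and Double fields land in
// primitive arrays (no boxing), which is where the memory win comes from.
class ColumnarPeople(rows: Seq[Person]) {
  val names:  Array[String] = rows.map(_.name).toArray
  val ages:   Array[Int]    = rows.map(_.age).toArray   // primitive Array[Int]
  val scores: Array[Double] = rows.map(_.score).toArray // primitive Array[Double]

  // Row-wise access means re-assembling each row from the column arrays.
  def row(i: Int): Person = Person(names(i), ages(i), scores(i))

  // Column-wise access touches a single primitive array: cheap and cache-friendly.
  def meanAge: Double = ages.sum.toDouble / ages.length
}

val people = new ColumnarPeople(Seq(Person("a", 30, 1.5), Person("b", 40, 2.5)))
```

Here `people.meanAge` scans only the `Array[Int]` of ages, while `people.row(1)` pays the re-assembly cost Koert mentions: it must index into every column array to rebuild one `Person`.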

Columnar-Oriented RDDs

2015-02-13 Thread Night Wolf
Hi all, I'd like to build/use column-oriented RDDs in some of my Spark code. A normal Spark RDD is stored as row-oriented objects, if I understand correctly. I'd like to leverage some of the advantages of a columnar memory format. Shark used to, and Spark SQL does, use a columnar storage format using…

Re: Columnar-Oriented RDDs

2015-02-13 Thread Michael Armbrust
Shark's in-memory code was ported to Spark SQL and is used by default when you call .cache on a SchemaRDD or run CACHE TABLE. I'd also look at Parquet, which is more efficient and handles nested data better. On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf nightwolf...@gmail.com wrote: Hi all, I'd like…
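A minimal sketch of what Michael describes, using the Spark 1.2-era SchemaRDD API current at the time of this thread (the path and table name are placeholders, and a running SparkContext `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // assumes an existing SparkContext `sc`

// Load Parquet from disk; in pre-1.3 Spark the result is a SchemaRDD.
val events = sqlContext.parquetFile("hdfs:///data/events.parquet") // placeholder path

// Caching through Spark SQL uses its columnar in-memory format
// (the code ported from Shark), not plain row-oriented Java objects.
events.registerTempTable("events")
sqlContext.cacheTable("events") // or equivalently: events.cache()

// Subsequent queries read from the columnar in-memory cache.
sqlContext.sql("SELECT COUNT(*) FROM events").collect()
```

Note this keeps the data columnar only while you stay inside Spark SQL; once you map over the SchemaRDD as ordinary rows, you are back to row-at-a-time objects, which is exactly the boundary Night Wolf is asking about.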

Re: Columnar-Oriented RDDs

2015-02-13 Thread Koert Kuipers
I wrote a proof of concept to automatically store any RDD of tuples or case classes in columnar format using arrays (and strongly typed, so you get the benefit of primitive arrays). See: https://github.com/tresata/spark-columnar On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust…