Thanks for the comments guys.
Parquet is awesome. My question with using Parquet for on-disk storage: how
should I load it into memory as a Spark RDD, cache it, and keep it in a
columnar format?
I know I can use Spark SQL with Parquet, which is awesome. But as soon as I
step out of SQL we lose the columnar representation.
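Concretely, here's the SQL path I mean (Spark 1.2-era API; the path and
column layout are made up):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc) // sc is an existing SparkContext

  // Columnar on disk: load Parquet into a SchemaRDD.
  val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
  events.registerTempTable("events")

  // Inside SQL everything is fine...
  sqlContext.sql("SELECT count(*) FROM events").collect()

  // ...but stepping out to the RDD API hands back plain Row objects
  // (assuming column 0 is an Int here):
  events.map(row => row.getInt(0)).sum()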
Hey,
I do not have any statistics; I just wanted to show it can be done and left
it at that. The memory usage should be predictable: the benefit comes from
using arrays for primitive types. Accessing the data row-wise means
re-assembling the rows from the columnar data, which I have not tried to
optimize.
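To make that trade-off concrete, here is a minimal sketch (not the actual
spark-columnar code) of one partition of (Int, Double) rows held as two
primitive arrays, with rows re-assembled on demand:

  // Minimal sketch, not the actual spark-columnar code.
  case class Columns(ids: Array[Int], values: Array[Double]) {
    // Row-wise access zips the columns back into tuples.
    def rows: Iterator[(Int, Double)] =
      Iterator.range(0, ids.length).map(i => (ids(i), values(i)))
  }

  val data = Seq((1, 0.5), (2, 1.5), (3, 2.5))
  val cols = Columns(data.map(_._1).toArray, data.map(_._2).toArray)
  cols.rows.foreach(println) // re-assembly cost is paid on every row access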
Hi all,
I'd like to build/use column-oriented RDDs in some of my Spark code. A
normal Spark RDD is stored as row-oriented objects, if I understand
correctly.
I'd like to leverage some of the advantages of a columnar memory format.
Shark used to, and Spark SQL does, use a columnar in-memory storage format.
Shark's in-memory code was ported to Spark SQL and is used by default when
you run .cache() on a SchemaRDD or CACHE TABLE.
I'd also look at Parquet, which is more efficient and handles nested data
better.
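For example, either of these uses the columnar in-memory store (Spark
1.2-era API; the path and table name are made up):

  val schemaRdd = sqlContext.parquetFile("hdfs:///data/events.parquet")

  // .cache() on a SchemaRDD:
  schemaRdd.cache()

  // or CACHE TABLE via SQL:
  schemaRdd.registerTempTable("events")
  sqlContext.sql("CACHE TABLE events")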
I wrote a proof of concept that automatically stores any RDD of tuples or
case classes in columnar format using arrays (and strongly typed, so you
get the benefit of primitive arrays). See:
https://github.com/tresata/spark-columnar
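The core idea is roughly the following sketch (just an illustration of the
technique, not spark-columnar's actual API):

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  // Transpose each partition of an RDD of pairs into one object that
  // holds a primitive array per column.
  case class ColumnarPartition[A, B](a: Array[A], b: Array[B])

  def toColumnar[A: ClassTag, B: ClassTag](
      rdd: RDD[(A, B)]): RDD[ColumnarPartition[A, B]] =
    rdd.mapPartitions { it =>
      val rows = it.toArray
      Iterator.single(ColumnarPartition(rows.map(_._1), rows.map(_._2)))
    }

  // With A = Int and B = Double the ClassTags make these Array[Int] and
  // Array[Double], so the cached data avoids per-row boxing.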