Hey, I do not have any statistics. I just wanted to show it can be done and left it at that. The memory usage should be predictable: the benefit comes from using arrays for primitive types. Accessing the data row-wise means re-assembling the rows from the columnar data, which I have not tried to profile or optimize at all yet, but there will certainly be some overhead compared to a row-oriented RDD. Also, the format relies on compile-time types (which is what allows the use of arrays for primitive types).
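To make the idea concrete, here is a rough sketch (not the actual spark-columnar code, just an illustration with made-up names) of one partition of (Int, Double) records stored as primitive column arrays, with rows re-assembled on access:

    // Hypothetical illustration: a block of (Int, Double) records kept as
    // primitive arrays instead of an array of boxed tuples.
    final case class ColumnarBlock(ids: Array[Int], values: Array[Double]) {
      def size: Int = ids.length

      // Row-wise access re-assembles each tuple from the column arrays;
      // this allocation/boxing is the overhead mentioned above.
      def rows: Iterator[(Int, Double)] =
        Iterator.tabulate(size)(i => (ids(i), values(i)))
    }

    object ColumnarBlock {
      // Build the columnar block from a row-oriented collection. Knowing the
      // element types at compile time is what lets us use Array[Int] and
      // Array[Double] rather than arrays of boxed values.
      def fromRows(rows: Seq[(Int, Double)]): ColumnarBlock =
        ColumnarBlock(rows.map(_._1).toArray, rows.map(_._2).toArray)
    }

In an RDD you would do this per partition (e.g. with mapPartitions), so each partition becomes one such block.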
On Sun, Mar 1, 2015 at 6:33 AM, Night Wolf <nightwolf...@gmail.com> wrote:
> Thanks for the comments guys.
>
> Parquet is awesome. My question with using Parquet for on-disk storage:
> how should I load that into memory as a Spark RDD, cache it, and keep it
> in a columnar format?
>
> I know I can use Spark SQL with Parquet, which is awesome. But as soon as
> I step out of SQL we have problems, as it kind of gets converted back to a
> row-oriented format.
>
> @Koert - that looks really exciting. Do you have any statistics on memory
> and scan performance?
>
>
> On Saturday, February 14, 2015, Koert Kuipers <ko...@tresata.com> wrote:
>
>> I wrote a proof of concept to automatically store any RDD of tuples or
>> case classes in columnar format using arrays (and strongly typed, so you
>> get the benefit of primitive arrays). See:
>> https://github.com/tresata/spark-columnar
>>
>> On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Shark's in-memory code was ported to Spark SQL and is used by default
>>> when you run .cache on a SchemaRDD or CACHE TABLE.
>>>
>>> I'd also look at Parquet, which is more efficient and handles nested
>>> data better.
>>>
>>> On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf <nightwolf...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to build/use column-oriented RDDs in some of my Spark code. A
>>>> normal Spark RDD is stored as row-oriented objects, if I understand
>>>> correctly.
>>>>
>>>> I'd like to leverage some of the advantages of a columnar memory
>>>> format. Shark (used to) and Spark SQL use a columnar storage format
>>>> with primitive arrays for each column.
>>>>
>>>> I'd be interested to know more about this approach and how I could
>>>> build my own custom column-oriented RDD which I can use outside of
>>>> Spark SQL.
>>>>
>>>> Could anyone give me some pointers on where to look to do something
>>>> like this, either from scratch or using what's there in the Spark SQL
>>>> libs or elsewhere. I know Evan Chan in a presentation made mention of
>>>> building a custom RDD of column-oriented blocks of data.
>>>>
>>>> Cheers,
>>>> ~N
>>>
>>>
>>
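For reference, a minimal sketch of the Spark SQL columnar caching Michael mentions, assuming the Spark 1.2-era API (parquetFile / registerTempTable / cacheTable); the path and table name here are made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ColumnarCacheExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("columnar-cache"))
        val sqlContext = new SQLContext(sc)

        // Load Parquet data as a SchemaRDD (illustrative path).
        val events = sqlContext.parquetFile("/data/events.parquet")
        events.registerTempTable("events")

        // cacheTable (or CACHE TABLE in SQL) keeps the data in Spark SQL's
        // in-memory columnar store, i.e. the code ported from Shark.
        sqlContext.cacheTable("events")

        // Queries now scan the cached columnar representation.
        sqlContext.sql("SELECT count(*) FROM events").collect().foreach(println)

        sc.stop()
      }
    }

As soon as you map over the SchemaRDD outside of SQL, rows get materialized again, which is the concern raised above.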