Shark's in-memory columnar code was ported to Spark SQL and is used by default when you call .cache() on a SchemaRDD or run CACHE TABLE.
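For example, a minimal Spark 1.2-era sketch; the case class, table name, and data are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type, just for illustration.
    case class Record(id: Int, name: String)

    object ColumnarCacheDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("columnar-cache").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD // implicit RDD[Record] -> SchemaRDD

        val records = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
        records.registerTempTable("records")

        // Either of these puts the data into Spark SQL's in-memory
        // columnar format (primitive arrays per column):
        sqlContext.cacheTable("records")        // same as SQL: CACHE TABLE records
        // sqlContext.table("records").cache()  // .cache() on the SchemaRDD

        sqlContext.sql("SELECT name FROM records WHERE id = 1")
          .collect().foreach(println)
      }
    }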
I'd also look at Parquet, which is more efficient and handles nested data better (see the sketch after the quoted message below).

On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf <nightwolf...@gmail.com> wrote:

> Hi all,
>
> I'd like to build/use column-oriented RDDs in some of my Spark code. A
> normal Spark RDD is stored as row-oriented objects, if I understand
> correctly.
>
> I'd like to leverage some of the advantages of a columnar memory format.
> Shark used a columnar storage format, and Spark SQL still does, using
> primitive arrays for each column.
>
> I'd be interested to know more about this approach and how I could build
> my own custom column-oriented RDD which I can use outside of Spark SQL.
>
> Could anyone give me some pointers on where to look to do something like
> this, either from scratch or using what's there in the Spark SQL libs or
> elsewhere? I know Evan Chan made mention in a presentation of building a
> custom RDD of column-oriented blocks of data.
>
> Cheers,
> ~N
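For the Parquet route, here's a minimal sketch with the Spark 1.2-era SchemaRDD API, reusing the sqlContext and "records" temp table from the first sketch; the output path is hypothetical:

    // Write out as Parquet: a column-oriented, compressed on-disk format
    // that also supports nested schemas.
    val records = sqlContext.table("records")
    records.saveAsParquetFile("/tmp/records.parquet") // hypothetical path

    // Reading it back gives a SchemaRDD; Parquet scans are columnar,
    // so queries only read the columns they actually touch.
    val loaded = sqlContext.parquetFile("/tmp/records.parquet")
    loaded.registerTempTable("records_parquet")
    sqlContext.sql("SELECT id FROM records_parquet").collect().foreach(println)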