Re: Support for local disk columnar storage for DataFrames

2015-11-20 Thread Cristian O
Raised this for checkpointing, hopefully it gets some priority as it's very useful and relatively straightforward to implement? https://issues.apache.org/jira/browse/SPARK-11879 On 18 November 2015 at 16:31, Cristian O wrote: > Hi, > > While these OSS efforts
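
For context, a minimal sketch of the kind of workaround available before a dedicated DataFrame checkpointing API existed, going through the DataFrame's underlying RDD; checkpointDataFrame is a hypothetical helper for illustration, not the API proposed in the JIRA:

    import org.apache.spark.sql.{DataFrame, SQLContext}

    // Hypothetical helper (not the JIRA's proposed API): checkpoint a DataFrame
    // by checkpointing its underlying RDD of Rows and rebuilding the DataFrame
    // from the saved schema, which truncates the logical plan / lineage.
    def checkpointDataFrame(sqlContext: SQLContext, df: DataFrame): DataFrame = {
      val schema = df.schema   // keep the schema before dropping to the RDD level
      val rdd = df.rdd         // RDD[Row] backing the DataFrame
      rdd.checkpoint()         // requires sparkContext.setCheckpointDir(...) to be set
      rdd.count()              // materialize so the checkpoint is actually written
      sqlContext.createDataFrame(rdd, schema)
    }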

Re: Support for local disk columnar storage for DataFrames

2015-11-18 Thread Cristian O
Hi, While these OSS efforts are interesting, they're for now quite unproven. Personally, I would be much more interested in seeing Spark incrementally moving towards supporting updating DataFrames on various storage substrates, and first of all locally, perhaps as an extension of cached DataFrames.

Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Mark Hamstra
FiloDB is also closely related. https://github.com/tuplejump/FiloDB On Mon, Nov 16, 2015 at 12:24 AM, Nick Pentreath wrote: > Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop > input/output format support: >

Re: Support for local disk columnar storage for DataFrames

2015-11-15 Thread Reynold Xin
This (updates) is something we are going to think about in the next release or two. On Thu, Nov 12, 2015 at 8:57 AM, Cristian O wrote: > Sorry, apparently only replied to Reynold, meant to copy the list as well, > so I'm self replying and taking the opportunity

Re: Support for local disk columnar storage for DataFrames

2015-11-12 Thread Andrew Duffy
Relevant link: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files On Wed, Nov 11, 2015 at 7:31 PM, Reynold Xin wrote: > Thanks for the email. Can you explain what the difference is between this > and existing formats such as Parquet/ORC? > > > On
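
For reference, a minimal sketch of the Parquet round-trip that Parquet/ORC already provide (Spark 1.5-era DataFrame API; the output path is illustrative):

    // Write a DataFrame out as Parquet (columnar on-disk) and read it back.
    val df = sqlContext.range(0, 1000)
    df.write.mode("overwrite").parquet("/tmp/df-parquet")
    val reloaded = sqlContext.read.parquet("/tmp/df-parquet")
    reloaded.count()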

Re: Support for local disk columnar storage for DataFrames

2015-11-12 Thread Cristian O
Sorry, apparently only replied to Reynold, meant to copy the list as well, so I'm self replying and taking the opportunity to illustrate with an example. Basically I want to conceptually do this: val bigDf = sqlContext.sparkContext.parallelize((1 to 100)).map(i => (i, 1)).toDF("k", "v") val
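
The quoted example is cut off above; what follows is a hedged reconstruction of the pattern being described, assuming the goal is to fold a small delta into a large cached DataFrame without rebuilding the whole cache. deltaDf and updatedDf are hypothetical names used only for illustration:

    import org.apache.spark.sql.functions.coalesce
    import sqlContext.implicits._

    val bigDf = sqlContext.sparkContext.parallelize(1 to 1000000)
      .map(i => (i, 1)).toDF("k", "v")
    bigDf.cache()

    val deltaDf = sqlContext.sparkContext.parallelize(1 to 1000)
      .map(i => (i, 2)).toDF("k", "v")

    // Apply the delta: keys present in deltaDf take the new value, everything
    // else keeps its cached value. Ideally the result could be re-cached
    // incrementally rather than materialized as a brand new cached DataFrame.
    val updatedDf = bigDf.join(deltaDf, bigDf("k") === deltaDf("k"), "left_outer")
      .select(bigDf("k"), coalesce(deltaDf("v"), bigDf("v")).as("v"))
    updatedDf.cache()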

Support for local disk columnar storage for DataFrames

2015-11-11 Thread Cristian O
Hi, I was wondering if there's any planned support for local disk columnar storage. This could be an extension of the in-memory columnar store, or possibly something similar to the recently added local checkpointing for RDDs. This could also have the added benefit of enabling iterative usage for
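
For reference, the two existing building blocks mentioned here, a disk-backed (but row-serialized, not columnar) persisted DataFrame and local checkpointing of RDDs, look roughly like this in the 1.5-era API:

    import org.apache.spark.storage.StorageLevel

    val df = sqlContext.range(0, 1000000)
    df.persist(StorageLevel.DISK_ONLY)   // spills to local disk, but not in the columnar cache format

    val rdd = df.rdd
    rdd.localCheckpoint()                // truncates lineage using local, non-reliable storage
    rdd.count()                          // materialize to trigger the checkpoint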

Re: Support for local disk columnar storage for DataFrames

2015-11-11 Thread Reynold Xin
Thanks for the email. Can you explain what the difference is between this and existing formats such as Parquet/ORC? On Wed, Nov 11, 2015 at 4:59 AM, Cristian O wrote: > Hi, > > I was wondering if there's any planned support for local disk columnar > storage. >