When will Spark SQL support building DB index natively?

2014-12-17 Thread Xuelin Cao
Hi, In Spark SQL help document, it says "Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL. - Block level bitmap indexes and virtual columns (used to build indexes)" For our

Re: RDD data flow

2014-12-17 Thread Madhu
Patrick Wendell wrote: "The Partition itself doesn't need to be an iterator - the iterator comes from the result of compute(partition). The Partition is just an identifier for that partition, not the data itself." OK, that makes sense. The docs for Partition are a bit vague on this point. Maybe

Re: running the Terasort example

2014-12-17 Thread Tim Harsch
On 12/16/14, 11:42 PM, Ewan Higgs ewan.hi...@ugent.be wrote: Hi Tim, On 16 Dec 2014, at 19:27, Tim Harsch thar...@cray.com wrote: Hi Ewan, Thanks, I think I was just a bit confused at the time, I was looking at the spark-perf repo when there was the problem (uh.. ok)… The PR that I am

Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-17 Thread Josh Rosen
Yeah, it looks like messages that are successfully posted via Nabble end up on the Apache mailing list, but messages posted directly to Apache aren't mirrored to Nabble anymore because it's based off the incubator mailing list. We should fix this so that Nabble posts to / archives the

Fwd: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-17 Thread Krishna Sankar
Forgot Reply To All ;o( -- Forwarded message -- From: Krishna Sankar ksanka...@gmail.com Date: Wed, Dec 10, 2014 at 9:16 PM Subject: Re: [VOTE] Release Apache Spark 1.2.0 (RC2) To: Matei Zaharia matei.zaha...@gmail.com +1 Works same as RC1 1. Compiled OSX 10.10 (Yosemite) mvn

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Here's another data point: the slow part of my code is the construction of an RDD as the union of the textFile RDDs representing data from several distinct Google storage directories. So the question becomes the following: what computation happens when calling the union method on two RDDs? On