Inline.
On Thu, Aug 13, 2015 at 5:06 AM, Eugene Morozov wrote:
> Hemant, William, please see inline.
>
> On 12 Aug 2015, at 18:18, Philip Weaver wrote:
>
> Yes, I am partitioning using DataFrameWriter.partitionBy, which produces
> the keyed directory structure that you referenced in that link.
>
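For concreteness, the write side looks roughly like this (a minimal sketch;
df and the column name "key" stand in for my actual data):

// One subdirectory per key value is produced under the output path, e.g.
//   /data/out/key=a/part-r-00000.parquet
//   /data/out/key=b/part-r-00000.parquet
df.write.partitionBy("key").parquet("/data/out")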
On Tue, Aug 11, 2015 at 11:54 PM, Hemant Bhanawat wrote:
As far as I know, Spark SQL cannot process data on a per-partition basis.
DataFrame.foreachPartition is the way.
I haven't tried it, but the following looks like a not-so-sophisticated way
of making Spark SQL partition aware:
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
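Untested, but something along these lines (df, sqlContext, and the path are
placeholders):

import org.apache.spark.sql.Row

// Run arbitrary logic on each partition locally, with no shuffle; the body
// executes on the executor that owns the partition.
df.foreachPartition { rows: Iterator[Row] =>
  rows.foreach { row =>
    // per-partition work goes here
  }
}

// With partition discovery, reading a key=value directory layout back
// exposes "key" as a column, and filtering on it prunes whole directories:
val onlyA = sqlContext.read.parquet("/data/out").filter("key = 'a'")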
Thanks.
In my particular case, I am calculating a distinct count on a key that is
unique to each partition, so I want to calculate the distinct count within
each partition and then sum those counts. This approach avoids moving the
sets of keys between nodes, which would be very expensive.
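Roughly what I have in mind (a sketch, untested; pairs is assumed to be an
RDD[(String, Long)] in which every key lives in exactly one partition):

import scala.collection.mutable

// Count distinct keys locally within each partition, then add up the
// per-partition counts. Since no key spans two partitions, nothing is
// double-counted and only one Long per partition crosses the network.
val distinctKeys = pairs.mapPartitions { it =>
  val seen = mutable.HashSet.empty[String]
  it.foreach { case (k, _) => seen += k }
  Iterator.single(seen.size.toLong)
}.reduce(_ + _)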
Philip,
If all of the data for a given key is inside a single partition, then Spark
will figure that out. Can you guarantee that's the case?
What is it you're trying to achieve? There might be another way to do it,
one where you can be 100% sure of what's happening.
You can print an RDD's toDebugString or call explain on a DataFrame to see
whether a shuffle is actually planned.
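For example (a sketch; df and pairs stand in for your data):

// DataFrame: an Exchange operator in the physical plan means a shuffle.
df.groupBy("key").count().explain()

// RDD: a ShuffledRDD in the printed lineage means a shuffle.
println(pairs.groupByKey().toDebugString)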
If I have an RDD that happens to already be partitioned by a key, how
efficient can I expect a groupBy operation to be? I would expect that Spark
shouldn't have to move data around between nodes, and simply will have a
small amount of work just checking the partitions to discover that it
doesn't need to shuffle anything.
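Concretely, the case I'm asking about looks like this (sketch only; raw is a
placeholder for my pair RDD):

import org.apache.spark.HashPartitioner

// Partition by key once up front (this does shuffle) and cache the result.
val pairs = raw.partitionBy(new HashPartitioner(100)).cache()

// A later groupByKey picks up the existing partitioner, so, as far as I
// understand it, Spark can group within each partition instead of moving
// data between nodes.
val grouped = pairs.groupByKey()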