Inline.
On Thu, Aug 13, 2015 at 5:06 AM, Eugene Morozov wrote:
> Hemant, William, please see inline.
>
> On 12 Aug 2015, at 18:18, Philip Weaver wrote:
>
> Yes, I am partitioning using DataFrameWriter.partitionBy, which produces
> the keyed directory structure that you referenced in that link.
>
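For concreteness, the write side looks roughly like this (a minimal sketch;
df and the column name "key" stand in for my actual data):

// One subdirectory per key value is produced under the output path, e.g.
//   /data/out/key=a/part-r-00000.parquet
//   /data/out/key=b/part-r-00000.parquet
df.write.partitionBy("key").parquet("/data/out")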
On Tue, Aug 11, 2015 at 11:54 PM, Hemant Bhanawat wrote:
As far as I know, Spark SQL cannot process data on a per-partition basis.
DataFrame.foreachPartition is the way.
I haven't tried it, but the following looks like a not-so-sophisticated way
of making Spark SQL partition aware:
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
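Untested, but something along these lines (df, sqlContext, and the path are
placeholders):

import org.apache.spark.sql.Row

// Run arbitrary logic on each partition locally, with no shuffle; the body
// executes on the executor that owns the partition.
df.foreachPartition { rows: Iterator[Row] =>
  rows.foreach { row =>
    // per-partition work goes here
  }
}

// With partition discovery, reading a key=value directory layout back
// exposes "key" as a column, and filtering on it prunes whole directories:
val onlyA = sqlContext.read.parquet("/data/out").filter("key = 'a'")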
Thanks.
In my particular case, I am calculating a distinct count on a key that is
unique to each partition, so I want to calculate the distinct count within
each partition and then sum those counts. This approach avoids moving the
sets of keys between nodes, which would be very expensive.
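Roughly what I have in mind (a sketch, untested; pairs is assumed to be an
RDD[(String, Long)] in which every key lives in exactly one partition):

import scala.collection.mutable

// Count distinct keys locally within each partition, then add up the
// per-partition counts. Since no key spans two partitions, nothing is
// double-counted and only one Long per partition crosses the network.
val distinctKeys = pairs.mapPartitions { it =>
  val seen = mutable.HashSet.empty[String]
  it.foreach { case (k, _) => seen += k }
  Iterator.single(seen.size.toLong)
}.reduce(_ + _)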
Philip,
If all of the data for a given key is inside a single partition, then Spark
will figure that out. Can you guarantee that's the case?
What is it you're trying to achieve? There might be another way to do it,
one where you can be 100% sure of what's happening.
You can print an RDD's toDebugString or call explain on a DataFrame to see
whether a shuffle is actually planned.
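For example (a sketch; df and pairs stand in for your data):

// DataFrame: an Exchange operator in the physical plan means a shuffle.
df.groupBy("key").count().explain()

// RDD: a ShuffledRDD in the printed lineage means a shuffle.
println(pairs.groupByKey().toDebugString)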
If I have an RDD that happens to already be partitioned by a key, how
efficient can I expect a groupBy operation to be? I would expect that Spark
shouldn't have to move data around between nodes, and simply will have a
small amount of work just checking the partitions to discover that it
doesn't need to shuffle anything.
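Concretely, the case I'm asking about looks like this (sketch only; raw is a
placeholder for my pair RDD):

import org.apache.spark.HashPartitioner

// Partition by key once up front (this does shuffle) and cache the result.
val pairs = raw.partitionBy(new HashPartitioner(100)).cache()

// A later groupByKey picks up the existing partitioner, so, as far as I
// understand it, Spark can group within each partition instead of moving
// data between nodes.
val grouped = pairs.groupByKey()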