Thanks.

In my particular case, I am calculating a distinct count on a key that is
unique to each partition, so I want to calculate the distinct count within
each partition and then sum those counts. This avoids shuffling the sets of
key values between nodes, which would be very expensive.
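
For example, on a plain RDD the calculation could look roughly like this
(just a sketch; keyedRdd is a stand-in for an RDD[(String, _)] that is
already partitioned by the key):

    // Count distinct keys within each partition, then sum the per-partition
    // counts; only one Long per partition ever leaves its node.
    val totalDistinct = keyedRdd
      .mapPartitions { iter =>
        val seen = new scala.collection.mutable.HashSet[String]
        iter.foreach { case (key, _) => seen += key }
        Iterator.single(seen.size.toLong)
      }
      .reduce(_ + _)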

Currently, to accomplish this we are manually reading the parquet files
(not through Spark SQL), using a bitset to calculate the unique count
within each partition, and then summing those counts. Doing this through
Spark SQL would be nice, but the naive "SELECT count(distinct ...)" approach
takes 60 times as long :). The approach I mentioned above might be an
acceptable hybrid solution.
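
The hybrid I have in mind would look something like this (a sketch only; the
path, the "id" column name, and the assumption that ids are small non-negative
Ints are all placeholders):

    import java.util.BitSet

    // Read through Spark SQL instead of a hand-rolled parquet reader, but
    // still compute the distinct count per partition with a bitset.
    val df = sqlContext.read.parquet("/path/to/table")
    val total = df.select("id").rdd
      .mapPartitions { rows =>
        val bits = new BitSet()
        rows.foreach(row => bits.set(row.getInt(0)))  // assumes non-negative Int ids
        Iterator.single(bits.cardinality().toLong)
      }
      .reduce(_ + _)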

- Philip


On Tue, Aug 11, 2015 at 3:27 PM, Eugene Morozov <fathers...@list.ru> wrote:

> Philip,
>
> If all the data for each key is inside just one partition, then Spark will
> figure that out. Can you guarantee that's the case?
> What is it you are trying to achieve? There might be another way to do it,
> one where you can be 100% sure of what's happening.
>
> You can print the RDD's toDebugString, or call explain on a DataFrame, to
> see what's happening under the hood.
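>
> Something like this, for example (rdd and df being whatever you are
> inspecting):
>
>     // RDD: prints the lineage, including the partitioner in use
>     println(rdd.toDebugString)
>
>     // DataFrame: prints the logical and physical plans
>     df.explain(true)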
>
>
> On 12 Aug 2015, at 01:19, Philip Weaver <philip.wea...@gmail.com> wrote:
>
> If I have an RDD that happens to already be partitioned by a key, how
> efficient can I expect a groupBy operation to be? I would expect that Spark
> shouldn't have to move data between nodes, and would only need to do a small
> amount of work checking the partitions to discover that nothing has to be
> moved.
>
> Now, what about a parquet dataset created with
> DataFrameWriter.partitionBy(...)? Will Spark SQL be smart when I group by a
> key that the data is already partitioned by?
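>
> Concretely, I mean something like this (path and column names are just
> placeholders):
>
>     // Write partitioned by "key"; each key value gets its own directory.
>     df.write.partitionBy("key").parquet("/path/to/table")
>
>     // Later: can this group-by avoid a shuffle, given the layout above?
>     sqlContext.read.parquet("/path/to/table")
>       .groupBy("key")
>       .count()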
>
> - Philip
>
>
> Eugene Morozov
> fathers...@list.ru
>
