Re: value_counts after group_by

Weston Pace Fri, 26 Aug 2022 06:21:44 -0700

I'm happy to spread the word.  The thanks here go to Eduardo Ponce,
Aldrin Montana, and the various reviewers who have all worked hard to
create this doc.


On Fri, Aug 26, 2022 at 6:05 AM Suresh V <[email protected]> wrote:
>
> Hi Weston
>
> Thanks a lot for the response. I tried the list approach a while back to get
> the group keys in this fashion and run parallel computation at the group
> level, but the performance penalty for the dataset of 50M rows was way too
> high (2s vs 8s).
>
> Thanks a lot for the awesome initiative of teaching people how to create new 
> kernels. This PR is what I was looking for and helps alleviate the learning 
> curve.
>
> Thanks
>
>
>
> On Thu, Aug 25, 2022, 8:45 PM Weston Pace <[email protected]> wrote:
>>
>> > Is there a way to get value_counts of a given column after doing table 
>> > group_by?
>>
>> Is your goal to group by some key and then get the value counts of an
>> entirely different non-key column?  If so, then no, not today, at
>> least not directly.  The only group by node we have is a hash-group-by
>> and this can only accept "hash aggregate functions".  These are
>> defined in [1] and value_counts does not have a "hash aggregate"
>> variant but it does seem like it would make sense.
>>
>> Indirectly, you can use the "list" aggregate function as a sort of escape 
>> hatch:
>>
>> ```
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> tab = pa.Table.from_pydict({
>>     'state': ['Washington', 'Washington', 'Colorado', 'Colorado', 'Colorado'],
>>     'city': ['Seattle', 'Seattle', 'Denver', 'Colorado Springs', 'Denver'],
>>     'temp': [70, 75, 83, 89, 94]
>> })
>>
>> grouped = pa.TableGroupBy(tab, 'state').aggregate([('city', 'list')])
>> print(grouped)
>>
>> # pyarrow.Table
>> # city_list: list<item: string>
>> #   child 0, item: string
>> # state: string
>> # ----
>> # city_list: [[["Seattle","Seattle"],["Denver","Colorado Springs","Denver"]]]
>> # state: [["Washington","Colorado"]]
>> ```
>>
>> You could then use a for-loop to walk through each cell of city_list
>> and run value_counts on that array.
>>
>> > If it's not possible, can you please point me to the relevant cpp/python
>> > files I need to modify for this to work?
>>
>> You would need to create a "hash aggregate function" for value_counts
>> (it would presumably be called hash_value_counts to match the existing
>> pattern).  The starting point for understanding such functions would
>> probably be [2].  Each hash-aggregate kernel consists of 5 different
>> functions (init, resize, consume, merge, and finalize) that you will
>> need to provide.  You can use any of the other hash_* functions as
>> examples for how you might implement these.  Basically, these
>> functions take in a column of values and a column of ids and they
>> update some kind of running state (one per thread).  At the end of the
>> stream the various thread states are merged together and the finalize
>> function turns this final state into an output array.
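
To make the five functions more concrete, here is a toy model in pure Python. It is not the Arrow C++ kernel API: the class name, the use of collections.Counter, and the group_id_mapping argument to merge are all simplifying assumptions made for this illustration. It only mirrors the flow described above (per-thread state, then merge, then finalize).

```python
from collections import Counter


class HashValueCountsState:
    """Toy per-thread state: one Counter of values per group id."""

    def __init__(self):
        # init: start with no groups.
        self.groups = []

    def resize(self, num_groups):
        # resize: make room for group ids up to num_groups - 1.
        while len(self.groups) < num_groups:
            self.groups.append(Counter())

    def consume(self, values, group_ids):
        # consume: a column of values plus a parallel column of group ids.
        for value, gid in zip(values, group_ids):
            self.groups[gid][value] += 1

    def merge(self, other, group_id_mapping):
        # merge: fold another thread's state into this one. The mapping
        # translates the other state's group ids into this state's ids.
        for other_gid, counter in enumerate(other.groups):
            self.groups[group_id_mapping[other_gid]].update(counter)

    def finalize(self):
        # finalize: turn the running state into one output entry per group.
        return [dict(counter) for counter in self.groups]


# Two "threads" each see part of the input. In thread_a, group 0 is
# Washington and group 1 is Colorado; thread_b numbers them the other way.
thread_a = HashValueCountsState()
thread_a.resize(2)
thread_a.consume(['Seattle', 'Denver'], [0, 1])

thread_b = HashValueCountsState()
thread_b.resize(2)
thread_b.consume(['Denver', 'Colorado Springs', 'Seattle'], [0, 0, 1])

# At the end of the stream, merge thread_b into thread_a and finalize.
thread_a.merge(thread_b, group_id_mapping=[1, 0])
result = thread_a.finalize()
print(result)
```

A real hash_value_counts kernel would keep this state in Arrow buffers and emit an Arrow array from finalize; the existing hash_* functions remain the best reference for those details.
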
>>
>> Work is underway on a guide to help with authoring new kernel
>> functions.  The current PR for this guide can be found at [3].
>>
>> [1] https://arrow.apache.org/docs/cpp/compute.html#grouped-aggregations-group-by
>> [2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L678
>> [3] https://github.com/apache/arrow/pull/13933
>>
>> On Thu, Aug 25, 2022 at 10:26 AM Suresh V <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Is there a way to get value_counts of a given column after doing table 
>> > group_by?
>> >
>> > If it's not possible, can you please point me to the relevant cpp/python
>> > files I need to modify for this to work?
>> >
>> > Thanks
>> >
>> >
