Note groupby.apply [1] in particular should be able to do what you want,
something like:

  df.groupby('key1').apply(lambda df: df.sort_values('key2'))

But as Robert noted, we don't make any guarantees about preserving this
ordering later in the pipeline. For this reason I actually just sent a PR
to disallow sort_values on the entire dataset [2].
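
To illustrate what the groupby.apply expression above computes, here is a minimal sketch in plain pandas rather than Beam. The toy data and the column name 'value' are assumptions made up for the example. The Beam DataFrame API aims to mirror pandas, so the same expression should carry over to a deferred dataframe, but as noted above the per-group ordering is not guaranteed to survive later pipeline steps.

```python
import pandas as pd

# Hypothetical toy data: 'key1' is the partition key, 'key2' the sort key.
df = pd.DataFrame({
    "key1": ["a", "b", "a", "b", "a"],
    "key2": [3, 2, 1, 5, 2],
    "value": [10, 20, 30, 40, 50],
})

# Sort the rows of each 'key1' group by 'key2'. With group_keys=True the
# result is indexed by (key1, original row label).
sorted_within = df.groupby("key1", group_keys=True).apply(
    lambda g: g.sort_values("key2")
)

# Rows of one group, in 'key2' order.
group_a = sorted_within.loc["a"]
print(group_a[["key2", "value"]])
```

Recent pandas versions may emit a deprecation warning about apply operating on the grouping column. That does not affect the ordering shown here.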

Brian

[1] https://github.com/apache/beam/pull/13843
[2] https://github.com/apache/beam/pull/14324

On Fri, Apr 2, 2021 at 9:15 AM Robert Bradshaw <[email protected]> wrote:

> Thanks for trying this out.
>
> Better support for groupby (e.g. https://github.com/apache/beam/pull/13843
> , https://github.com/apache/beam/pull/13637) will be available in the
> next Beam release (2.29, in progress, but you could try out head if you
> want). Note, however, that Beam PCollections are by definition unordered,
> so unless you sort a partition and immediately do something with it that
> ordering may not be preserved. If you could let us know what you're trying
> to do with this ordering that would be helpful.
>
> - Robert
>
>
> On Thu, Apr 1, 2021 at 7:31 PM Wenbing Bai <[email protected]>
> wrote:
>
>> Hi Beam users,
>>
>> I have a use case to partition my PCollection by some key, and then sort
>> my rows within the same partition by some other key.
>>
>> I feel Beam Dataframe could be a candidate solution, but I cannot figure
>> out how to make it work. Specifically, I tried df.groupby where I expect my
>> data will be distributed to different nodes. I also tried df.sort_values,
>> but it will sort my whole dataset, which is not what I need.
>>
>> Can someone shed some light on this?
>>
>>
>>
>>
>>
>> Wenbing Bai
>>
>> Senior Software Engineer
>>
>> Data Infrastructure, Cruise
>>
>> Pronouns: She/Her
>>
>
>
