Lance Dacey created ARROW-15474:
-----------------------------------

             Summary: [Python] Possibility of a table.drop_duplicates() function?
                 Key: ARROW-15474
                 URL: https://issues.apache.org/jira/browse/ARROW-15474
             Project: Apache Arrow
          Issue Type: Wish
    Affects Versions: 6.0.1
            Reporter: Lance Dacey
             Fix For: 8.0.0

I noticed that there are group_by() and sort_by() functions in the 7.0.0 branch. Would it be possible to include a drop_duplicates() function as well?

||id||updated_at||
|1|2022-01-01 04:23:57|
|2|2022-01-01 07:19:21|
|2|2022-01-10 22:14:01|

Something like the drop_duplicates() call shown below, which would return a table without the second row of the example above, would be great.

I am usually reading an append-only dataset and then need to report on the latest version of each row. To drop duplicates, I temporarily convert the append-only table to a pandas DataFrame, drop the duplicates there, convert the result back to a table, and save it as a separate "latest-version" dataset.

{code:python}
table.sort_by(sorting=[("id", "ascending"), ("updated_at", "ascending")]).drop_duplicates(subset=["id"], keep="last")
{code}