Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-13 Thread Will Jones
Hello Arrow devs, Just a quick note. To answer one of my earlier questions: "1. Is this array type currently only used in Velox? (not DuckDB like some of the other new types?) What evidence do we have that it will become used outside of Velox?" This type is also used by DuckDB. Found

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
(Admittedly, PR title of [1] doesn't reflect that only the scalar aggregate UDF is implemented and not the hash one - that is an oversight on my part - sorry) On Tue, Jun 13, 2023 at 3:51 PM Li Jin wrote: > Thanks Weston. > > I think I found what you pointed out to me before which is this bit

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
Thanks Weston. I think I found what you pointed out to me before, which is this bit of code: https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/partition.cc#L118 I will see whether I can adapt this for use in a streaming situation. > I know you recently added [1] and I'm maybe a little

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Weston Pace
Are you looking for something in C++ or python? We have a thing called the "grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h) which (if memory serves) is the heart of the functionality in C++. It would be nice to add some python bindings for this functionality as this ask comes

Re: [ANNOUNCE] New Arrow PMC member: Jie Wen (jakevin / jackwener)

2023-06-13 Thread David Li
Welcome Jie! On Tue, Jun 13, 2023, at 10:25, Weston Pace wrote: > Congratulations > > On Tue, Jun 13, 2023, 1:28 AM Joris Van den Bossche < > jorisvandenboss...@gmail.com> wrote: > >> Congratulations! >> >> On Mon, 12 Jun 2023 at 22:00, Raúl Cumplido >> wrote: >> > >> > Congratulations Jie!!! >>

Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
Hi, I am trying to write a function that takes a stream of record batches (where the last column is group id), and produces k record batches, where record batch k_i contains all the rows with group id == i. Pseudocode is something like: def group_rows(batches, k) -> array[RecordBatch] {

Re: Converting Pandas DataFrame <-> Struct Array?

2023-06-13 Thread Li Jin
Gotcha - If there is no penalty from RecordBatch<->StructArray then I am happy with the current approach - thanks! For Spencer's question, the reason that I use StructArray is because the kernel interfaces I am interested in use the Array interface instead of RecordBatch, so StructArray is easier

Re: [ANNOUNCE] New Arrow PMC member: Jie Wen (jakevin / jackwener)

2023-06-13 Thread Weston Pace
Congratulations On Tue, Jun 13, 2023, 1:28 AM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > Congratulations! > > On Mon, 12 Jun 2023 at 22:00, Raúl Cumplido > wrote: > > > > Congratulations Jie!!! > > > > El lun, 12 jun 2023, 20:35, Matt Topol > escribió: > > > > > Congrats

Re: [RESULT][VOTE] Release Apache Arrow 12.0.1 - RC1

2023-06-13 Thread Raúl Cumplido
Hi, I've had an issue with the post-11-bump-versions.sh script. For patch releases the script fails unless using the `BUMP_DEB_PACKAGE_NAMES=0` flag. This is not documented and I had to retry several times locally to understand what the issue was. The problem is that this script commits and

Re: [RESULT][VOTE] Release Apache Arrow 12.0.1 - RC1

2023-06-13 Thread Raúl Cumplido
Thanks Nic for helping me with uploading sources and adding the release to the Apache Reporter System. This is the current status of the post-release tasks: - [done] Update the released milestone Date and set to "Closed" on GitHub - [done] Merge changes on release branch to maintenance branch

Re: [ANNOUNCE] New Arrow PMC member: Jie Wen (jakevin / jackwener)

2023-06-13 Thread Joris Van den Bossche
Congratulations! On Mon, 12 Jun 2023 at 22:00, Raúl Cumplido wrote: > > Congratulations Jie!!! > > El lun, 12 jun 2023, 20:35, Matt Topol escribió: > > > Congrats Jie! > > > > On Sun, Jun 11, 2023 at 9:20 AM Andrew Lamb wrote: > > > > > The Project Management Committee (PMC) for Apache Arrow

Re: [Python] Dataset scanner fragment skip options.

2023-06-13 Thread Joris Van den Bossche
On Mon, 12 Jun 2023 at 21:30, Jerald Alex wrote: > > hi Weston, > > Thank you so much for taking the time to respond. Really appreciate it. > > I'm using parquet files. So would it be possible to elaborate the below.? I > cannot seem to find any documentation for ParquetFileFragment. > > "there

Re: Converting Pandas DataFrame <-> Struct Array?

2023-06-13 Thread Joris Van den Bossche
I think your original code roundtripping through RecordBatch (`pa.RecordBatch.from_pandas(df).to_struct_array()`) is the best option at the moment. The RecordBatch<->StructArray part is a cheap (zero-copy) conversion, and by using RecordBatch.from_pandas, you can rely on all pandas<->arrow

[RESULT][VOTE] Release Apache Arrow 12.0.1 - RC1

2023-06-13 Thread Raúl Cumplido
Hi, Thanks everyone. The result of the vote is successful with 3 +1 binding votes, 3 +1 non-binding votes and no -1 votes. I will start the post release tasks for 12.0.1 [1]. Thanks, Raúl [1] https://arrow.apache.org/docs/dev/developers/release.html#post-release-tasks El mar, 13 jun 2023 a las