Re: how to make acero output order by batch index

2023-07-25 Thread Wenbo Hu
Replacing ``` ac::Declaration source{"record_batch_reader_source", ac::RecordBatchReaderSourceNodeOptions{std::move(input)}}; ``` with ``` ac::RecordBatchSourceNodeOptions rb_source_options{ input->schema(), [input]() { return arrow::MakeFunctionIterator([input] { return input->Next(); }); }};

Re: how to make acero output order by batch index

2023-07-25 Thread Wenbo Hu
Hi, I'll open an issue on the DeclareToReader problem. I think the key problem is that the input stream is unordered. The input stream is an ArrowArrayStream imported from the Python side, and then declared to a "record_batch_reader_source", which is an unordered source node. So the behavior is

Re: scheduler() and async_scheduler() on QueryContext

2023-07-25 Thread Weston Pace
1) As a rule of thumb I would probably prefer `async_scheduler`. It's more feature rich and simpler to use and is meant to handle "long running" tasks (e.g. 10s-100s of ms or more). The scheduler is a bit more complex and is intended for very fine-grained scheduling. It's currently only used in

scheduler() and async_scheduler() on QueryContext

2023-07-25 Thread Li Jin
Hi, I am reading the Acero code and got confused about the use of QueryContext::scheduler() and QueryContext::async_scheduler(). So I have a couple of questions: (1) What are the different purposes of these two? (2) Does scheduler/async_scheduler own any threads inside their respective classes or do they

Re: [RESULT][VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 28.0.0 RC1

2023-07-25 Thread Andrew Lamb
Thank you! On Tue, Jul 25, 2023 at 10:43 AM Andy Grove wrote: > On Tue, Jul 25, 2023 at 8:42 AM Andy Grove wrote: > > > The vote passes with 5 +1 votes (4 binding). Thanks for verifying the > > release. > > > > I have published the release. > > > > On Mon, Jul 24, 2023 at 7:19 AM Andrew Lamb

Re: how to make acero output order by batch index

2023-07-25 Thread Weston Pace
> Reading the source code of exec_plan.cc, DeclarationToReader called > DeclarationToRecordBatchGenerator, which ignores the sequence_output > parameter in SinkNodeOptions, also, it calls validate which should > fail if the SinkNodeOptions honors the sequence_output. Then it seems > that

[RESULT][VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 28.0.0 RC1

2023-07-25 Thread Andy Grove
On Tue, Jul 25, 2023 at 8:42 AM Andy Grove wrote: > The vote passes with 5 +1 votes (4 binding). Thanks for verifying the > release. > > I have published the release. > > On Mon, Jul 24, 2023 at 7:19 AM Andrew Lamb wrote: > >> +1 (binding) >> >> Verified in x86_64 mac >> >> Thank you very much

Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 28.0.0 RC1

2023-07-25 Thread Andy Grove
The vote passes with 5 +1 votes (4 binding). Thanks for verifying the release. I have published the release. On Mon, Jul 24, 2023 at 7:19 AM Andrew Lamb wrote: > +1 (binding) > > Verified in x86_64 mac > > Thank you very much Andy. > Andrew > > On Sun, Jul 23, 2023 at 9:31 AM vin jake wrote:

Re: how to make acero output order by batch index

2023-07-25 Thread Wenbo Hu
Reading the source code of exec_plan.cc, DeclarationToReader calls DeclarationToRecordBatchGenerator, which ignores the sequence_output parameter in SinkNodeOptions. It also calls validate, which should fail if the SinkNodeOptions honors the sequence_output. Then it seems that DeclarationToReader

how to make acero output order by batch index

2023-07-25 Thread Wenbo Hu
Hi, I'm trying to zip two streams with the same order but different processes. For example, the original stream comes with two columns, 'id' and 'age', and is split into two streams that are processed in a distributed fashion using Acero: 1. hash the 'id' into a stream with a single column 'bucket_id' and 2. classify
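
The ordering requirement raised in this thread can be illustrated with a small standalone sketch. It uses no Arrow or Acero API: `IndexedBatch`, `ReorderBuffer`, and `EmitOrder` are hypothetical names invented for illustration, and the code only shows the general idea of buffering batches that arrive out of order and releasing them by batch index. It is not how Acero implements sequenced output.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch tagged with its stream position.
struct IndexedBatch {
  std::int64_t index;
  std::string payload;
};

// Buffers out-of-order batches and releases them strictly in index order.
// Assumes indices start at 0 and are contiguous (an assumption of this sketch).
class ReorderBuffer {
 public:
  // Accepts one batch and returns every batch that is now ready, in order.
  std::vector<IndexedBatch> Push(IndexedBatch batch) {
    const std::int64_t index = batch.index;
    pending_.emplace(index, std::move(batch));

    std::vector<IndexedBatch> ready;
    // Drain the buffer for as long as the next expected index is present.
    for (auto it = pending_.find(next_index_); it != pending_.end();
         it = pending_.find(next_index_)) {
      ready.push_back(std::move(it->second));
      pending_.erase(it);
      ++next_index_;
    }
    return ready;
  }

  // Number of batches still waiting for an earlier index to arrive.
  std::size_t pending() const { return pending_.size(); }

 private:
  std::int64_t next_index_ = 0;
  std::map<std::int64_t, IndexedBatch> pending_;
};

// Feeds batches in the given arrival order and records the emission order.
std::vector<std::int64_t> EmitOrder(const std::vector<std::int64_t>& arrivals) {
  ReorderBuffer buffer;
  std::vector<std::int64_t> emitted;
  for (std::int64_t index : arrivals) {
    for (const IndexedBatch& batch : buffer.Push({index, "batch"})) {
      emitted.push_back(batch.index);
    }
  }
  return emitted;
}
```

Two streams that are each passed through such a buffer emit batch i at the same position, so they could be zipped pairwise. The trade-off is memory: a late batch forces every later batch to wait in the buffer.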

Re: [VOTE] Release Apache Arrow 13.0.0 - RC0

2023-07-25 Thread Raúl Cumplido
Hi, During the release process we already run several verification tasks on Docker containers. One benefit of verifying in local environments is that we can test setups that are not covered by the CI jobs, and we sometimes find issues there. There is an issue