Re: [VOTE][RUST][Datafusion] Release Apache Arrow Datafusion 5.0.0 RC3

2021-08-11 Thread Wayne Xia
Hi QP, When running this script I noticed that this might be because I was not using a stable toolchain when testing. Those failures occur with nightly (which is my default toolchain). And everything works fine after switching to stable 1.54. So I think it's ok from my side to vote +1. BTW, I thi

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021, 19:05 Weston Pace wrote: > >> The benefit is that IR components don't interact much with > `flatbuffers` or > >> `flatc` directly. > >> > [...] > >> > >> One counter-proposal might be to just put the compute IR IDL in a > separate > >> repo, > >> but that isn't tenable becau

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Weston Pace
>> The benefit is that IR components don't interact much with `flatbuffers` or >> `flatc` directly. >> [...] >> >> One counter-proposal might be to just put the compute IR IDL in a separate >> repo, >> but that isn't tenable because the compute IR needs arrow's type information >> contained in `Sch

[Rust] Integration tests for recursive nested data?

2021-08-11 Thread Micah Kornfield
One of my PRs is showing integration test failures with Rust [1] for the recursive nested test. With an error: Validating /tmp/tmpg3zo83yh/de3ef975_generated_recursive_nested.json_as_file and /tmp/arrow-integration-0bm7hcmd/generated_recursive_nested.json Schemas match. JSON file has 2 batches. t

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
Le 11/08/2021 à 23:06, Phillip Cloud a écrit : On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou wrote: Le 11/08/2021 à 22:16, Phillip Cloud a écrit : Yeah, that is a drawback here, though I don't see needing to run flatc as a major downside given the upside of not having to write additiona

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > Couple of questions > > 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the > operation "(IR,data) - engine -> result" MUST be the same for all "engine"? > I think that might be a non-sta

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:21 PM David Li wrote: > If the worry is public distribution (i.e. requiring all downstream > projects to also run flatc in their builds) we could perhaps ship a package > that just consists of the generated code (though that's definitely more > packaging burden, and won'

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou wrote: > > Le 11/08/2021 à 22:16, Phillip Cloud a écrit : > > > > Yeah, that is a drawback here, though I don't see needing to run flatc > as a > > major downside given the upside > > of not having to write additional code to move between formats. >

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-11 Thread Jorge Cardoso Leitão
Couple of questions 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the operation "(IR,data) - engine -> result" MUST be the same for all "engine"? 2. if yes, imo we may need to worry about: * a definition of equality that implementations agree on. * agreement over what the sem

Re: Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-08-11 Thread Micah Kornfield
As an update, I've gotten basic integration testing working in Java and C++ along with the format proposal updates [1]. I have a little bit more work to do on the initial implementations (make CI happy, add unit tests in Java) but I think this is getting close to the point that we can vote on it.

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
Le 11/08/2021 à 22:20, David Li a écrit : If the worry is public distribution (i.e. requiring all downstream projects to also run flatc in their builds) we could perhaps ship a package that just consists of the generated code (though that's definitely more packaging burden, and won't help wh

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
Le 11/08/2021 à 22:16, Phillip Cloud a écrit : Yeah, that is a drawback here, though I don't see needing to run flatc as a major downside given the upside of not having to write additional code to move between formats. That's only an advantage if you already know how to read the Arrow IPC f

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread David Li
If the worry is public distribution (i.e. requiring all downstream projects to also run flatc in their builds) we could perhaps ship a package that just consists of the generated code (though that's definitely more packaging burden, and won't help when you're doing development against in-progres

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:05 PM Antoine Pitrou wrote: > > Le 11/08/2021 à 22:02, Phillip Cloud a écrit : > > On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou > wrote: > > > >> > >> Le 11/08/2021 à 21:56, Phillip Cloud a écrit : > >>> I can see how that might be a bit circular. Let me start from th

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
Le 11/08/2021 à 22:02, Phillip Cloud a écrit : On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou wrote: Le 11/08/2021 à 21:56, Phillip Cloud a écrit : I can see how that might be a bit circular. Let me start from the perspective of requirements. We want to be able to reuse the arrow's types

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou wrote: > > Le 11/08/2021 à 21:56, Phillip Cloud a écrit : > > I can see how that might be a bit circular. Let me start from the > > perspective of requirements. We want to be able to reuse the arrow's > types > > and schema, without having to write a

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
Le 11/08/2021 à 21:56, Phillip Cloud a écrit : I can see how that might be a bit circular. Let me start from the perspective of requirements. We want to be able to reuse the arrow's types and schema, without having to write additional code to move back and forth between compute IR and not-compu

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
I can see how that might be a bit circular. Let me start from the perspective of requirements. We want to be able to reuse the arrow's types and schema, without having to write additional code to move back and forth between compute IR and not-compute-IR. I think that leaves only flatbuffers as an o

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 3:51 PM Antoine Pitrou wrote: > > > Le 11/08/2021 à 21:39, Phillip Cloud a écrit : > > The benefit is that IR components don't interact much with `flatbuffers` > or > > `flatc` directly. > > > [...] > > > > One counter-proposal might be to just put the compute IR IDL in a

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
Le 11/08/2021 à 21:39, Phillip Cloud a écrit : The benefit is that IR components don't interact much with `flatbuffers` or `flatc` directly. [...] One counter-proposal might be to just put the compute IR IDL in a separate repo, but that isn't tenable because the compute IR needs arrow's ty

[DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
Hi all, I'd like to bring up an idea from a recent thread ([1]) about moving the `format/` directory out of the primary apache/arrow repository. I understand from that thread there are some concerns about using submodules, and I definitely sympathize with those concerns. In talking with David Li

Re: [VOTE][RUST][Datafusion] Release Apache Arrow Datafusion 5.0.0 RC3

2021-08-11 Thread QP Hou
Hi Ruihang, Thanks for helping with the validation. It would certainly be helpful if you could share the error log with me. I have also prepared an updated version of the verification script at https://github.com/houqp/arrow-datafusion/blob/qp_release/dev/release/verify-release-candidate.sh. This

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-11 Thread Phillip Cloud
Thanks Wes, Great to be back working on Arrow again and engaging with the community. I am really excited about this effort. I think there are a number of concerns I see as important to address in the compute IR proposal: 1. Requirement for output types. I think that so far there have been many rea

Re: [VOTE][RUST][Datafusion] Release Apache Arrow Datafusion 5.0.0 RC3

2021-08-11 Thread Andy Grove
+1 (binding) Verification process: - Checked shasum - Ran cargo test --all - Ran Ballista integration tests - Manually verified Cargo.toml dependencies for Ballista On Wed, Aug 11, 2021 at 3:20 AM Andrew Lamb wrote: > +1 (binding) > > I verified the signature, checked shasums and ran `cargo t

Re: [VOTE][RUST][Datafusion] Release Apache Arrow Datafusion 5.0.0 RC3

2021-08-11 Thread Wayne Xia
Thanks, QP! I verified the signature and checked shasum, but got 3 failed cases while testing: - execution_plans::shuffle_writer::tests::test - execution_plans::shuffle_writer::tests::test_partitioned - physical_plan::repartition::tests::repartition_with_dropping_output_stream I set up env `ARROW

Re: [VOTE][RUST][Datafusion] Release Apache Arrow Datafusion 5.0.0 RC3

2021-08-11 Thread Andrew Lamb
+1 (binding) I verified the signature, checked shasums and ran `cargo test --all` Andrew On Wed, Aug 11, 2021 at 2:03 AM QP Hou wrote: > Hi, > > I would like to propose a release of Apache Arrow Datafusion > Implementation, > version 5.0.0. > > RC3 fixed a cargo publish issue discovered in R