Review request for Dataset Java API PRs

2021-08-03 Thread Hongze Zhang
Hi, I have some PRs that improve the Dataset API's Java implementation and have not been reviewed for months. Could someone help me review them? Thanks in advance. 1. https://github.com/apache/arrow/pull/10201 ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via

Re: [DISCUSS] Datasets API plugins?

2021-08-03 Thread Wes McKinney
I think if someone wants to build a plugin model for datasets / file formats (and refactor the existing "built-in" formats to use those plugin APIs), that sounds like a fine idea to me. I don't think the idea was for the API to be closed only to the formats that are implemented inside the Arrow

Re: [DISCUSS] next iteration of flatbuffer structures

2021-08-03 Thread Wes McKinney
Another Flatbuffers/Message.fbs project we should rekindle soon, in addition to the schema evolution/replacement question which has been raised with Flight, is that of sparse/compressed data (e.g. RLE). I have a vacation plus some travel coming up so won't be able to devote meaningful attention to

RE: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-03 Thread paddy horan
Hi Jorge, I see value in consolidating development in a single repo and releasing under the existing arrow crate. Regarding versioning, I think once we follow semantic versioning we are fine. I don't think it's worth migrating to a different repo and crate to comply with the de-facto

Re: [DISCUSS][C++] High level updates on multicore / nested parallelism strategy, work-stealing, etc.?

2021-08-03 Thread Weston Pace
I'd break things into (at least) four subproblems. # Nested fork/join Deadlock The original problem I set out to solve was the problem of nested fork/joins leading to deadlock. In particular, the parquet reader issues a fork/join per column and the dataset scanner issues a fork/join per file.
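The nested fork/join problem described above can be reproduced in miniature. The following is a minimal sketch in Python rather than Arrow's C++ thread pool, and every name in it (`read_column`, `read_file`, `scan`) is hypothetical and not an Arrow API. It assumes a single fixed-size pool shared by the outer (per-file) and inner (per-column) forks, and it uses a timeout on the inner join so the sketch terminates where a real blocking join would hang.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def read_column(file_id, col):
    # Hypothetical stand-in for decoding one column of one file.
    return f"file{file_id}/col{col}"


def read_file(pool, file_id, n_cols, timeout):
    # Inner fork: one task per column, submitted to the SAME pool
    # that is currently running this function.
    futures = [pool.submit(read_column, file_id, c) for c in range(n_cols)]
    try:
        # Inner join: blocks a pool worker while waiting on other pool tasks.
        return [f.result(timeout=timeout) for f in futures]
    except FutureTimeout:
        # No free worker could run the column tasks in time.
        # Without the timeout this join would wait forever.
        return None


def scan(n_workers, n_files, n_cols=2, timeout=0.5):
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Outer fork: one task per file.
        outer = [
            pool.submit(read_file, pool, i, n_cols, timeout)
            for i in range(n_files)
        ]
        # Outer join.
        return [f.result() for f in outer]


if __name__ == "__main__":
    # One worker, one file: the only worker is stuck in the inner join.
    print(scan(n_workers=1, n_files=1))  # [None]
    # Two workers, one file: the second worker runs the column tasks.
    print(scan(n_workers=2, n_files=1))  # [['file0/col0', 'file0/col1']]
```

With one worker, the file task occupies the only thread and then waits on column tasks that nothing is left to run. A spare worker hides the problem only while the number of blocked outer tasks stays below the pool size.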

[DISCUSS][C++] High level updates on multicore / nested parallelism strategy, work-stealing, etc.?

2021-08-03 Thread Wes McKinney
hi all, We've had some discussions in the past about our approach to nested parallelism (for example, reading multiple Parquet or CSV files or compressed Arrow IPC files in parallel, each of which can benefit from internal parallelism for faster parsing / decoding performance). Since then, there

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-03 Thread Jorge Cardoso Leitão
Hi Paddy, > What do you think about moving Arrow2 into the main Arrow repo where it is only enabled via an "experimental" feature flag? AFAIK this is already possible: * add `arrow2 = { version = "0.2.0", optional = true }` to Cargo.toml * add `#[cfg(feature = "arrow2")]\npub mod arrow2;\n` to

Arrow sync call August 3 at 12:00 US/Eastern, 16:00 UTC

2021-08-03 Thread Jonathan Keane
Hello everyone, Our biweekly sync call is tomorrow (3 August) at 12:00 noon Eastern time. For today's call, let's please use this Google Meet URL (different from the usual one): https://meet.google.com/vbq-yufg-zwr?authuser=0 All are welcome to join. Notes will be shared with the mailing list

Re: Recent Flatbuffers warns about non-snake-case field names

2021-08-03 Thread Wes McKinney
flatc does have the option to disable warnings (--no-warnings) On Tue, Aug 3, 2021 at 2:26 PM Micah Kornfield wrote: > > > > > Is it something that can be done in a major version release? > > > This seems like it would be a major version release of the specification, > which I think we were

Re: Recent Flatbuffers warns about non-snake-case field names

2021-08-03 Thread Micah Kornfield
> > Is it something that can be done in a major version release? This seems like it would be a major version release of the specification, which I think we were trying to essentially avoid in any reasonable time frame. Is there no way to turn the warnings off? On Mon, Aug 2, 2021 at 2:11 PM

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-03 Thread Benjamin Blodgett
great idea! On Tue, Aug 3, 2021 at 8:49 AM Andy Grove wrote: > I also like the idea of moving arrow2/parquet2 into the official repos. > This is effectively what we did with Ballista, which is still experimental. > Ballista was simpler because it depends on DataFusion rather than the other >

Re: 5.0.0 Release and Release Manager

2021-08-03 Thread Ian Cook
We should post the 5.0.0 release blog post soon. If anyone would like to review the content or make changes or additions, please do so as soon as possible: https://github.com/apache/arrow-site/pull/127 Thanks, Ian On Fri, Jul 16, 2021 at 1:44 PM Neal Richardson wrote: > > I've started a draft

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-03 Thread Andy Grove
I also like the idea of moving arrow2/parquet2 into the official repos. This is effectively what we did with Ballista, which is still experimental. Ballista was simpler because it depends on DataFusion rather than the other way around, but I like the idea of using feature flags to enable

RE: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-03 Thread paddy horan
Hi Jorge, What do you think about moving Arrow2 into the main Arrow repo where it is only enabled via an "experimental" feature flag? This would allow development of Arrow2 to proceed in the main repo but also this would be a clear signal that Arrow2 is <1.0. When we feel ready (i.e. Arrow2