Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
Thanks for referencing this, Antoine. The concepts and principles seem pretty concrete, so I may take some time to read it in detail. BTW, I noticed from the current discussion in ticket ARROW-7272[1] that it's not yet clear whether this one or IPC flatbuffers would be the better approach for

Re: PyArrow.Table schema.metadata issue

2019-11-27 Thread Aaron Chu
Dear all, I need your help regarding the pyarrow.table.schema. I tried to create a schema and use the with_metadata/add_metadata functions to add metadata (a Python dict) to the schema. However, nothing showed up when I ran 'schema.metadata'. I can't get the metadata added to the schema. This

Re: Datasets and Java

2019-11-27 Thread Ji Liu
Hi Francois, Thanks for the proposal and your effort. I made a simple JNI poc before for RecordBatch/VectorSchemaRoot interaction between Java and C++[1][2]. This may help a little. Thanks, Ji Liu [1] https://github.com/tianchen92/jni-poc-java [2] https://github.com/tianchen92/jni-poc-cpp

Re: Datasets and Java

2019-11-27 Thread Francois Saint-Jacques
Hello Hongze, The C++ implementation of dataset, notably the Dataset, DataSource, DataSourceDiscovery, and Scanner classes, is not ready/designed for distributed computing. These classes don't serialize and they reference each other by pointer all around, thus I highly doubt that you can implement parts in Java, and

Re: Apache Arrow sync now

2019-11-27 Thread Francois Saint-Jacques
Attendees: - Micah Kornfield, Google - Praveen Kumar, Dremio - Todd Hendricks - François Saint-Jacques, RStudio/Ursa Labs Subject - Bazel. Micah wants feedback on the PR. This is first aimed at developer productivity, notably shorter link time and sandboxed builds. As a first PoC, parts of the

[jira] [Created] (ARROW-7272) [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot

2019-11-27 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7272. Summary: [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot Key: ARROW-7272 URL: https://issues.apache.org/jira/browse/ARROW-7272

Re: Strategy for mixing large_string and string with chunked arrays

2019-11-27 Thread Wes McKinney
On Tue, Nov 26, 2019 at 9:40 AM Maarten Breddels wrote: > > Op di 26 nov. 2019 om 15:02 schreef Wes McKinney : > > > hi Maarten > > > > I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based > > on this. > > > > I think that normalizing to a common type (which would require

Re: [DISCUSS][C++/Python] Bazel example

2019-11-27 Thread Micah Kornfield
> > I don't get how this is a cycle. It only means Bazel is too limited to > distinguish between a header dependency and a C++ module? Agreed, this isn't a true cycle, but bazel is opinionated about this (i.e. forces workarounds). In the example I highlighted it might have been cleaner to

Re: [Discuss][FlightRPC] Extensions to Flight: "DoBidirectional"

2019-11-27 Thread Jacques Nadeau
Fair enough. I'm okay with the bytes approach and the proposal looks good to me. On Fri, Nov 8, 2019 at 11:37 AM David Li wrote: > I've updated the proposal. > > On the subject of Protobuf Any vs bytes, and how to handle > errors/metadata, I still think using bytes is preferable: > - It doesn't

Apache Arrow sync now

2019-11-27 Thread Wes McKinney
https://meet.google.com/vtm-teks-phx I'm unable to join on account of the Thanksgiving holiday, but others are welcome to discuss and share call notes after

Re: [DISCUSS][C++/Python] Bazel example

2019-11-27 Thread Antoine Pitrou
Le 27/11/2019 à 06:16, Micah Kornfield a écrit : > >> Can you give an example of circular dependency? Can this be solved by >> having more "type_fwd.h" headers for forward declarations of opaque types? > > I think the type_fwd.h might contribute to the problem. The solution would > be more

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-27-0

2019-11-27 Thread Krisztián Szűcs
The Flight compilation errors occurring in the Conda builds are caused by a recent protobuf conda-forge update and should be fixed by https://github.com/apache/arrow/pull/5917 On Wed, Nov 27, 2019 at 2:01 PM Crossbow wrote: > > Arrow Build Report for Job nightly-2019-11-27-0 > > All tasks: >

[jira] [Created] (ARROW-7271) [C++][Flight] Use the single parameter version of SetTotalBytesLimit

2019-11-27 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7271. Summary: [C++][Flight] Use the single parameter version of SetTotalBytesLimit Key: ARROW-7271 URL: https://issues.apache.org/jira/browse/ARROW-7271 Project:

[NIGHTLY] Arrow Build Report for Job nightly-2019-11-27-0

2019-11-27 Thread Crossbow
Arrow Build Report for Job nightly-2019-11-27-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0 Failed Tasks: - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-homebrew-cpp - test-conda-cpp:

Re: Datasets and Java

2019-11-27 Thread Antoine Pitrou
To set up bridges between Java and C++, the C data interface specification may help: https://github.com/apache/arrow/pull/5442 There's an implementation for C++ here, and it also includes a Python-R bridge able to share Arrow data between two different runtimes (i.e. PyArrow and R-Arrow were

Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
Hi Micah, Regarding our use cases, we'd use the API on Parquet files with some pushed filters and projectors, and we'd extend the C++ Datasets code to provide necessary support for our own data formats. > If JNI is seen as too cumbersome, another possible avenue to pursue is > writing a gRPC

[jira] [Created] (ARROW-7270) [Go] preserve CSV reading behaviour, improve memory usage

2019-11-27 Thread Sebastien Binet (Jira)
Sebastien Binet created ARROW-7270. Summary: [Go] preserve CSV reading behaviour, improve memory usage Key: ARROW-7270 URL: https://issues.apache.org/jira/browse/ARROW-7270 Project: Apache Arrow

Re: Datasets and Java

2019-11-27 Thread Micah Kornfield
Hi Hongze, I have a strong preference for not porting non-trivial logic from one language to another, especially if the main goal is performance. I think this will replicate bugs and cause confusion if inconsistencies occur. It is also a non-trivial amount of work to develop, review, set up CI,