Re: Datasets and Java

2019-11-28 Thread Hongze Zhang
Hi François, Thank you very much for this kind and detailed analysis! Since I am not that experienced with the Arrow project, this helps me quite a lot, so I don't have to invent everything from my own imagination. Addressing your comment: > Having said that, I think I understand

Re: Datasets and Java

2019-11-28 Thread Antoine Pitrou
On 28/11/2019 at 07:26, Hongze Zhang wrote: > Thanks for referencing this, Antoine. The concepts and principles seem to be > pretty concrete so I > may take some time to read it in detail. > > BTW I noticed that by the current discussion in ticket ARROW-7272[1] it's > unclear whether

Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
Thanks for referencing this, Antoine. The concepts and principles seem to be pretty concrete so I may take some time to read it in detail. BTW I noticed that by the current discussion in ticket ARROW-7272[1] it's unclear whether this one or ipc flatbuffers could be a better approach for

Re: Datasets and Java

2019-11-27 Thread Ji Liu
Hi Francois, Thanks for the proposal and your effort. I made a simple JNI PoC before for RecordBatch/VectorSchemaRoot interaction between Java and C++[1][2]. This may help a little. Thanks, Ji Liu [1] https://github.com/tianchen92/jni-poc-java [2] https://github.com/tianchen92/jni-poc-cpp

Re: Datasets and Java

2019-11-27 Thread Francois Saint-Jacques
Hello Hongze, The C++ implementation of datasets, notably the Dataset, DataSource, DataSourceDiscovery, and Scanner classes, is not ready or designed for distributed computing. They don't serialize and they reference by pointer throughout, thus I highly doubt that you can implement parts in Java, and

Re: Datasets and Java

2019-11-27 Thread Antoine Pitrou
To set up bridges between Java and C++, the C data interface specification may help: https://github.com/apache/arrow/pull/5442 There's an implementation for C++ here, and it also includes a Python-R bridge able to share Arrow data between two different runtimes (i.e. PyArrow and R-Arrow were

Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
Hi Micah, Regarding our use cases, we'd use the API on Parquet files with some pushed filters and projectors, and we'd extend the C++ Datasets code to provide necessary support for our own data formats. > If JNI is seen as too cumbersome, another possible avenue to pursue is > writing a gRPC

Re: Datasets and Java

2019-11-27 Thread Micah Kornfield
Hi Hongze, I have a strong preference for not porting non-trivial logic from one language to another, especially if the main goal is performance. I think this will replicate bugs and cause confusion if inconsistencies occur. It is also a non-trivial amount of work to develop, review, set up CI,

Re: Datasets and Java

2019-11-26 Thread Hongze Zhang
Hi Wes and Micah, Thanks for your kind replies. Micah: We don't use the Spark (vectorized) Parquet reader because it is a pure Java implementation. Performance could be worse than doing similar work natively. Another reason is we may need to integrate some other specific data sources with

Re: Datasets and Java

2019-11-26 Thread Micah Kornfield
Hi Hongze, To add to Wes's point, there are already some efforts to do JNI for ORC (which needs to be integrated with CI) and some open PRs for Parquet in the project. However, given that you are using Spark I would expect there is already dataset functionality that is equivalent to the dataset

Re: Datasets and Java

2019-11-26 Thread Wes McKinney
hi Hongze, The Datasets functionality is indeed extremely useful, and it may make sense to have it available in many languages eventually. With Java, I would raise the issue that things are comparatively weaker there when it comes to actually reading the files themselves. Whereas we have