Hi François,
Thank you very much for this kind and detailed analysis! Since I am not yet
very familiar with the Arrow project, this helps me a great deal; I don't
have to invent everything from my own imagination.
Addressing your comment:
> Having said that, I think I understand
On 28/11/2019 at 07:26, Hongze Zhang wrote:
> Thanks for referencing this, Antoine. The concepts and principles seem to be
> pretty concrete, so I may take some time to read it in detail.
>
> BTW, I noticed from the current discussion in ticket ARROW-7272 [1] that it
> is not yet clear whether this one or the IPC Flatbuffers approach would be
> better for
Hi Francois,
Thanks for the proposal and your effort.
I previously made a simple JNI proof of concept for RecordBatch/VectorSchemaRoot
interaction between Java and C++ [1][2].
This may help a little.
Thanks,
Ji Liu
[1] https://github.com/tianchen92/jni-poc-java
[2] https://github.com/tianchen92/jni-poc-cpp
Hello Hongze,
The C++ implementation of datasets, notably the Dataset, DataSource,
DataSourceDiscovery, and Scanner classes, is not ready or designed for
distributed computing. These classes do not serialize, and they pass
references by pointer all around, so I highly doubt that you can implement
parts in Java, and
To set up bridges between Java and C++, the C data interface
specification may help:
https://github.com/apache/arrow/pull/5442
There's an implementation for C++ here, and it also includes a Python-R
bridge able to share Arrow data between two different runtimes (i.e.
PyArrow and R-Arrow were
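To make the C data interface idea above concrete, here is a minimal sketch. The two struct layouts follow the proposal linked above; the helper names export_int32 and release_array are my own illustrative additions, showing how a producer might hand a few int32 values across a language boundary without copying.

```c
/* Sketch of the two structs in the Arrow C data interface proposal
 * (apache/arrow PR #5442). Layout per the spec; helpers are illustrative. */
#include <stdint.h>
#include <stdlib.h>

struct ArrowSchema {
  const char* format;               /* type string, e.g. "i" for int32 */
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;             /* [0] validity bitmap, [1] data */
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Release callback: frees only the pointer array we allocated below;
 * the data buffer itself is owned by the caller in this sketch. */
static void release_array(struct ArrowArray* a) {
  free(a->buffers);
  a->release = NULL;                /* mark as released, as the spec requires */
}

/* Illustrative producer: exposes n int32 values zero-copy. */
static void export_int32(const int32_t* values, int64_t n,
                         struct ArrowArray* out) {
  out->length = n;
  out->null_count = 0;
  out->offset = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = malloc(2 * sizeof(void*));
  out->buffers[0] = NULL;           /* no validity bitmap: no nulls */
  out->buffers[1] = values;         /* zero-copy: point at caller's data */
  out->children = NULL;
  out->dictionary = NULL;
  out->release = release_array;
  out->private_data = NULL;
}
```

The key property is that the consumer only needs these struct definitions, not the producer's Arrow library, and must call release() exactly once when done.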
Hi Micah,
Regarding our use cases, we'd use the API on Parquet files with some
pushed-down filters and projections, and we'd extend the C++ Datasets code to
provide the necessary support for our own data formats.
> If JNI is seen as too cumbersome, another possible avenue to pursue is
> writing a gRPC
Hi Hongze,
I have a strong preference for not porting non-trivial logic from one
language to another, especially if the main goal is performance. I think
this would replicate bugs and cause confusion when inconsistencies occur. It
is also a non-trivial amount of work to develop, review, and set up CI for,
Hi Wes and Micah,
Thanks for your kind replies.
Micah: We don't use the Spark (vectorized) Parquet reader because it is a pure
Java implementation; performance could be worse than doing similar work
natively. Another reason is that we may need to
integrate some other specific data sources with
Hi Hongze,
To add to Wes's point, there are already some efforts to do JNI for ORC
(which needs to be integrated with CI) and some open PRs for Parquet in the
project. However, given that you are using Spark, I would expect there is
already dataset functionality that is equivalent to the dataset
Hi Hongze,
The Datasets functionality is indeed extremely useful, and it may make
sense to have it available in many languages eventually. With Java, I
would raise the issue that things are comparatively weaker there when
it comes to actually reading the files themselves. Whereas we have