Hi Micah,

Thanks for the fantastic summary of what to do.
I’ll have a play with it in the next few weeks. Will keep you posted.

Chris

> On 12 Jun 2020, at 2:05 pm, Micah Kornfield <[email protected]> wrote:
>
> Hi Chris,
> There isn't anything prepackaged for this use case as far as I know. As Uwe
> mentioned, it would probably be nice to build something using the C interface
> for this purpose, but I think you should be able to do it today as described
> below.
>
> I think you can pass ArrowBuf pointers to Python via foreign_buffer [1], but
> as far as I know, you would probably have to do some amount of manual
> reconstruction of arrays from buffers. The rough steps would be:
> 1. Serialize the schema on the Java side [2] and obtain a memory
> address from it to share with Python (via foreign_buffer).
> 2. Deserialize the schema on the Python side using pyarrow.ipc.read_schema
> [3].
> 3. Extract the buffer addresses/lengths in Java (example from Gandiva [4]) and
> reconstruct them with foreign_buffer.
> 4. Traverse the DataTypes in the pyarrow schema to reconstruct the arrays [5]
> based on the number of buffers required [6].
>
> If you do end up doing this, then I think #4 might make a nice contribution
> to the project.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html
> [2] https://arrow.apache.org/docs/java/org/apache/arrow/vector/ipc/message/MessageSerializer.html#serializeMetadata-org.apache.arrow.vector.types.pojo.Schema
> [3] https://github.com/apache/arrow/blob/1164079d5442c3910c18549bfcd2e68d4554b909/python/pyarrow/ipc.pxi#L577
> [4] https://github.com/apache/arrow/blob/17bdb5af9b3c63f6cbef57e88a6d2513e781b532/java/gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/Projector.java#L139
> [5] https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.from_buffers
> [6] https://arrow.apache.org/docs/python/generated/pyarrow.DataType.html#pyarrow.DataType.num_buffers
>
> On Mon, Jun 8, 2020 at 12:55 AM Chris Zheng <[email protected]> wrote:
> That blog post is really good. However, I’d like to do this in a running JVM
> as opposed to a Python program.
>
>> On 8 Jun 2020, at 11:24 am, Micah Kornfield <[email protected]> wrote:
>>
>> Uwe wrote a blog post [1] on how to do this with PY4J a while ago. I think
>> this ends up being zero copy but not 100% sure.
>>
>> [1] https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
