Thanks, Cindy. Feedback would be appreciated. I also filed https://issues.apache.org/jira/browse/ARROW-9613 so that the conversion can potentially be made more efficient.
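In case it helps, here is a rough, untested sketch of the three steps quoted below, including the VectorUnloader/VectorLoader workaround for the single-root ArrowFileWriter. It assumes the adapter API as of Arrow 1.0 (AvroToArrow.avroToArrowIterator, AvroToArrowConfigBuilder) and needs the arrow-avro adapter, arrow-vector, and avro artifacts on the classpath; please check the names against the javadoc and unit tests linked below before relying on it:

```java
import java.io.File;
import java.io.FileOutputStream;

import org.apache.arrow.AvroToArrow;
import org.apache.arrow.AvroToArrowConfig;
import org.apache.arrow.AvroToArrowConfigBuilder;
import org.apache.arrow.AvroToArrowVectorIterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class AvroToArrowFileSketch {

  /** Decodes serialized Avro bytes and writes them out as an Arrow file. */
  public static void writeArrowFile(byte[] avroBytes, Schema avroSchema,
                                    File outFile) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      // Step 1: create a decoder over the raw bytes.
      BinaryDecoder decoder =
          DecoderFactory.get().binaryDecoder(avroBytes, /*reuse=*/ null);

      // Step 2: construct the adapter iterator; each next() yields a
      // freshly allocated VectorSchemaRoot.
      AvroToArrowConfig config = new AvroToArrowConfigBuilder(allocator).build();
      try (AvroToArrowVectorIterator batches =
               AvroToArrow.avroToArrowIterator(avroSchema, decoder, config);
           FileOutputStream out = new FileOutputStream(outFile);
           // The writer is bound to a single root, so keep the first batch's
           // root and reload subsequent batches into it.
           VectorSchemaRoot writerRoot = batches.next()) {
        try (ArrowFileWriter writer =
                 new ArrowFileWriter(writerRoot, /*provider=*/ null,
                                     out.getChannel())) {
          writer.start();
          writer.writeBatch(); // first batch is already in writerRoot
          VectorLoader loader = new VectorLoader(writerRoot);
          while (batches.hasNext()) {
            try (VectorSchemaRoot batchRoot = batches.next();
                 ArrowRecordBatch recordBatch =
                     new VectorUnloader(batchRoot).getRecordBatch()) {
              loader.load(recordBatch); // copy into the writer's root
              writer.writeBatch();
            }
          }
          writer.end();
        }
      }
    }
  }
}
```

As noted below, the unload/load round trip per batch causes some extra allocation churn, but it keeps the writer's single-root contract intact.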
On Wed, Jul 29, 2020 at 4:16 AM Cindy McMullen <[email protected]> wrote:

> Thanks, Micah, for your thoughtful response. We'll give it a try and let
> you know how it goes.
>
> -- Cindy
>
> On Tue, Jul 28, 2020 at 10:20 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Cindy,
>> I haven't tried this, but the best guidance I can give is the following:
>> 1. Create an appropriate decoder using Avro's DecoderFactory [1].
>> 2. Construct an Arrow adapter with a schema and the decoder. There are
>> some examples in the unit tests [2].
>> 3. Adapt the method Uwe describes in his blog post about JDBC [3] to
>> use the adapter. From there I think you can use the TensorFlow APIs
>> (sorry, I've not used them, but my understanding is TF only has Python
>> APIs?).
>>
>> If number 3 doesn't work for you due to environment constraints, you
>> could write out an Arrow file using the file writer [4] and see if the
>> examples listed in [5] help.
>>
>> One thing to note: I believe the Avro adapter library currently has an
>> impedance mismatch with the ArrowFileWriter. The adapter returns a new
>> VectorSchemaRoot per batch, while the writer libraries are designed
>> around loading/unloading a single VectorSchemaRoot. I think the method
>> with the least overhead for transferring the data is to create a
>> VectorUnloader [6] per VectorSchemaRoot, convert it to a record batch,
>> and then load it into the writer's VectorSchemaRoot. This will
>> unfortunately cause some amount of memory churn due to extra
>> allocations.
>>
>> There is a short overview of working with Arrow generally available at
>> [7].
>>
>> Hope this helps,
>> Micah
>>
>> [1] https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
>> [2] https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
>> [3] https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
>> [4] https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
>> [5] https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
>> [6] https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
>> [7] https://arrow.apache.org/docs/java/
>>
>> On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <[email protected]>
>> wrote:
>>
>>> Hi -
>>>
>>> I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc
>>> file or SpecificRecord Java class) that I'd like to send to TensorFlow
>>> as input tensors, preferably via Arrow. Can you suggest some existing
>>> adapters or code patterns (Java or Scala) that I can use?
>>>
>>> Thanks -
>>>
>>> -- Cindy
>>
