Thanks, Cindy. Feedback would be appreciated. I also filed https://issues.apache.org/jira/browse/ARROW-9613 so that the conversion can potentially be made more efficient.
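In case it helps, here is a rough, untested sketch of the three steps quoted below, including the VectorUnloader/VectorLoader workaround for the single-root ArrowFileWriter. It assumes the adapter API as of Arrow 1.0 (AvroToArrow.avroToArrowIterator, AvroToArrowConfigBuilder) and needs the arrow-avro adapter, arrow-vector, and avro artifacts on the classpath; please check the names against the javadoc and unit tests linked below before relying on it:

```java
import java.io.File;
import java.io.FileOutputStream;

import org.apache.arrow.AvroToArrow;
import org.apache.arrow.AvroToArrowConfig;
import org.apache.arrow.AvroToArrowConfigBuilder;
import org.apache.arrow.AvroToArrowVectorIterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class AvroToArrowFileSketch {

  /** Decodes serialized Avro bytes and writes them out as an Arrow file. */
  public static void writeArrowFile(byte[] avroBytes, Schema avroSchema,
                                    File outFile) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      // Step 1: create a decoder over the raw bytes.
      BinaryDecoder decoder =
          DecoderFactory.get().binaryDecoder(avroBytes, /*reuse=*/ null);

      // Step 2: construct the adapter iterator; each next() yields a
      // freshly allocated VectorSchemaRoot.
      AvroToArrowConfig config = new AvroToArrowConfigBuilder(allocator).build();
      try (AvroToArrowVectorIterator batches =
               AvroToArrow.avroToArrowIterator(avroSchema, decoder, config);
           FileOutputStream out = new FileOutputStream(outFile);
           // The writer is bound to a single root, so keep the first batch's
           // root and reload subsequent batches into it.
           VectorSchemaRoot writerRoot = batches.next()) {
        try (ArrowFileWriter writer =
                 new ArrowFileWriter(writerRoot, /*provider=*/ null,
                                     out.getChannel())) {
          writer.start();
          writer.writeBatch(); // first batch is already in writerRoot
          VectorLoader loader = new VectorLoader(writerRoot);
          while (batches.hasNext()) {
            try (VectorSchemaRoot batchRoot = batches.next();
                 ArrowRecordBatch recordBatch =
                     new VectorUnloader(batchRoot).getRecordBatch()) {
              loader.load(recordBatch); // copy into the writer's root
              writer.writeBatch();
            }
          }
          writer.end();
        }
      }
    }
  }
}
```

As noted below, the unload/load round trip per batch causes some extra allocation churn, but it keeps the writer's single-root contract intact.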
On Wed, Jul 29, 2020 at 4:16 AM Cindy McMullen <[email protected]> wrote:

> Thanks, Micah, for your thoughtful response. We'll give it a try and let
> you know how it goes.
>
> -- Cindy
>
> On Tue, Jul 28, 2020 at 10:20 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Cindy,
>> I haven't tried this, but the best guidance I can give is the following:
>> 1. Create an appropriate decoder using Avro's DecoderFactory [1].
>> 2. Construct an Arrow adapter with a schema and the decoder. There are
>> some examples in the unit tests [2].
>> 3. Adapt the method Uwe describes in his blog post about JDBC [3] to
>> use the adapter. From there I think you can use the TensorFlow APIs
>> (sorry, I've not used them, but my understanding is TF only has Python
>> APIs?).
>>
>> If number 3 doesn't work for you due to environment constraints, you
>> could write out an Arrow file using the file writer [4] and see if the
>> examples listed in [5] help.
>>
>> One thing to note: I believe the Avro adapter library currently has an
>> impedance mismatch with the ArrowFileWriter. The adapter returns a new
>> VectorSchemaRoot per batch, while the writer libraries are designed
>> around loading/unloading a single VectorSchemaRoot. I think the method
>> with the least overhead for transferring the data is to create a
>> VectorUnloader [6] per VectorSchemaRoot, convert it to a record batch,
>> and then load it into the writer's VectorSchemaRoot. This will
>> unfortunately cause some amount of memory churn due to extra
>> allocations.
>>
>> There is a short overview of working with Arrow generally available at
>> [7].
>>
>> Hope this helps,
>> Micah
>>
>> [1] https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
>> [2] https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
>> [3] https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
>> [4] https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
>> [5] https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
>> [6] https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
>> [7] https://arrow.apache.org/docs/java/
>>
>> On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <[email protected]>
>> wrote:
>>
>>> Hi -
>>>
>>> I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc
>>> file or SpecificRecord Java class) that I'd like to send to TensorFlow
>>> as input tensors, preferably via Arrow. Can you suggest some existing
>>> adapters or code patterns (Java or Scala) that I can use?
>>>
>>> Thanks -
>>>
>>> -- Cindy
>>
