Hi Cindy,

I haven't tried this, but the best guidance I can give is the following:

1. Create an appropriate decoder using Avro's DecoderFactory [1].
2. Construct an Arrow adapter with a schema and the decoder. There are some examples in the unit tests [2].
3. Adapt the method Uwe describes in his blog post about JDBC [3] to use the adapter.

From there I think you can use the TensorFlow APIs (sorry, I've not used them, but my understanding is that TF only has Python APIs?).

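As a rough, untested sketch of steps 1 and 2: the code below assumes the byte[] holds raw binary-encoded records (not an Avro container file) and that the adapter classes live in the org.apache.arrow package, as in the unit test linked at [2]; later Arrow releases may place them under org.apache.arrow.adapter.avro. The "Example" schema, the sampleBytes helper, and the class name are hypothetical and only there to make the snippet self-contained. It needs the Avro and Arrow (vector, memory, avro adapter) jars on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.arrow.AvroToArrow; // assumption: newer releases use org.apache.arrow.adapter.avro
import org.apache.arrow.AvroToArrowConfig;
import org.apache.arrow.AvroToArrowConfigBuilder;
import org.apache.arrow.AvroToArrowVectorIterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroBytesToArrow {

  // Hypothetical record type; in practice this would be parsed from the *.avsc file.
  static final Schema SCHEMA = SchemaBuilder.record("Example").fields()
      .requiredInt("id")
      .requiredString("label")
      .endRecord();

  // Stand-in for the byte[] from the question: n records, binary-encoded back to back.
  static byte[] sampleBytes(int n) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(SCHEMA);
    for (int i = 0; i < n; i++) {
      GenericRecord record = new GenericData.Record(SCHEMA);
      record.put("id", i);
      record.put("label", "row-" + i);
      writer.write(record, encoder);
    }
    encoder.flush();
    return out.toByteArray();
  }

  // Decodes the bytes into Arrow batches and returns the total number of rows seen.
  static int countRows(byte[] avroBytes, int batchSize) throws Exception {
    int rows = 0;
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      // Step 1: a decoder over the serialized bytes.
      BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);
      // Step 2: the adapter, configured with an allocator and a target batch size.
      AvroToArrowConfig config = new AvroToArrowConfigBuilder(allocator)
          .setTargetBatchSize(batchSize)
          .build();
      try (AvroToArrowVectorIterator batches =
               AvroToArrow.avroToArrowIterator(SCHEMA, decoder, config)) {
        while (batches.hasNext()) {
          // Each call hands back a separate VectorSchemaRoot that the caller must close.
          try (VectorSchemaRoot root = batches.next()) {
            rows += root.getRowCount();
          }
        }
      }
    }
    return rows;
  }
}
```

Calling AvroBytesToArrow.countRows(AvroBytesToArrow.sampleBytes(5), 2) walks the batches and sums their row counts; a real pipeline would hand each root on to step 3 instead of just counting.
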
If number 3 doesn't work for you due to environment constraints, you could write out an Arrow file using the file writer [4] and see whether the examples listed in [5] help.

One thing to note is that I believe the Avro adapter library currently has an impedance mismatch with the ArrowFileWriter. The adapter returns a new VectorSchemaRoot per batch, and the writer libraries are designed around loading/unloading a single VectorSchemaRoot. I think the method with the least overhead for transferring the data is to create a VectorUnloader [6] per VectorSchemaRoot, convert it to a record batch, and then load it into the writer's VectorSchemaRoot. This will unfortunately cause some amount of memory churn due to extra allocations.

There is a short overview of working with Arrow generally available at [7].

Hope this helps,
Micah

[1] https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
[2] https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
[3] https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
[4] https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
[5] https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
[6] https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
[7] https://arrow.apache.org/docs/java/

On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <[email protected]> wrote:
> Hi -
>
> I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc
> file or SpecificRecord Java class) that I'd like to send to TensorFlow as
> input tensors, preferably via Arrow. Can you suggest some existing
> adapters or code patterns (Java or Scala) that I can use?
>
> Thanks -
>
> -- Cindy
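
The VectorUnloader approach mentioned in the reply above could look roughly like the following untested sketch. It takes any iterator of per-batch VectorSchemaRoots (which is what the Avro adapter's iterator hands back), unloads each one into a record batch, and loads that into the single root the ArrowFileWriter was built around. The class name, the intBatch helper, and the single "id" column are hypothetical stand-ins for adapter output; passing null as the dictionary provider assumes the schema has no dictionary-encoded fields.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.Schema;

public class BatchesToArrowFile {

  // Copies every incoming batch into the writer's own root and writes it out.
  // Returns the number of record batches written.
  static int writeBatches(Iterator<VectorSchemaRoot> batches, Schema schema,
                          WritableByteChannel out, BufferAllocator allocator)
      throws IOException {
    int written = 0;
    try (VectorSchemaRoot writerRoot = VectorSchemaRoot.create(schema, allocator);
         // null provider: assumes no dictionary-encoded fields in the schema.
         ArrowFileWriter writer = new ArrowFileWriter(writerRoot, null, out)) {
      VectorLoader loader = new VectorLoader(writerRoot);
      writer.start();
      while (batches.hasNext()) {
        try (VectorSchemaRoot batchRoot = batches.next();
             ArrowRecordBatch batch = new VectorUnloader(batchRoot).getRecordBatch()) {
          loader.load(batch);   // the writer's root now holds this batch's buffers
          writer.writeBatch();
          written++;
        }
      }
      writer.end();
    }
    return written;
  }

  // Hypothetical stand-in for one batch produced by the Avro adapter.
  static VectorSchemaRoot intBatch(BufferAllocator allocator, int... values) {
    IntVector vector = new IntVector("id", allocator);
    vector.allocateNew(values.length);
    for (int i = 0; i < values.length; i++) {
      vector.set(i, values[i]);
    }
    vector.setValueCount(values.length);
    return VectorSchemaRoot.of(vector);
  }

  // Writes two small batches to an in-memory Arrow file.
  static int demo() throws IOException {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      List<VectorSchemaRoot> batches = new ArrayList<>();
      batches.add(intBatch(allocator, 1, 2, 3));
      batches.add(intBatch(allocator, 4, 5));
      Schema schema = batches.get(0).getSchema();
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      return writeBatches(batches.iterator(), schema, Channels.newChannel(bytes), allocator);
    }
  }
}
```

With the Avro adapter, the iterator it returns could be passed straight in as the batches argument, with the Arrow schema taken from the first root it produces.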
