The call to array() produces an array of bytes, so the code is writing
out raw binary data. Given that I can read the data back in from the
output file using the Java API, I think I am writing the data
correctly.

I think the issue here is that the Java BinaryMessageEncoder writes the
data using a special header that consists of two bytes in the beginning
followed by the Avro schema fingerprint. I briefly looked at the Python
avro.io code and did not see where it would look for a fingerprint and try
to do schema resolution. Do you know if the Python code is doing that
somewhere? It looks like the Python code is looking for b'Obj' followed by
the number 1 in the header. I only spent about an hour looking at the code,
so I admit I could be way off on this.
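
If the header really is the problem, something like the sketch below might
be enough to get at the record. It assumes the layout that, as I understand
it, the Avro spec gives for single-object encoding: the two marker bytes
C3 01, then the 8-byte little-endian CRC-64-AVRO fingerprint of the schema,
then the ordinary binary-encoded datum. The name split_single_object is
made up for this example; it is not part of the avro package, and I have
not run it against my actual file yet.

```python
import struct

# Two-byte marker that (per the Avro spec, as I read it) starts every
# single-object encoded message.
SINGLE_OBJECT_MARKER = b"\xc3\x01"
HEADER_LEN = 10  # 2 marker bytes + 8 fingerprint bytes


def split_single_object(buf):
    """Hypothetical helper: return (fingerprint, payload) for one message.

    Raises ValueError when the buffer does not start with the marker, which
    would mean it is not single-object encoded (for example a bare datum
    such as the output of jsontofrag).
    """
    if len(buf) < HEADER_LEN or buf[:2] != SINGLE_OBJECT_MARKER:
        raise ValueError("not an Avro single-object encoded message")
    # The fingerprint is stored as an unsigned 64-bit little-endian integer.
    (fingerprint,) = struct.unpack("<Q", buf[2:HEADER_LEN])
    return fingerprint, buf[HEADER_LEN:]
```

With that in place, my script would call
split_single_object(java_binary_data) and pass the returned payload to
read_datum instead of the whole buffer. The fingerprint is ignored here; a
real reader would presumably use it to look up the writer's schema.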

Let me know what you think. I will keep digging on my end.

On Fri, May 28, 2021 at 10:39 AM Michael A. Smith <mich...@smith-li.com>
wrote:

> > I created a simple example in Java and wrote some Python to try to read
> the record.
>
> I think the data your java code is producing might not be valid. I
> don't know Java very well, so I can't provide specific advice there,
> but I do know the java implementation comes with a tool that should
> produce a good example:
>
> ```
> $ tail -n 100 preisler.avsc preisler.json
> ==> preisler.avsc <==
> {
>     "type": "record",
>     "name": "simpleMessage",
>     "fields": [
>         {
>             "name": "message",
>             "type": "string"
>         },
>         {
>             "name": "aNumber",
>             "type": "int"
>         }
>     ]
> }
>
> ==> preisler.json <==
> {
>   "message": "Test Message",
>   "aNumber": 365
> }
>
> $ java -jar
> ~/dev/avro/lang/java/tools/target/avro-tools-1.11.0-SNAPSHOT.jar
> jsontofrag --schema-file preisler.avsc preisler.json >
> preisler.avro.frag
> 21/05/28 11:25:43 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes
> where applicable
>
> $ base64 preisler.avro.frag  # so you can tell if we're getting the same
> results
> GFRlc3QgTWVzc2FnZdoF
>
> $ python -c 'import avro.io, avro.schema
> print(
>     avro.io.DatumReader(
>         avro.schema.parse(open("preisler.avsc", "rb").read())
>     ).read(
>         avro.io.BinaryDecoder(open("preisler.avro.frag", "rb"))
>     )
> )'
> {'message': 'Test Message', 'aNumber': 365}
> ```
>
> Sorry my java is not better. Is it correct to change the data to
> array() before writing it to a file?
> (
> https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/src/main/java/chad/preisler/avro/eamples/AvroWriteReadBinary.java#L50
> )
>
> On Fri, May 28, 2021 at 10:41 AM Chad Preisler <chad.preis...@gmail.com>
> wrote:
> >
> > Here is the schema
> >
> https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/avroTestSchema.avsc
> >
> > On Fri, May 28, 2021 at 9:13 AM Michael A. Smith <mich...@smith-li.com>
> wrote:
> >>
> >> Hi, Chad,
> >>
> >> Did you share the schema somewhere? Is that something you're able to
> share?
> >>
> >> On Fri, May 28, 2021 at 10:00 AM Chad Preisler <chad.preis...@gmail.com>
> wrote:
> >> >
> >> > Hi,
> >> > I created a simple example in Java and wrote some Python to try to
> read the record. I am getting the following error when trying to read the
> Java record in Python.
> >> >
> >> > Traceback (most recent call last):
> >> >   File "/home/chad/python/avroReadTest/avro_read_binary_java.py",
> line 18, in <module>
> >> >     message = read_datum(java_binary_data, schema)
> >> >   File "/home/chad/python/avroReadTest/avro_read_binary_java.py",
> line 10, in read_datum
> >> >     return datum_reader.read(decoder)
> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py",
> line 626, in read
> >> >     return self.read_data(self.writers_schema, self.readers_schema,
> decoder)
> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py",
> line 698, in read_data
> >> >     return self.read_record(writers_schema, readers_schema, decoder)
> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py",
> line 898, in read_record
> >> >     field_val = self.read_data(field.type, readers_field.type,
> decoder)
> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py",
> line 655, in read_data
> >> >     return decoder.read_utf8()
> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py",
> line 312, in read_utf8
> >> >     return unicode(self.read_bytes(), "utf-8")
> >> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position
> 2: invalid start byte
> >> >
> >> > Here is a link to the Java code.
> >> >
> https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/src/main/java/chad/preisler/avro/eamples/AvroWriteReadBinary.java
> >> >
> >> > I'll admit I'm fairly new to Python. Here is my Python code.
> >> >
> >> > import avro.io
> >> > import avro.schema
> >> > import io
> >> >
> >> >
> >> > def read_datum(buffer, writers_schema, readers_schema=None):
> >> >     reader = io.BytesIO(buffer)
> >> >     decoder = avro.io.BinaryDecoder(reader)
> >> >     datum_reader = avro.io.DatumReader(writers_schema, readers_schema)
> >> >     return datum_reader.read(decoder)
> >> >
> >> >
> >> > java_binary_data =
> open("/home/chad/app_shared_resources/avroBinaryEncoderTest/java_binary_output.avo",
> "rb").read()
> >> > schemaBytes =
> open("/home/chad/app_shared_resources/avroBinaryEncoderTest/avroTestSchema.avsc",
> "rb").read()
> >> > print ("Schema read in: " + schemaBytes.decode('UTF-8'))
> >> > schema = avro.schema.parse(schemaBytes)
> >> > print("Schema " + schema.__str__())
> >> > message = read_datum(java_binary_data, schema)
> >> > print(message)
> >> >
> >> > I appreciate any help getting this working.
> >> >
> >> > Thanks,
> >> > Chad
> >> >
> >> > On Thu, May 27, 2021 at 12:56 PM Michael A. Smith <
> mich...@smith-li.com> wrote:
> >> >>
> >> >> They should be compatible.
> >> >>
> >> >> Take a look at lang/py/avro/test/test_io.py in
> >> >>
> >> >> https://github.com/apache/avro
> >> >>
> >> >> Line 239 has a simple function that lays it out.
> >> >>
> >> >> If you encounter a way in which Java and Python are producing
> incompatible results, please let us know.
> >> >>
> >> >> On Thu, May 27, 2021 at 13:05 Chad Preisler <chad.preis...@gmail.com>
> wrote:
> >> >>>
> >> >>> Hello,
> >> >>>
> >> >>> I am writing messages in Java using the BinaryMessageEncoder. I
> would like to read the message in python. Is this supported, or is the
> format written with BinaryMessageEncoder only supported in Java?
> >> >>>
> >> >>> If it is supported can you point me to a python example that reads
> the binary message format in python?
> >> >>>
> >> >>> Thanks,
> >> >>> Chad
>
