Re: Reading messages written with BinaryMessageEncoder in python

Chad Preisler Tue, 01 Jun 2021 09:11:01 -0700

If I strip off the header then the Python code can deserialize the Java
message. Is there a function/class in avro.io that will strip the header
for me?


Here is the new Python code.

import avro.io
import avro.schema
import io


def read_datum(buffer, writers_schema, readers_schema=None):
    reader = io.BytesIO(buffer)
    decoder = avro.io.BinaryDecoder(reader)
    datum_reader = avro.io.DatumReader(writers_schema, readers_schema)
    return datum_reader.read(decoder)


java_binary_data = open(
    "/home/chad/app_shared_resources/avroBinaryEncoderTest/java_binary_output.avo",
    "rb").read()
schemaBytes = open(
    "/home/chad/app_shared_resources/avroBinaryEncoderTest/avroTestSchema.avsc",
    "rb").read()
print("Schema read in: " + schemaBytes.decode('UTF-8'))
schema = avro.schema.parse(schemaBytes)
print("Schema " + str(schema))
# Single-object encoding: 2-byte marker (C3 01) followed by an 8-byte schema fingerprint.
if java_binary_data[0:2] == b'\xC3\x01':
    print("need to strip header")
    java_binary_data = java_binary_data[10:]

message = read_datum(java_binary_data, schema)
print(message)
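
For reference, the header check can be pulled into a small helper. This is only a sketch using the standard library: split_single_object is a hypothetical name, not something from avro.io, and it assumes the layout described in the single object encoding section of the spec (the two-byte marker C3 01, then an 8-byte little-endian CRC-64-AVRO fingerprint of the writer's schema, then the plain binary datum). It is run here against the base64 sample quoted below.

```python
import base64
import struct

# Two-byte marker that begins a single-object encoded message (per the Avro spec).
SINGLE_OBJECT_MARKER = b"\xC3\x01"
HEADER_LENGTH = 10  # 2 marker bytes + 8 fingerprint bytes


def split_single_object(buffer):
    """Return (fingerprint, payload); fingerprint is None if no header is found.

    Hypothetical helper for illustration; it is not part of avro.io.
    """
    if len(buffer) >= HEADER_LENGTH and buffer[:2] == SINGLE_OBJECT_MARKER:
        # The fingerprint is stored as an unsigned 64-bit little-endian integer.
        (fingerprint,) = struct.unpack("<Q", buffer[2:HEADER_LENGTH])
        return fingerprint, buffer[HEADER_LENGTH:]
    # No marker: assume the buffer is already a bare binary datum.
    return None, buffer


# The base64 sample posted earlier in this thread.
sample = base64.b64decode("wwHDssAxUVqSKxhUZXN0IE1lc3NhZ2XaBQ==")
fingerprint, payload = split_single_object(sample)
print(hex(fingerprint))
print(base64.b64encode(payload).decode("ascii"))  # GFRlc3QgTWVzc2FnZdoF
```

The remaining payload is byte-for-byte the same as the jsontofrag output quoted below, so it can be handed to read_datum unchanged. The marker check is a heuristic: with a different schema, a bare datum could in principle start with the same two bytes, so comparing the fingerprint against the expected schema would be a safer test.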




On Fri, May 28, 2021 at 3:05 PM Chad Preisler <[email protected]>
wrote:

> Here is the content of the file base64 encoded.
>
> wwHDssAxUVqSKxhUZXN0IE1lc3NhZ2XaBQ==
>
> On Fri, May 28, 2021 at 12:45 PM Michael A. Smith <[email protected]>
> wrote:
>
>> > I think the issue here is that the Java BinaryMessageEncoder writes the
>> data using a special header that consists of two bytes in the beginning
>> followed by the Avro schema fingerprint.
>>
>> That sounds like Single Object Encoding
>> (https://avro.apache.org/docs/current/spec.html#single_object_encoding).
>> That's possible, but I'd find it kinda surprising just because I'd
>> expect the tools jar to use similar code to what you wrote in Java,
>> and your code doesn't explicitly write the single object encoding
>> form.
>>
>> Can you share the entire binary avro that your code produces? You can
>> run `base64` on the file and put it in the email.
>>
>> On Fri, May 28, 2021 at 11:58 AM Chad Preisler <[email protected]>
>> wrote:
>> >
>> > The function call to array produces an array of bytes. So the code is
>> writing out raw binary data. Given that I can read the data back in from
>> the output file using the Java API makes me think I am writing the data
>> correctly.
>> >
>> > I think the issue here is that the Java BinaryMessageEncoder writes the
>> data using a special header that consists of two bytes in the beginning
>> followed by the Avro schema fingerprint. I briefly looked at the Python
>> avro.io code and did not see where it would look for a fingerprint and
>> try to do schema resolution. Do you know if the Python code is doing that
>> somewhere? It looks like the python code is looking for b'Obj' followed by
>> the number 1 in the header. I only spent about an hour looking at the code
>> so I admit, I could be way off on this.
>> >
>> > Let me know what you think. I will keep digging on my end.
>> >
>> > On Fri, May 28, 2021 at 10:39 AM Michael A. Smith <[email protected]>
>> wrote:
>> >>
>> >> > I created a simple example in Java and wrote some Python to try to
>> read the record.
>> >>
>> >> I think the data your java code is producing might not be valid. I
>> >> don't know Java very well, so I can't provide specific advice there,
>> >> but I do know the java implementation comes with a tool that should
>> >> produce a good example:
>> >>
>> >> ```
>> >> $ tail -n 100 preisler.avsc preisler.json
>> >> ==> preisler.avsc <==
>> >> {
>> >>     "type": "record",
>> >>     "name": "simpleMessage",
>> >>     "fields": [
>> >>         {
>> >>             "name": "message",
>> >>             "type": "string"
>> >>         },
>> >>         {
>> >>             "name": "aNumber",
>> >>             "type": "int"
>> >>         }
>> >>     ]
>> >> }
>> >>
>> >> ==> preisler.json <==
>> >> {
>> >>   "message": "Test Message",
>> >>   "aNumber": 365
>> >> }
>> >>
>> >> $ java -jar
>> ~/dev/avro/lang/java/tools/target/avro-tools-1.11.0-SNAPSHOT.jar
>> >> jsontofrag --schema-file preisler.avsc preisler.json >
>> >> preisler.avro.frag
>> >> 21/05/28 11:25:43 WARN util.NativeCodeLoader: Unable to load
>> >> native-hadoop library for your platform... using builtin-java classes
>> >> where applicable
>> >>
>> >> $ base64 preisler.avro.frag  # so you can tell if we're getting the
>> same results
>> >> GFRlc3QgTWVzc2FnZdoF
>> >>
>> >> $ python -c 'import avro.io, avro.schema
>> >> print(
>> >>     avro.io.DatumReader(
>> >>         avro.schema.parse(open("preisler.avsc", "rb").read())
>> >>     ).read(
>> >>         avro.io.BinaryDecoder(open("preisler.avro.frag", "rb"))
>> >>     )
>> >> )'
>> >> {'message': 'Test Message', 'aNumber': 365}
>> >> ```
>> >>
>> >> Sorry my java is not better. Is it correct to change the data to
>> >> array() before writing it to a file?
>> >> (
>> https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/src/main/java/chad/preisler/avro/eamples/AvroWriteReadBinary.java#L50
>> )
>> >>
>> >> On Fri, May 28, 2021 at 10:41 AM Chad Preisler <
>> [email protected]> wrote:
>> >> >
>> >> > Here is the schema
>> >> >
>> https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/avroTestSchema.avsc
>> >> >
>> >> > On Fri, May 28, 2021 at 9:13 AM Michael A. Smith <
>> [email protected]> wrote:
>> >> >>
>> >> >> Hi, Chad,
>> >> >>
>> >> >> Did you share the schema somewhere? Is that something you're able
>> to share?
>> >> >>
>> >> >> On Fri, May 28, 2021 at 10:00 AM Chad Preisler <
>> [email protected]> wrote:
>> >> >> >
>> >> >> > Hi,
>> >> >> > I created a simple example in Java and wrote some Python to try
>> to read the record. I am getting the following error when trying to read
>> the Java record in Python.
>> >> >> >
>> >> >> > Traceback (most recent call last):
>> >> >> >   File "/home/chad/python/avroReadTest/avro_read_binary_java.py",
>> line 18, in <module>
>> >> >> >     message = read_datum(java_binary_data, schema)
>> >> >> >   File "/home/chad/python/avroReadTest/avro_read_binary_java.py",
>> line 10, in read_datum
>> >> >> >     return datum_reader.read(decoder)
>> >> >> >   File
>> "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 626, in
>> read
>> >> >> >     return self.read_data(self.writers_schema,
>> self.readers_schema, decoder)
>> >> >> >   File
>> "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 698, in
>> read_data
>> >> >> >     return self.read_record(writers_schema, readers_schema,
>> decoder)
>> >> >> >   File
>> "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 898, in
>> read_record
>> >> >> >     field_val = self.read_data(field.type, readers_field.type,
>> decoder)
>> >> >> >   File
>> "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 655, in
>> read_data
>> >> >> >     return decoder.read_utf8()
>> >> >> >   File
>> "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 312, in
>> read_utf8
>> >> >> >     return unicode(self.read_bytes(), "utf-8")
>> >> >> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in
>> position 2: invalid start byte
>> >> >> >
>> >> >> > Here is a link to the Java code.
>> >> >> >
>> https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/src/main/java/chad/preisler/avro/eamples/AvroWriteReadBinary.java
>> >> >> >
>> >> >> > I'll admit I'm fairly new to Python. Here is my Python code.
>> >> >> >
>> >> >> > import avro.io
>> >> >> > import avro.schema
>> >> >> > import io
>> >> >> >
>> >> >> >
>> >> >> > def read_datum(buffer, writers_schema, readers_schema=None):
>> >> >> >     reader = io.BytesIO(buffer)
>> >> >> >     decoder = avro.io.BinaryDecoder(reader)
>> >> >> >     datum_reader = avro.io.DatumReader(writers_schema,
>> readers_schema)
>> >> >> >     return datum_reader.read(decoder)
>> >> >> >
>> >> >> >
>> >> >> > java_binary_data =
>> open("/home/chad/app_shared_resources/avroBinaryEncoderTest/java_binary_output.avo",
>> "rb").read()
>> >> >> > schemaBytes =
>> open("/home/chad/app_shared_resources/avroBinaryEncoderTest/avroTestSchema.avsc",
>> "rb").read()
>> >> >> > print ("Schema read in: " + schemaBytes.decode('UTF-8'))
>> >> >> > schema = avro.schema.parse(schemaBytes)
>> >> >> > print("Schema " + schema.__str__())
>> >> >> > message = read_datum(java_binary_data, schema)
>> >> >> > print(message)
>> >> >> >
>> >> >> > I appreciate any help getting this working.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Chad
>> >> >> >
>> >> >> > On Thu, May 27, 2021 at 12:56 PM Michael A. Smith <
>> [email protected]> wrote:
>> >> >> >>
>> >> >> >> They should be compatible.
>> >> >> >>
>> >> >> >> Take a look at lang/py/avro/test/test_io.py in
>> >> >> >>
>> >> >> >> https://github.com/apache/avro
>> >> >> >>
>> >> >> >> Line 239 has a simple function that lays it out.
>> >> >> >>
>> >> >> >> If you encounter a way in which Java and Python are producing
>> incompatible results, please let us know.
>> >> >> >>
>> >> >> >> On Thu, May 27, 2021 at 13:05 Chad Preisler <
>> [email protected]> wrote:
>> >> >> >>>
>> >> >> >>> Hello,
>> >> >> >>>
>> >> >> >>> I am writing messages in Java using the BinaryMessageEncoder. I
>> would like to read the message in python. Is this supported, or is the
>> format written with BinaryMessageEncoder only supported in Java?
>> >> >> >>>
>> >> >> >>> If it is supported can you point me to a python example that
>> reads the binary message format in python?
>> >> >> >>>
>> >> >> >>> Thanks,
>> >> >> >>> Chad
>>
>
