If I strip off the header then the Python code can deserialize the Java message. Is there a function/class in avro.io that will strip the header for me?
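A minimal sketch of that header check, assuming the single-object encoding layout described in the Avro spec (the two-byte marker C3 01, then the 8-byte little-endian CRC-64-AVRO fingerprint of the schema, then the plain binary-encoded datum). `split_single_object` is a hypothetical helper name and not a function in avro.io; the sample bytes are the base64 content quoted further down in this thread.

```python
import base64

# Layout assumed from the Avro spec's "single object encoding" section:
# 2-byte marker C3 01, then the 8-byte little-endian CRC-64-AVRO
# fingerprint of the writer schema, then the plain binary-encoded datum.
SINGLE_OBJECT_MARKER = b"\xc3\x01"
HEADER_LENGTH = len(SINGLE_OBJECT_MARKER) + 8


def split_single_object(data):
    """Return (fingerprint, payload); hypothetical helper, not part of avro.io.

    fingerprint is None when the marker is absent, in which case the data
    is returned unchanged on the assumption that it is a bare datum.
    """
    if len(data) < HEADER_LENGTH or not data.startswith(SINGLE_OBJECT_MARKER):
        return None, data
    fingerprint = int.from_bytes(data[2:HEADER_LENGTH], "little")
    return fingerprint, data[HEADER_LENGTH:]


# The base64 sample quoted in this thread (output of the Java program).
java_bytes = base64.b64decode("wwHDssAxUVqSKxhUZXN0IE1lc3NhZ2XaBQ==")
fingerprint, payload = split_single_object(java_bytes)
print(hex(fingerprint))
# The payload re-encodes to the same base64 as the jsontofrag output
# quoted in this thread: GFRlc3QgTWVzc2FnZdoF
print(base64.b64encode(payload).decode("ascii"))
```

The payload can then be passed to read_datum as before. The sketch only splits the bytes; it does not check the fingerprint against the schema.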
Here is the new Python code.

import avro.io
import avro.schema
import io


def read_datum(buffer, writers_schema, readers_schema=None):
    reader = io.BytesIO(buffer)
    decoder = avro.io.BinaryDecoder(reader)
    datum_reader = avro.io.DatumReader(writers_schema, readers_schema)
    return datum_reader.read(decoder)


java_binary_data = open("/home/chad/app_shared_resources/avroBinaryEncoderTest/java_binary_output.avo", "rb").read()
schemaBytes = open("/home/chad/app_shared_resources/avroBinaryEncoderTest/avroTestSchema.avsc", "rb").read()
print ("Schema read in: " + schemaBytes.decode('UTF-8'))
schema = avro.schema.parse(schemaBytes)
print("Schema " + schema.__str__())
if (java_binary_data[0:2] == b'\xC3\x01'):
    print("need to strip header")
    java_binary_data = java_binary_data[10:]
message = read_datum(java_binary_data, schema)
print(message)

On Fri, May 28, 2021 at 3:05 PM Chad Preisler <chad.preis...@gmail.com> wrote:

> Here is the content of the file base64 encoded.
>
> wwHDssAxUVqSKxhUZXN0IE1lc3NhZ2XaBQ==
>
> On Fri, May 28, 2021 at 12:45 PM Michael A. Smith <mich...@smith-li.com> wrote:
>
>> > I think the issue here is that the Java BinaryMessageEncoder writes the data using a special header that consists of two bytes in the beginning followed by the Avro schema fingerprint.
>>
>> That sounds like Single Object Encoding (https://avro.apache.org/docs/current/spec.html#single_object_encoding). That's possible, but I'd find it kinda surprising just because I'd expect the tools jar to use similar code to what you wrote in Java, and your code doesn't explicitly write the single object encoding form.
>>
>> Can you share the entire binary avro that your code produces? You can run `base64` on the file and put it in the email.
>>
>> On Fri, May 28, 2021 at 11:58 AM Chad Preisler <chad.preis...@gmail.com> wrote:
>> >
>> > The function call to array produces an array of bytes. So the code is writing out raw binary data.
>> > Being able to read the data back in from the output file using the Java API makes me think I am writing the data correctly.
>> >
>> > I think the issue here is that the Java BinaryMessageEncoder writes the data using a special header that consists of two bytes in the beginning followed by the Avro schema fingerprint. I briefly looked at the Python avro.io code and did not see where it would look for a fingerprint and try to do schema resolution. Do you know if the Python code is doing that somewhere? It looks like the Python code is looking for b'Obj' followed by the number 1 in the header. I only spent about an hour looking at the code, so I admit I could be way off on this.
>> >
>> > Let me know what you think. I will keep digging on my end.
>> >
>> > On Fri, May 28, 2021 at 10:39 AM Michael A. Smith <mich...@smith-li.com> wrote:
>> >>
>> >> > I created a simple example in Java and wrote some Python to try to read the record.
>> >>
>> >> I think the data your java code is producing might not be valid. I don't know Java very well, so I can't provide specific advice there, but I do know the java implementation comes with a tool that should produce a good example:
>> >>
>> >> ```
>> >> $ tail -n 100 preisler.avsc preisler.json
>> >> ==> preisler.avsc <==
>> >> {
>> >>   "type": "record",
>> >>   "name": "simpleMessage",
>> >>   "fields": [
>> >>     {
>> >>       "name": "message",
>> >>       "type": "string"
>> >>     },
>> >>     {
>> >>       "name": "aNumber",
>> >>       "type": "int"
>> >>     }
>> >>   ]
>> >> }
>> >>
>> >> ==> preisler.json <==
>> >> {
>> >>   "message": "Test Message",
>> >>   "aNumber": 365
>> >> }
>> >>
>> >> $ java -jar ~/dev/avro/lang/java/tools/target/avro-tools-1.11.0-SNAPSHOT.jar jsontofrag --schema-file preisler.avsc preisler.json > preisler.avro.frag
>> >> 21/05/28 11:25:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
>> >> using builtin-java classes where applicable
>> >>
>> >> $ base64 preisler.avro.frag # so you can tell if we're getting the same results
>> >> GFRlc3QgTWVzc2FnZdoF
>> >>
>> >> $ python -c 'import avro.io, avro.schema
>> >> print(
>> >>     avro.io.DatumReader(
>> >>         avro.schema.parse(open("preisler.avsc", "rb").read())
>> >>     ).read(
>> >>         avro.io.BinaryDecoder(open("preisler.avro.frag", "rb"))
>> >>     )
>> >> )'
>> >> {'message': 'Test Message', 'aNumber': 365}
>> >> ```
>> >>
>> >> Sorry my java is not better. Is it correct to change the data to array() before writing it to a file? (https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/src/main/java/chad/preisler/avro/eamples/AvroWriteReadBinary.java#L50)
>> >>
>> >> On Fri, May 28, 2021 at 10:41 AM Chad Preisler <chad.preis...@gmail.com> wrote:
>> >> >
>> >> > Here is the schema
>> >> > https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/avroTestSchema.avsc
>> >> >
>> >> > On Fri, May 28, 2021 at 9:13 AM Michael A. Smith <mich...@smith-li.com> wrote:
>> >> >>
>> >> >> Hi, Chad,
>> >> >>
>> >> >> Did you share the schema somewhere? Is that something you're able to share?
>> >> >>
>> >> >> On Fri, May 28, 2021 at 10:00 AM Chad Preisler <chad.preis...@gmail.com> wrote:
>> >> >> >
>> >> >> > Hi,
>> >> >> > I created a simple example in Java and wrote some Python to try to read the record. I am getting the following error when trying to read the Java record in Python.
>> >> >> >
>> >> >> > Traceback (most recent call last):
>> >> >> >   File "/home/chad/python/avroReadTest/avro_read_binary_java.py", line 18, in <module>
>> >> >> >     message = read_datum(java_binary_data, schema)
>> >> >> >   File "/home/chad/python/avroReadTest/avro_read_binary_java.py", line 10, in read_datum
>> >> >> >     return datum_reader.read(decoder)
>> >> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 626, in read
>> >> >> >     return self.read_data(self.writers_schema, self.readers_schema, decoder)
>> >> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 698, in read_data
>> >> >> >     return self.read_record(writers_schema, readers_schema, decoder)
>> >> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 898, in read_record
>> >> >> >     field_val = self.read_data(field.type, readers_field.type, decoder)
>> >> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 655, in read_data
>> >> >> >     return decoder.read_utf8()
>> >> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 312, in read_utf8
>> >> >> >     return unicode(self.read_bytes(), "utf-8")
>> >> >> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte
>> >> >> >
>> >> >> > Here is a link to the Java code.
>> >> >> > https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/src/main/java/chad/preisler/avro/eamples/AvroWriteReadBinary.java
>> >> >> >
>> >> >> > I'll admit I'm fairly new to Python. Here is my Python code.
>> >> >> >
>> >> >> > import avro.io
>> >> >> > import avro.schema
>> >> >> > import io
>> >> >> >
>> >> >> >
>> >> >> > def read_datum(buffer, writers_schema, readers_schema=None):
>> >> >> >     reader = io.BytesIO(buffer)
>> >> >> >     decoder = avro.io.BinaryDecoder(reader)
>> >> >> >     datum_reader = avro.io.DatumReader(writers_schema, readers_schema)
>> >> >> >     return datum_reader.read(decoder)
>> >> >> >
>> >> >> >
>> >> >> > java_binary_data = open("/home/chad/app_shared_resources/avroBinaryEncoderTest/java_binary_output.avo", "rb").read()
>> >> >> > schemaBytes = open("/home/chad/app_shared_resources/avroBinaryEncoderTest/avroTestSchema.avsc", "rb").read()
>> >> >> > print ("Schema read in: " + schemaBytes.decode('UTF-8'))
>> >> >> > schema = avro.schema.parse(schemaBytes)
>> >> >> > print("Schema " + schema.__str__())
>> >> >> > message = read_datum(java_binary_data, schema)
>> >> >> > print(message)
>> >> >> >
>> >> >> > I appreciate any help getting this working.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Chad
>> >> >> >
>> >> >> > On Thu, May 27, 2021 at 12:56 PM Michael A. Smith <mich...@smith-li.com> wrote:
>> >> >> >>
>> >> >> >> They should be compatible.
>> >> >> >>
>> >> >> >> Take a look at lang/py/avro/test/test_io.py in
>> >> >> >>
>> >> >> >> https://github.com/apache/avro
>> >> >> >>
>> >> >> >> Line 239 has a simple function that lays it out.
>> >> >> >>
>> >> >> >> If you encounter a way in which Java and Python are producing incompatible results, please let us know.
>> >> >> >>
>> >> >> >> On Thu, May 27, 2021 at 13:05 Chad Preisler <chad.preis...@gmail.com> wrote:
>> >> >> >>>
>> >> >> >>> Hello,
>> >> >> >>>
>> >> >> >>> I am writing messages in Java using the BinaryMessageEncoder. I would like to read the message in python. Is this supported, or is the format written with BinaryMessageEncoder only supported in Java?
>> >> >> >>>
>> >> >> >>> If it is supported can you point me to a python example that reads the binary message format in python?
>> >> >> >>>
>> >> >> >>> Thanks,
>> >> >> >>> Chad