> I think the issue here is that the Java BinaryMessageEncoder writes the data 
> using a special header that consists of two bytes in the beginning followed 
> by the Avro schema fingerprint.

That sounds like Single Object Encoding
(https://avro.apache.org/docs/current/spec.html#single_object_encoding).
That's possible, but I'd find it kinda surprising just because I'd
expect the tools jar to use similar code to what you wrote in Java,
and your code doesn't explicitly write the single object encoding
form.
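
For reference, here is a minimal sketch of how one might tell the framings apart by their leading bytes. Per the spec, single object encoding starts with the two-byte marker C3 01, followed by the 8-byte little-endian CRC-64-AVRO fingerprint of the schema, followed by the ordinary binary-encoded datum. The helper names `classify_framing` and `split_single_object` are made up for illustration and are not part of the avro package, and the check is only a heuristic, since a raw datum could in principle begin with the same bytes.

```python
import struct

CONTAINER_MAGIC = b"Obj\x01"        # object container file header
SINGLE_OBJECT_MARKER = b"\xc3\x01"  # single object encoding marker


def classify_framing(payload: bytes) -> str:
    """Guess the framing of an Avro payload from its first bytes (hypothetical helper)."""
    if payload.startswith(CONTAINER_MAGIC):
        return "container"
    # Marker (2 bytes) plus fingerprint (8 bytes) means at least 10 bytes.
    if payload.startswith(SINGLE_OBJECT_MARKER) and len(payload) >= 10:
        return "single-object"
    return "raw"


def split_single_object(payload: bytes):
    """Return (fingerprint, body) for a single-object-encoded payload (hypothetical helper)."""
    if classify_framing(payload) != "single-object":
        raise ValueError("payload does not start with the C3 01 marker")
    # Bytes 2..9 hold the schema fingerprint as an unsigned little-endian 64-bit integer.
    (fingerprint,) = struct.unpack("<Q", payload[2:10])
    return fingerprint, payload[10:]
```

If the file does turn out to be single-object encoded, the `body` returned above is what would be handed to `avro.io.BinaryDecoder`, wrapped in an `io.BytesIO`, instead of the whole file.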

Can you share the entire binary avro that your code produces? You can
run `base64` on the file and put it in the email.
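
To show what I would compare your bytes against, here is a rough sketch that decodes the `GFRlc3QgTWVzc2FnZdoF` fragment from my earlier message (quoted below) by hand, using only the standard library. It assumes the simpleMessage schema, so a string followed by an int, and the helpers `read_long` and `read_string` are illustrative stand-ins rather than the avro package's own decoder.

```python
import base64
import io


def read_long(stream) -> int:
    """Read one Avro int/long: a little-endian base-128 varint, zigzag encoded."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("ran out of bytes inside a varint")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    # Undo the zigzag mapping (0, -1, 1, -2, ... were stored as 0, 1, 2, 3, ...).
    return (accum >> 1) ^ -(accum & 1)


def read_string(stream) -> str:
    """Read an Avro string: a long byte count followed by that many UTF-8 bytes."""
    length = read_long(stream)
    return stream.read(length).decode("utf-8")


fragment = base64.b64decode("GFRlc3QgTWVzc2FnZdoF")
stream = io.BytesIO(fragment)
record = {"message": read_string(stream), "aNumber": read_long(stream)}
print(record)  # {'message': 'Test Message', 'aNumber': 365}
```

A raw fragment for this schema starts with the string length (0x18 here, meaning 12 bytes), so extra leading bytes in your file would point to some kind of header in front of the datum.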

On Fri, May 28, 2021 at 11:58 AM Chad Preisler <chad.preis...@gmail.com> wrote:
>
> The function call to array produces an array of bytes. So the code is writing 
> out raw binary data. Given that I can read the data back in from the output 
> file using the Java API makes me think I am writing the data correctly.
>
> I think the issue here is that the Java BinaryMessageEncoder writes the data 
> using a special header that consists of two bytes in the beginning followed 
> by the Avro schema fingerprint. I briefly looked at the Python avro.io code 
> and did not see where it would look for a fingerprint and try to do schema 
> resolution. Do you know if the Python code is doing that somewhere? It looks 
> like the python code is looking for b'Obj' followed by the number 1 in the 
> header. I only spent about an hour looking at the code, so I admit I could be 
> way off on this.
>
> Let me know what you think. I will keep digging on my end.
>
> On Fri, May 28, 2021 at 10:39 AM Michael A. Smith <mich...@smith-li.com> 
> wrote:
>>
>> > I created a simple example in Java and wrote some Python to try to read 
>> > the record.
>>
>> I think the data your java code is producing might not be valid. I
>> don't know Java very well, so I can't provide specific advice there,
>> but I do know the java implementation comes with a tool that should
>> produce a good example:
>>
>> ```
>> $ tail -n 100 preisler.avsc preisler.json
>> ==> preisler.avsc <==
>> {
>>     "type": "record",
>>     "name": "simpleMessage",
>>     "fields": [
>>         {
>>             "name": "message",
>>             "type": "string"
>>         },
>>         {
>>             "name": "aNumber",
>>             "type": "int"
>>         }
>>     ]
>> }
>>
>> ==> preisler.json <==
>> {
>>   "message": "Test Message",
>>   "aNumber": 365
>> }
>>
>> $ java -jar ~/dev/avro/lang/java/tools/target/avro-tools-1.11.0-SNAPSHOT.jar \
>>     jsontofrag --schema-file preisler.avsc preisler.json > preisler.avro.frag
>> 21/05/28 11:25:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>
>> $ base64 preisler.avro.frag  # so you can tell if we're getting the same results
>> GFRlc3QgTWVzc2FnZdoF
>>
>> $ python -c 'import avro.io, avro.schema
>> print(
>>     avro.io.DatumReader(
>>         avro.schema.parse(open("preisler.avsc", "rb").read())
>>     ).read(
>>         avro.io.BinaryDecoder(open("preisler.avro.frag", "rb"))
>>     )
>> )'
>> {'message': 'Test Message', 'aNumber': 365}
>> ```
>>
>> Sorry my java is not better. Is it correct to change the data to
>> array() before writing it to a file?
>> (https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/src/main/java/chad/preisler/avro/eamples/AvroWriteReadBinary.java#L50)
>>
>> On Fri, May 28, 2021 at 10:41 AM Chad Preisler <chad.preis...@gmail.com> 
>> wrote:
>> >
>> > Here is the schema
>> > https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/avroTestSchema.avsc
>> >
>> > On Fri, May 28, 2021 at 9:13 AM Michael A. Smith <mich...@smith-li.com> 
>> > wrote:
>> >>
>> >> Hi, Chad,
>> >>
>> >> Did you share the schema somewhere? Is that something you're able to 
>> >> share?
>> >>
>> >> On Fri, May 28, 2021 at 10:00 AM Chad Preisler <chad.preis...@gmail.com> 
>> >> wrote:
>> >> >
>> >> > Hi,
>> >> > I created a simple example in Java and wrote some Python to try to read 
>> >> > the record. I am getting the following error when trying to read the 
>> >> > Java record in Python.
>> >> >
>> >> > Traceback (most recent call last):
>> >> >   File "/home/chad/python/avroReadTest/avro_read_binary_java.py", line 18, in <module>
>> >> >     message = read_datum(java_binary_data, schema)
>> >> >   File "/home/chad/python/avroReadTest/avro_read_binary_java.py", line 10, in read_datum
>> >> >     return datum_reader.read(decoder)
>> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 626, in read
>> >> >     return self.read_data(self.writers_schema, self.readers_schema, decoder)
>> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 698, in read_data
>> >> >     return self.read_record(writers_schema, readers_schema, decoder)
>> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 898, in read_record
>> >> >     field_val = self.read_data(field.type, readers_field.type, decoder)
>> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 655, in read_data
>> >> >     return decoder.read_utf8()
>> >> >   File "/home/chad/.local/lib/python3.8/site-packages/avro/io.py", line 312, in read_utf8
>> >> >     return unicode(self.read_bytes(), "utf-8")
>> >> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte
>> >> >
>> >> > Here is a link to the Java code.
>> >> > https://gitlab.com/chad.preisler/avrojavabinaryencoderexample/-/blob/main/src/main/java/chad/preisler/avro/eamples/AvroWriteReadBinary.java
>> >> >
>> >> > I'll admit I'm fairly new to Python. Here is my Python code.
>> >> >
>> >> > import avro.io
>> >> > import avro.schema
>> >> > import io
>> >> >
>> >> >
>> >> > def read_datum(buffer, writers_schema, readers_schema=None):
>> >> >     reader = io.BytesIO(buffer)
>> >> >     decoder = avro.io.BinaryDecoder(reader)
>> >> >     datum_reader = avro.io.DatumReader(writers_schema, readers_schema)
>> >> >     return datum_reader.read(decoder)
>> >> >
>> >> >
>> >> > java_binary_data = open(
>> >> >     "/home/chad/app_shared_resources/avroBinaryEncoderTest/java_binary_output.avo",
>> >> >     "rb",
>> >> > ).read()
>> >> > schemaBytes = open(
>> >> >     "/home/chad/app_shared_resources/avroBinaryEncoderTest/avroTestSchema.avsc",
>> >> >     "rb",
>> >> > ).read()
>> >> > print("Schema read in: " + schemaBytes.decode("UTF-8"))
>> >> > schema = avro.schema.parse(schemaBytes)
>> >> > print("Schema " + str(schema))
>> >> > message = read_datum(java_binary_data, schema)
>> >> > print(message)
>> >> >
>> >> > I appreciate any help getting this working.
>> >> >
>> >> > Thanks,
>> >> > Chad
>> >> >
>> >> > On Thu, May 27, 2021 at 12:56 PM Michael A. Smith 
>> >> > <mich...@smith-li.com> wrote:
>> >> >>
>> >> >> They should be compatible.
>> >> >>
>> >> >> Take a look at lang/py/avro/test/test_io.py in
>> >> >>
>> >> >> https://github.com/apache/avro
>> >> >>
>> >> >> Line 239 has a simple function that lays it out.
>> >> >>
>> >> >> If you encounter a way in which Java and Python are producing 
>> >> >> incompatible results, please let us know.
>> >> >>
>> >> >> On Thu, May 27, 2021 at 13:05 Chad Preisler <chad.preis...@gmail.com> 
>> >> >> wrote:
>> >> >>>
>> >> >>> Hello,
>> >> >>>
>> >> >>> I am writing messages in Java using the BinaryMessageEncoder. I would 
>> >> >>> like to read the message in python. Is this supported, or is the 
>> >> >>> format written with BinaryMessageEncoder only supported in Java?
>> >> >>>
>> >> >>> If it is supported can you point me to a python example that reads 
>> >> >>> the binary message format in python?
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Chad
