Hi Brenden,

If your schema uses unions (and if you have nullable fields, then it must), then the JSON serialization that Avro's JSON encoders produce and its JSON decoders expect (at least in the Java and C++ implementations) explicitly encodes the union branch's type in the data. The Python Avro library doesn't have its own JSON encoder and decoder; in your code you parse the input with the standard Python json module and append the resulting dicts to a binary data file, so no union tag is ever required on that path. The Java side, on the other hand, feeds the untagged JSON to JsonDecoder, which does expect the tag. That is probably why you have no issues on the Python side but deserialization fails in Java.
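To make that concrete, here is a rough sketch of what Avro's JSON encoding would expect for two of the fields in your sample, assuming both are declared as ["null", "string"] unions (which fields are actually unions depends on your real schema):

    {
      "significant_place": null,
      "version": {"string": "inferencecore-1.123.0"}
    }

A null branch is written as plain null, but a non-null branch has to be wrapped in an object keyed by the branch's type name. Your input has the bare value "inferencecore-1.123.0" instead of the wrapped form, which is exactly the mismatch behind "Expected start-union. Got VALUE_STRING."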
Thanks,
Tim

On Tue, Feb 14, 2017 at 6:09 PM, Brenden Brown <[email protected]> wrote:
> We're exploring replacing json with Avro as our data storage format.
> Our schema is fairly messy, deeply nested, and has several nullable fields.
> I'm writing some code to run a mapreduce step to convert from json to avro
> format.
>
> I managed to get a working prototype on a simple schema, and now I'm
> trying to use the real schema, and running into a case where a Python
> prototype manages to convert a record successfully, while the Java
> prototype throws org.apache.avro.AvroTypeException: Expected start-union.
> Got VALUE_STRING.
>
> Java code:
>
>     public void driver() throws Exception {
>         byte[] encoded = Files.readAllBytes(Paths.get("json_file"));
>         String string = new String(encoded, StandardCharsets.UTF_8);
>         Cluster c = deserializer(string, Cluster.getClassSchema());
>     }
>
>     public Cluster deserializer(String value, Schema schema) throws IOException {
>         InputStream stream = IOUtils.toInputStream(value);
>         SpecificDatumReader<Cluster> reader = new SpecificDatumReader<>(schema);
>         JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, stream);
>         return reader.read(null, decoder);
>     }
>
> Python code:
>
>     import sys
>     import json
>     import avro.schema
>     from avro.datafile import DataFileReader, DataFileWriter
>     from avro.io import DatumReader, DatumWriter
>
>     schema = avro.schema.parse(open('Cluster.avsc', "rb").read())
>     rec_writer = DatumWriter(schema)
>     df_writer = DataFileWriter(open("users.avro", "wb"), rec_writer, schema)
>
>     for line in sys.stdin:
>         cluster_dict = json.loads(line)
>         df_writer.append(cluster_dict)
>     df_writer.close()
>
> The input json is untagged. Here's a representative subset:
>
>     {
>       "start_time": 1486000000000,
>       "total_place_count": 0,
>       "appliedDemoMdl": false,
>       "longitude": -99.990911,
>       "significant_place": null,
>       "version": "inferencecore-1.123.0",
>       "end_time": 1486070000000,
>       "latitude": 11.111182,
>     }
>
> My main question is why are my two implementations behaving differently?
> Is what I'm trying to do not really possible without writing my own object
> mapper from the json representation?
>
> Brenden
