We're exploring using replacing json with Avro as our data storage format.
Our schema is fairly messy, deeply nested, and has several nullable fields.
I'm writing some code to run a mapreduce step to convert from json to avro
format.

I managed to get a working prototype on a simple schema, and now I'm trying
to use the real schema, and running into a case where a Python prototype
manages to convert a record successfully, while the Java prototype throws
org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_STRING.

Java code:

public void driver() throws Exception {
    byte[] encoded = Files.readAllBytes(Paths.get("json_file"));
    String string =  new String(encoded, StandardCharsets.UTF_8);
    Cluster c = deserializer(string, Cluster.getClassSchema());
}

public Cluster deserializer(String value, Schema schema) throws IOException
{
    InputStream stream = IOUtils.toInputStream(value);
    SpecificDatumReader<Cluster> reader = new SpecificDatumReader<>(schema);
    JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, stream);
    return reader.read(null, decoder);

Python code:

import sys
import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open('Cluster.avsc', "rb").read())
rec_writer = DatumWriter(schema)
df_writer  = DataFileWriter(open("users.avro", "wb"), rec_writer, schema)

for line in sys.stdin:
  cluster_dict = json.loads(data)
  df_writer.append(cluster_dict)
df_writer.close()

The input json is untagged. Here's a representative subset:
{
  "start_time": 1486000000000,
  "total_place_count": 0,
  "appliedDemoMdl": false,
  "longitude": -99.990911,
  "significant_place": null,
  "version": "inferencecore-1.123.0",
  "end_time": 1486070000000,
  "latitude": 11.111182,
}

My main question is why are my two implementations behaving differently? Is
what I'm trying to do not really possible without writing my own object
mapper from the json representation?

Brenden

Reply via email to