We're exploring replacing JSON with Avro as our data storage format.
Our schema is fairly messy, deeply nested, and has several nullable fields.
I'm writing a MapReduce step to convert our records from JSON to Avro.
I got a working prototype running on a simple schema. Now that I'm using the
real schema, I've hit a case where a Python prototype converts a record
successfully, while the Java prototype throws:
org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_STRING.
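For reference, the nullable fields in Cluster.avsc are declared as unions with
null. A heavily simplified sketch of one such field (significant_place matches
the sample record below; the real schema is far larger and more nested):

{"name": "significant_place", "type": ["null", "string"], "default": null}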
Java code:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.avro.Schema;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.JsonDecoder;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.commons.io.IOUtils;

public void driver() throws Exception {
    byte[] encoded = Files.readAllBytes(Paths.get("json_file"));
    String string = new String(encoded, StandardCharsets.UTF_8);
    Cluster c = deserializer(string, Cluster.getClassSchema());
}
// Decode a single JSON record against the Cluster schema.
public Cluster deserializer(String value, Schema schema) throws IOException {
    InputStream stream = IOUtils.toInputStream(value, StandardCharsets.UTF_8);
    SpecificDatumReader<Cluster> reader = new SpecificDatumReader<>(schema);
    JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, stream);
    return reader.read(null, decoder);
}
Python code:

import sys
import json
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

schema = avro.schema.parse(open('Cluster.avsc', 'rb').read())
rec_writer = DatumWriter(schema)
df_writer = DataFileWriter(open('users.avro', 'wb'), rec_writer, schema)

# One untagged JSON record per input line.
for line in sys.stdin:
    cluster_dict = json.loads(line)
    df_writer.append(cluster_dict)
df_writer.close()
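I run it along the lines of cat clusters.json | python json_to_avro.py (the
input file and script names here are just for illustration).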
The input JSON is untagged. Here's a representative subset:
{
  "start_time": 1486000000000,
  "total_place_count": 0,
  "appliedDemoMdl": false,
  "longitude": -99.990911,
  "significant_place": null,
  "version": "inferencecore-1.123.0",
  "end_time": 1486070000000,
  "latitude": 11.111182
}
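For comparison, my reading of the Avro spec is that Avro's own JSON encoding
tags every non-null union branch with its branch type, so (assuming version
and significant_place are both nullable unions) the JsonDecoder would
presumably want something like:

{
  "significant_place": null,
  "version": {"string": "inferencecore-1.123.0"}
}

I haven't actually produced that form anywhere; it's just my understanding of
why the Java side might complain.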
My main question is: why are my two implementations behaving differently? And
is what I'm trying to do even possible without writing my own object mapper
from the plain JSON representation?
Brenden