Hi All,
I have a data set where each record is serialized as JSON, and I'm
interested in using SchemaRDDs to work with the data. Unfortunately I've
hit a snag: some fields in the data are maps and lists, and are not
guaranteed to be populated for each record. This seems to cause
inferSchema to throw an error:
Produces an error:

srdd = sqlCtx.inferSchema(sc.parallelize([{'foo': 'bar', 'baz': []},
                                          {'foo': 'boom', 'baz': [1, 2, 3]}]))

Works fine:

srdd = sqlCtx.inferSchema(sc.parallelize([{'foo': 'bar', 'baz': [1, 2, 3]},
                                          {'foo': 'boom', 'baz': []}]))
To be fair, inferSchema's documentation says it "peeks at the first row",
so a possible workaround would be to make sure the type of every
collection can be determined from the first record. However, I don't
believe that items in an RDD are guaranteed to remain in order, so this
approach seems somewhat brittle.
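For what it's worth, here is a minimal sketch of that workaround in plain
Python, applied to the local list before parallelizing it (the function
name is my own invention, and it assumes every record is a dict with the
same keys):

```python
def promote_full_record(records):
    """Reorder records so that one whose collection fields are all
    non-empty (if any exists) comes first, giving inferSchema's
    "peek at the first row" concrete element types to work with."""
    def fully_populated(rec):
        # A record is usable as the "representative" row only if none
        # of its list/dict fields are empty.
        return all(
            len(v) > 0
            for v in rec.values()
            if isinstance(v, (list, dict))
        )

    for i, rec in enumerate(records):
        if fully_populated(rec):
            # Move the first fully populated record to the front.
            return [rec] + records[:i] + records[i + 1:]
    # No fully populated record found; leave the order unchanged.
    return records

data = [{'foo': 'bar', 'baz': []},
        {'foo': 'boom', 'baz': [1, 2, 3]}]
reordered = promote_full_record(data)
# srdd = sqlCtx.inferSchema(sc.parallelize(reordered))
```

This only helps when the data fits in the driver as a local list, and it
still fails if no single record has every collection populated, which is
part of why I'm hoping there's something more robust.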
Does anybody know a robust solution to this problem in PySpark? I'm
running the 1.0.1 release.
-Brad