Hi,

On Spark 1.3, using Scala 2.10.4.
Given an existing DataFrame with two columns (col A = JSON string, col B = int), is it possible to create a new DataFrame from col A and automatically infer the schema (similar to when JSON is loaded/read from a file)?

Alternatively, given an existing DataFrame created by loading/reading a JSON file: after a filter is applied, is it possible to re-generate the schema so that it only shows objects present in the filtered DataFrame?

The issue is that I have CSV files where the first column is a JSON string and the second column is the object type. The JSON schema varies greatly from object type to object type. Currently I can read this into a DataFrame as text, but I can't figure out how to create DataFrames from the JSON for a given object type (without pre-defining the schemas).

I could alter the source structure to be pure JSON, including the object type as a JSON field. The issue there is that when I create the DataFrame, the schema includes the fields for all object types. When I filter by object type, the schema is still the huge schema representing all object types (so saving as Parquet, for example, I would end up with 1000+ empty columns unless I again had a predefined schema for each object type).

Any ideas other than:
1) pre-defining a schema per object type: the schemas are large and changing
2) splitting the source data by object type: I'm currently working with ~1k files per hour; after splitting I'd be working with ~50k files per hour
3) writing each object type out to disk as text and reading it back in as JSON: with repartitioning the file count could be reduced, but there would be more disk I/O

Any help would be appreciated.

Mike

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Inferring-JSON-schema-from-a-JSON-string-in-a-dataframe-column-tp24559.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
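[Editor's note: one approach that may fit on Spark 1.3 is to filter on the object-type column, pull the JSON strings out as an RDD[String], and pass that to SQLContext.jsonRDD, which infers the schema only from the rows it actually sees, so each per-type DataFrame gets its own narrow schema. A minimal sketch for spark-shell (Spark 1.3.x), where `sc` and `sqlContext` are predefined; the column names `json` and `objType` and the sample records are placeholders for the real data:]

```scala
// Sketch, assuming Spark 1.3.x spark-shell with sc / sqlContext predefined.
// "json" and "objType" are placeholder column names for the CSV-derived DataFrame.

// Stand-in for the CSV load: (json string, object type)
val raw = sc.parallelize(Seq(
  ("""{"id":1,"amount":9.99}""", "order"),
  ("""{"id":2,"name":"bob"}""",  "user")))
val df = sqlContext.createDataFrame(raw).toDF("json", "objType")

// Keep one object type, then re-infer the schema from just those rows.
val orderJson = df.filter(df("objType") === "order")
                  .select("json").rdd.map(_.getString(0))
val orderDf = sqlContext.jsonRDD(orderJson)  // schema inferred from "order" rows only
orderDf.printSchema()                        // only fields present in the "order" JSON
```

[Note that `jsonRDD` was deprecated in Spark 1.4 in favor of `sqlContext.read.json(rdd)`; either way the inference step costs an extra pass over the filtered data, but it avoids pre-defining a schema per object type.]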