Hi,

On Spark 1.3, using Scala 2.10.4.
Given an existing DataFrame with two columns (col A = JSON string, col B = int), is it possible to create a new DataFrame from col A and automatically infer the schema (similar to when JSON is loaded/read from a file)?

Alternatively, given an existing DataFrame created by loading/reading a JSON file: after a filter is applied, is it possible to re-generate the schema so that it only shows objects present in the filtered DataFrame?

The issue is that I have CSV files where the first column is a JSON string and the second column is the object type. The JSON schema varies greatly from object type to object type. Currently I can read this into a DataFrame as text, but I can't figure out how to create DataFrames from the JSON for a given object type (without pre-defining the schemas).

I could alter the source structure to be pure JSON, including the object type as a JSON field. The issue there is that when I create the DataFrame, the schema includes the fields for all object types. When I filter by object type, the schema is still the huge schema representing all object types (so saving as Parquet, for example, I would end up with 1000+ empty columns unless I again had a predefined schema for each object type).

Any ideas other than:
1) pre-defining a schema per object type: the schemas are large and changing
2) splitting the source data by object type: I'm currently working with ~1k files per hour; after splitting I'd be working with ~50k files per hour
3) writing each object type out to disk as text and reading it back in as JSON: with repartitioning the file count could be reduced, but there would be more disk I/O

Any help would be appreciated.

Mike

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Inferring-JSON-schema-from-a-JSON-string-in-a-dataframe-column-tp24559.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
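[Editor's note: one approach that may fit on Spark 1.3 is to filter on the object-type column, pull the JSON strings out as an RDD[String], and pass that to SQLContext.jsonRDD, which infers the schema only from the rows it actually sees, so each per-type DataFrame gets its own narrow schema. A minimal sketch for spark-shell (Spark 1.3.x), where `sc` and `sqlContext` are predefined; the column names `json` and `objType` and the sample records are placeholders for the real data:]

```scala
// Sketch, assuming Spark 1.3.x spark-shell with sc / sqlContext predefined.
// "json" and "objType" are placeholder column names for the CSV-derived DataFrame.

// Stand-in for the CSV load: (json string, object type)
val raw = sc.parallelize(Seq(
  ("""{"id":1,"amount":9.99}""", "order"),
  ("""{"id":2,"name":"bob"}""",  "user")))
val df = sqlContext.createDataFrame(raw).toDF("json", "objType")

// Keep one object type, then re-infer the schema from just those rows.
val orderJson = df.filter(df("objType") === "order")
                  .select("json").rdd.map(_.getString(0))
val orderDf = sqlContext.jsonRDD(orderJson)  // schema inferred from "order" rows only
orderDf.printSchema()                        // only fields present in the "order" JSON
```

[Note that `jsonRDD` was deprecated in Spark 1.4 in favor of `sqlContext.read.json(rdd)`; either way the inference step costs an extra pass over the filtered data, but it avoids pre-defining a schema per object type.]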