Yes, SPARK-2376 has been fixed in master. Can you give it a try? Also, for inferSchema: because Python is dynamically typed, I agree with Davies that we should provide a way to scan a subset (or all) of the dataset to figure out the proper schema. We will take a look at it.

Thanks,
Yin
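(For context, here is a minimal sketch, not part of the original messages, of the round-trip workaround Nick describes further down: re-serialize an RDD of already-parsed Python dictionaries back to JSON text so that jsonRDD infers a schema covering the whole dataset. It assumes an existing SparkContext `sc`; the variable names are illustrative.)

    import json
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # An RDD of already-deserialized records (dicts); 'baz' is empty in the first one.
    records = sc.parallelize([{'foo': 'bar', 'baz': []},
                              {'foo': 'boom', 'baz': [1, 2, 3]}])

    # Round-trip through JSON text: jsonRDD scans the data and infers a schema that
    # covers every record, so the empty list does not break inference. As noted in
    # the thread, the extra serialization makes this relatively expensive.
    srdd = sqlContext.jsonRDD(records.map(json.dumps))
    srdd.printSchema()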
On Tue, Aug 5, 2014 at 12:20 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
> Assuming updating to master fixes the bug I was experiencing with jsonRDD and jsonFile, then pushing "sample" to master will probably not be necessary.
>
> We believe that the link below was the bug I experienced, and I've been told it is fixed in master.
>
> https://issues.apache.org/jira/browse/SPARK-2376
>
> best,
> -brad
>
>
> On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu <dav...@databricks.com> wrote:
>
>> This "sample" argument of inferSchema is still not in master; I will try to add it if it makes sense.
>>
>> On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
>> > Hi Davies,
>> >
>> > Thanks for the response and tips. Is the "sample" argument to inferSchema available in the 1.0.1 release of pyspark? I'm not sure (based on the documentation linked below) that it is.
>> > http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema
>> >
>> > It sounds like updating to master may help address my issue (and may also make the "sample" argument available), so I'm going to go ahead and do that.
>> >
>> > best,
>> > -Brad
>> >
>> >
>> > On Tue, Aug 5, 2014 at 12:01 PM, Davies Liu <dav...@databricks.com> wrote:
>> >>
>> >> On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> >> > I was just about to ask about this.
>> >> >
>> >> > Currently, there are two methods, sqlContext.jsonFile() and sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers the whole data set.
>> >> >
>> >> > For example:
>> >> >
>> >> > from pyspark.sql import SQLContext
>> >> > sqlContext = SQLContext(sc)
>> >> >
>> >> > >>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}', '{"foo":"boom", "baz":[1,2,3]}']))
>> >> > >>> a.printSchema()
>> >> > root
>> >> >  |-- baz: array (nullable = true)
>> >> >  |    |-- element: integer (containsNull = false)
>> >> >  |-- foo: string (nullable = true)
>> >> >
>> >> > It works really well! It handles fields with inconsistent value types by inferring a value type that covers all the possible values.
>> >> >
>> >> > But say you’ve already deserialized the JSON to do some pre-processing or filtering. You’d commonly want to do this, say, to remove bad data. So now you have an RDD of Python dictionaries, as opposed to an RDD of JSON strings. It would be perfect if you could get the completeness of the json...() methods, but against dictionaries.
>> >> >
>> >> > Unfortunately, as you noted, inferSchema() only looks at the first element in the set. Furthermore, inferring schemata from RDDs of dictionaries is being deprecated in favor of doing so from RDDs of Rows.
>> >> >
>> >> > I’m not sure what the intention behind this move is, but as a user I’d like to be able to convert RDDs of dictionaries directly to SchemaRDDs with the completeness of the jsonRDD()/jsonFile() methods. Right now if I really want that, I have to serialize the dictionaries to JSON text and then call jsonRDD(), which is expensive.
>> >>
>> >> Before the upcoming 1.1 release, we did not support nested structures via inferSchema; a nested dictionary will be a MapType.
>> >> This introduces an inconsistency for dictionaries: the top level will be a struct type (fields can be accessed by name), while nested levels will be MapType (accessed as a map).
>> >>
>> >> So deprecating the top-level dictionary is an attempt to resolve this inconsistency.
>> >>
>> >> The Row class in pyspark.sql has a similar interface to dict, so you can easily convert your dict into a Row:
>> >>
>> >> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
>> >>
>> >> In order to get the correct schema, do we need another argument to specify the number of rows to be inferred? Such as:
>> >>
>> >> inferSchema(rdd, sample=None)
>> >>
>> >> With sample=None it will take the first row; otherwise it will do the sampling to figure out the complete schema.
>> >>
>> >> Does this work for you?
>> >>
>> >> > Nick
>> >> >
>> >> >
>> >> > On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> I have a data set where each record is serialized using JSON, and I'm interested in using SchemaRDDs to work with the data. Unfortunately I've hit a snag since some fields in the data are maps and lists, and are not guaranteed to be populated for each record. This seems to cause inferSchema to throw an error:
>> >> >>
>> >> >> Produces error:
>> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]}, {'foo':'boom', 'baz':[1,2,3]}]))
>> >> >>
>> >> >> Works fine:
>> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]}, {'foo':'boom', 'baz':[]}]))
>> >> >>
>> >> >> To be fair, inferSchema says it "peeks at the first row", so a possible work-around would be to make sure the type of any collection can be determined using the first instance. However, I don't believe that items in an RDD are guaranteed to remain in order, so this approach seems somewhat brittle.
>> >> >>
>> >> >> Does anybody know a robust solution to this problem in PySpark? I am running the 1.0.1 release.
>> >> >>
>> >> >> -Brad
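(For reference, a minimal sketch, not part of the original messages, of the dict-to-Row conversion Davies suggests above. It assumes a Spark build where pyspark.sql.Row is available, i.e. master / the upcoming 1.1 at the time of this thread, plus an existing SparkContext `sc`; the variable names are illustrative.)

    from pyspark.sql import Row, SQLContext

    sqlContext = SQLContext(sc)

    # Put the record with the populated list first: inferSchema still peeks only
    # at the first row, so that row has to pin down the element type of 'baz'.
    records = sc.parallelize([{'foo': 'bar', 'baz': [1, 2, 3]},
                              {'foo': 'boom', 'baz': []}])

    # Converting each dict to a Row yields a top-level struct whose fields can be
    # accessed by name, instead of the deprecated dict-based inference.
    srdd = sqlContext.inferSchema(records.map(lambda d: Row(**d)))
    srdd.printSchema()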