Yes, SPARK-2376 has been fixed in master. Can you give it a try? Also, for inferSchema: because Python is dynamically typed, I agree with Davies that we should provide a way to scan a subset (or all) of the dataset to figure out the proper schema. We will take a look at it.

Thanks,
Yin
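(For context, here is a minimal sketch, not part of the original messages, of the round-trip workaround Nick describes further down: re-serialize an RDD of already-parsed Python dictionaries back to JSON text so that jsonRDD infers a schema covering the whole dataset. It assumes an existing SparkContext `sc`; the variable names are illustrative.)

    import json
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # An RDD of already-deserialized records (dicts); 'baz' is empty in the first one.
    records = sc.parallelize([{'foo': 'bar', 'baz': []},
                              {'foo': 'boom', 'baz': [1, 2, 3]}])

    # Round-trip through JSON text: jsonRDD scans the data and infers a schema that
    # covers every record, so the empty list does not break inference. As noted in
    # the thread, the extra serialization makes this relatively expensive.
    srdd = sqlContext.jsonRDD(records.map(json.dumps))
    srdd.printSchema()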
On Tue, Aug 5, 2014 at 12:20 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
> Assuming updating to master fixes the bug I was experiencing with jsonRDD and jsonFile, then pushing "sample" to master will probably not be necessary.
>
> We believe that the link below was the bug I experienced, and I've been told it is fixed in master.
>
> https://issues.apache.org/jira/browse/SPARK-2376
>
> best,
> -brad
>
>
> On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu <dav...@databricks.com> wrote:
>
>> This "sample" argument of inferSchema is still not in master; I will try to add it if it makes sense.
>>
>> On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
>> > Hi Davies,
>> >
>> > Thanks for the response and tips. Is the "sample" argument to inferSchema available in the 1.0.1 release of pyspark? I'm not sure (based on the documentation linked below) that it is.
>> > http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema
>> >
>> > It sounds like updating to master may help address my issue (and may also make the "sample" argument available), so I'm going to go ahead and do that.
>> >
>> > best,
>> > -Brad
>> >
>> >
>> > On Tue, Aug 5, 2014 at 12:01 PM, Davies Liu <dav...@databricks.com> wrote:
>> >>
>> >> On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> >> > I was just about to ask about this.
>> >> >
>> >> > Currently, there are two methods, sqlContext.jsonFile() and sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers the whole data set.
>> >> >
>> >> > For example:
>> >> >
>> >> > from pyspark.sql import SQLContext
>> >> > sqlContext = SQLContext(sc)
>> >> >
>> >> > >>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}', '{"foo":"boom", "baz":[1,2,3]}']))
>> >> > >>> a.printSchema()
>> >> > root
>> >> >  |-- baz: array (nullable = true)
>> >> >  |    |-- element: integer (containsNull = false)
>> >> >  |-- foo: string (nullable = true)
>> >> >
>> >> > It works really well! It handles fields with inconsistent value types by inferring a value type that covers all the possible values.
>> >> >
>> >> > But say you’ve already deserialized the JSON to do some pre-processing or filtering. You’d commonly want to do this, say, to remove bad data. So now you have an RDD of Python dictionaries, as opposed to an RDD of JSON strings. It would be perfect if you could get the completeness of the json...() methods, but against dictionaries.
>> >> >
>> >> > Unfortunately, as you noted, inferSchema() only looks at the first element in the set. Furthermore, inferring schemata from RDDs of dictionaries is being deprecated in favor of doing so from RDDs of Rows.
>> >> >
>> >> > I’m not sure what the intention behind this move is, but as a user I’d like to be able to convert RDDs of dictionaries directly to SchemaRDDs with the completeness of the jsonRDD()/jsonFile() methods. Right now if I really want that, I have to serialize the dictionaries to JSON text and then call jsonRDD(), which is expensive.
>> >>
>> >> Before the upcoming 1.1 release, we did not support nested structures via inferSchema; a nested dictionary will be a MapType.
>> >> This introduces an inconsistency for dictionaries: the top level will be a struct type (fields can be accessed by name), while nested levels will be MapType (accessed as a map).
>> >>
>> >> So deprecating the top-level dictionary is an attempt to resolve this inconsistency.
>> >>
>> >> The Row class in pyspark.sql has a similar interface to dict, so you can easily convert your dict into a Row:
>> >>
>> >> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
>> >>
>> >> In order to get the correct schema, do we need another argument to specify the number of rows to be inferred? Such as:
>> >>
>> >> inferSchema(rdd, sample=None)
>> >>
>> >> With sample=None it will take the first row; otherwise it will do the sampling to figure out the complete schema.
>> >>
>> >> Does this work for you?
>> >>
>> >> > Nick
>> >> >
>> >> >
>> >> > On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> I have a data set where each record is serialized using JSON, and I'm interested in using SchemaRDDs to work with the data. Unfortunately I've hit a snag since some fields in the data are maps and lists, and are not guaranteed to be populated for each record. This seems to cause inferSchema to throw an error:
>> >> >>
>> >> >> Produces error:
>> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]}, {'foo':'boom', 'baz':[1,2,3]}]))
>> >> >>
>> >> >> Works fine:
>> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]}, {'foo':'boom', 'baz':[]}]))
>> >> >>
>> >> >> To be fair, inferSchema says it "peeks at the first row", so a possible work-around would be to make sure the type of any collection can be determined using the first instance. However, I don't believe that items in an RDD are guaranteed to remain in order, so this approach seems somewhat brittle.
>> >> >>
>> >> >> Does anybody know a robust solution to this problem in PySpark? I am running the 1.0.1 release.
>> >> >>
>> >> >> -Brad
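(For reference, a minimal sketch, not part of the original messages, of the dict-to-Row conversion Davies suggests above. It assumes a Spark build where pyspark.sql.Row is available, i.e. master / the upcoming 1.1 at the time of this thread, plus an existing SparkContext `sc`; the variable names are illustrative.)

    from pyspark.sql import Row, SQLContext

    sqlContext = SQLContext(sc)

    # Put the record with the populated list first: inferSchema still peeks only
    # at the first row, so that row has to pin down the element type of 'baz'.
    records = sc.parallelize([{'foo': 'bar', 'baz': [1, 2, 3]},
                              {'foo': 'boom', 'baz': []}])

    # Converting each dict to a Row yields a top-level struct whose fields can be
    # accessed by name, instead of the deprecated dict-based inference.
    srdd = sqlContext.inferSchema(records.map(lambda d: Row(**d)))
    srdd.printSchema()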