This "sample" argument of inferSchema is still no in master, if will try to add it if it make sense.
On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
> Hi Davies,
>
> Thanks for the response and tips. Is the "sample" argument to inferSchema
> available in the 1.0.1 release of pyspark? I'm not sure (based on the
> documentation linked below) that it is.
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema
>
> It sounds like updating to master may help address my issue (and may also
> make the "sample" argument available), so I'm going to go ahead and do that.
>
> best,
> -Brad
>
>
> On Tue, Aug 5, 2014 at 12:01 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > I was just about to ask about this.
>> >
>> > Currently, there are two methods, sqlContext.jsonFile() and
>> > sqlContext.jsonRDD(), that work on JSON text and infer a schema that
>> > covers the whole data set.
>> >
>> > For example:
>> >
>> > from pyspark.sql import SQLContext
>> > sqlContext = SQLContext(sc)
>> >
>> >>>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}',
>> >>>> '{"foo":"boom", "baz":[1,2,3]}']))
>> >>>> a.printSchema()
>> > root
>> >  |-- baz: array (nullable = true)
>> >  |    |-- element: integer (containsNull = false)
>> >  |-- foo: string (nullable = true)
>> >
>> > It works really well! It handles fields with inconsistent value types by
>> > inferring a value type that covers all the possible values.
>> >
>> > But say you've already deserialized the JSON to do some pre-processing
>> > or filtering. You'd commonly want to do this, say, to remove bad data.
>> > So now you have an RDD of Python dictionaries, as opposed to an RDD of
>> > JSON strings. It would be perfect if you could get the completeness of
>> > the json...() methods, but against dictionaries.
>> >
>> > Unfortunately, as you noted, inferSchema() only looks at the first
>> > element in the set. Furthermore, inferring schemata from RDDs of
>> > dictionaries is being deprecated in favor of doing so from RDDs of Rows.
>> >
>> > I'm not sure what the intention behind this move is, but as a user I'd
>> > like to be able to convert RDDs of dictionaries directly to SchemaRDDs
>> > with the completeness of the jsonRDD()/jsonFile() methods. Right now if
>> > I really want that, I have to serialize the dictionaries to JSON text
>> > and then call jsonRDD(), which is expensive.
>>
>> Before the upcoming 1.1 release, we did not support nested structures via
>> inferSchema; a nested dictionary is inferred as MapType. This introduces
>> an inconsistency for dictionaries: the top level becomes a structure type
>> (fields can be accessed by name), but nested dictionaries become MapType
>> (accessed as a map).
>>
>> Deprecating top-level dictionaries is an attempt to resolve this
>> inconsistency.
>>
>> The Row class in pyspark.sql has a similar interface to dict, so you can
>> easily convert your dict into a Row:
>>
>> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
>>
>> In order to get the correct schema, do we need another argument to
>> specify the number of rows to be sampled for inference? Something like:
>>
>> inferSchema(rdd, sample=None)
>>
>> With sample=None, it will take only the first row; otherwise it will
>> sample the given number of rows to figure out the complete schema.
>>
>> Does this work for you?
>>
>> > Nick
>> >
>> >
>> > On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller <bmill...@eecs.berkeley.edu>
>> > wrote:
>> >>
>> >> Hi All,
>> >>
>> >> I have a data set where each record is serialized using JSON, and I'm
>> >> interested in using SchemaRDDs to work with the data. Unfortunately
>> >> I've hit a snag: some fields in the data are maps and lists, and are
>> >> not guaranteed to be populated for each record. This seems to cause
>> >> inferSchema to throw an error.
>> >>
>> >> Produces an error:
>> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
>> >> {'foo':'boom', 'baz':[1,2,3]}]))
>> >>
>> >> Works fine:
>> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]},
>> >> {'foo':'boom', 'baz':[]}]))
>> >>
>> >> To be fair, inferSchema says it "peeks at the first row", so a possible
>> >> work-around would be to make sure the type of any collection can be
>> >> determined from the first instance. However, I don't believe that items
>> >> in an RDD are guaranteed to stay in order, so this approach seems
>> >> somewhat brittle.
>> >>
>> >> Does anybody know a robust solution to this problem in PySpark? I'm
>> >> running the 1.0.1 release.
>> >>
>> >> -Brad
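A minimal, self-contained sketch of the two workarounds discussed in this
thread: re-serializing the dictionaries for jsonRDD() (the complete but
expensive route Nick mentions, which should work on 1.0.1), and converting
dicts to Rows before inferSchema (Davies' suggestion, which assumes a build
where pyspark.sql.Row accepts keyword arguments, i.e. 1.1-era master). The
app name and record contents are illustrative.

import json
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="schema-inference-workarounds")  # illustrative name
sqlCtx = SQLContext(sc)

# Dictionaries ordered so the first record has an empty list -- the exact
# case that trips inferSchema when it peeks at only the first row.
dicts = sc.parallelize([{'foo': 'bar', 'baz': []},
                        {'foo': 'boom', 'baz': [1, 2, 3]}])

# Workaround 1 (1.0.x-compatible): re-serialize to JSON text and let
# jsonRDD() scan the whole data set to infer a covering schema. Complete,
# but pays the serialization cost twice.
srdd1 = sqlCtx.jsonRDD(dicts.map(json.dumps))
srdd1.printSchema()

# Workaround 2 (assumes Row(**kwargs) support): convert each dict to a Row
# so the top level is a structure type. inferSchema still peeks at only the
# first row, so a fully populated record is placed first here on purpose.
rows = sc.parallelize([{'foo': 'bar', 'baz': [1, 2, 3]},
                       {'foo': 'boom', 'baz': [4, 5]}])
srdd2 = sqlCtx.inferSchema(rows.map(lambda d: Row(**d)))
srdd2.printSchema()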