On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> I was just about to ask about this.
>
> Currently, there are two methods, sqlContext.jsonFile() and
> sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers
> the whole data set.
>
> For example:
>
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
>
>>>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}',
>>>> '{"foo":"boom", "baz":[1,2,3]}']))
>>>> a.printSchema()
> root
>  |-- baz: array (nullable = true)
>  |    |-- element: integer (containsNull = false)
>  |-- foo: string (nullable = true)
>
> It works really well! It handles fields with inconsistent value types by
> inferring a value type that covers all the possible values.
>
> But say you’ve already deserialized the JSON to do some pre-processing or
> filtering. You’d commonly want to do this, say, to remove bad data. So now
> you have an RDD of Python dictionaries, as opposed to an RDD of JSON
> strings. It would be perfect if you could get the completeness of the
> json...() methods, but against dictionaries.
>
> Unfortunately, as you noted, inferSchema() only looks at the first element
> in the set. Furthermore, inferring schemata from RDDs of dictionaries is
> being deprecated in favor of doing so from RDDs of Rows.
>
> I’m not sure what the intention behind this move is, but as a user I’d like
> to be able to convert RDDs of dictionaries directly to SchemaRDDs with the
> completeness of the jsonRDD()/jsonFile() methods. Right now if I really want
> that, I have to serialize the dictionaries to JSON text and then call
> jsonRDD(), which is expensive.
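>
> For reference, that workaround looks roughly like this (a rough sketch,
> reusing the example above):
>
> import json
> dicts = sc.parallelize([{'foo': 'bar', 'baz': []},
>                         {'foo': 'boom', 'baz': [1, 2, 3]}])
> # Round-trip through JSON text just so jsonRDD() can scan every record.
> srdd = sqlContext.jsonRDD(dicts.map(json.dumps))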

Before the upcoming 1.1 release, we did not support nested structures
via inferSchema: a nested dictionary is inferred as a MapType. This
introduces an inconsistency for dictionaries: the top-level dictionary
becomes a struct type (fields can be accessed by name), but nested
dictionaries become MapType (accessed as a map).

So deprecating top-level dictionaries is an attempt to resolve this
kind of inconsistency.
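
For example (a rough sketch of the pre-1.1 behavior described above; the
field names here are made up):

srdd = ctx.inferSchema(sc.parallelize([{'name': 'a', 'props': {'x': 1}}]))
# 'name' is inferred as a top-level struct field (accessed by name), while
# 'props' is inferred as a MapType (accessed as a map), not a nested struct.
srdd.printSchema()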

The Row class in pyspark.sql has an interface similar to dict, so you
can easily convert your dicts into Rows:

ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
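
For example (a minimal, self-contained sketch, assuming ctx is a SQLContext
and reusing the keys from the example above):

from pyspark.sql import Row

rdd_of_dict = sc.parallelize([{'foo': 'bar', 'baz': [1, 2, 3]},
                              {'foo': 'boom', 'baz': []}])
# Convert each dict into a Row, then infer the schema from the Rows.
srdd = ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
srdd.printSchema()

Note that inferSchema still only peeks at the first Row, so the first record
needs to have its types populated; that is what the sampling idea below is
meant to address.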

In order to get the correct schema, we would need another argument to
specify how many rows the schema should be inferred from. Such as:

inferSchema(rdd, sample=None)

With sample=None it will only take the first row; otherwise it will
sample the rows to figure out the complete schema.
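
For example, usage might look something like this (purely illustrative,
since the sample argument does not exist yet and its exact semantics are
still open):

# default: only peek at the first row, as today
srdd = ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))

# with sampling enabled: scan a sample of the rows and merge the inferred
# types into one complete schema
srdd = ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)), sample=100)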

Does this work for you?

> Nick
>
>
>
> On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
>>
>> Hi All,
>>
>> I have a data set where each record is serialized using JSON, and I'm
>> interested in using SchemaRDDs to work with the data.  Unfortunately I've
>> hit a snag since some fields in the data are maps and lists, and they are
>> not guaranteed to be populated for each record.  This seems to cause
>> inferSchema to throw an error:
>>
>> Produces error:
>> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
>> {'foo':'boom', 'baz':[1,2,3]}]))
>>
>> Works fine:
>> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]},
>> {'foo':'boom', 'baz':[]}]))
>>
>> To be fair, inferSchema says it "peeks at the first row", so a possible
>> work-around would be to make sure the type of every collection can be
>> determined from the first record.  However, I don't believe that items in
>> an RDD are guaranteed to remain in order, so this approach seems somewhat
>> brittle.
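>>
>> For concreteness, the kind of thing I mean would be a rough sketch like
>> this (dict_rdd stands in for my RDD of deserialized records; it assumes at
>> least one fully-populated record exists and stays first after the union):
>>
>> full = sc.parallelize(dict_rdd.filter(lambda d: d['baz']).take(1))
>> srdd = sqlCtx.inferSchema(full.union(dict_rdd))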
>>
>> Does anybody know a robust solution to this problem in PySpark?  I am
>> running the 1.0.1 release.
>>
>> -Brad
>>
>
