Note that if you specify the schema you expect when reading JSON, you
basically get the "relaxed" mode you are asking for: records that don't
match will end up with nulls.
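
For example, a minimal sketch of that approach (the mixed-input path and
the final filter are hypothetical, just to illustrate the idea):

// Infer the schema from files you know are good, then apply it
// explicitly when reading the mixed input.
val goodSchema = sqlContext.read.json("/tmp/simple1.json").schema
val df = sqlContext.read.schema(goodSchema).json("/tmp/mixed/*.json")

// Records that don't match goodSchema come back with nulls in the
// mismatched fields, so they can be filtered out afterwards.
val valid = df.filter(df("data").isNotNull)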

The problem here is that Spark SQL knows the operation you are asking for
is invalid given the data you let it infer the schema from, so it won't
even let you try.

On Wed, Mar 2, 2016 at 11:31 AM, Reynold Xin <r...@databricks.com> wrote:

> Are you looking for a "relaxed" mode that simply returns nulls for fields
> that don't exist or have an incompatible schema?
>
>
> On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith <ewan.le...@realitymine.com>
> wrote:
>
>> Thanks Michael, it's not a great example really, as the data I'm working
>> with has some source files that fit the schema and some that don't (out
>> of millions that do work, perhaps 10 might not).
>>
>> In an ideal world, for us the select would probably return only the
>> valid records.
>>
>> We're trying out the new dataset APIs to see if we can do some pre-filtering 
>> that way.
>>
>> Thanks,
>> Ewan
>>
>> -dev +user
>>
>>> StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>> StructField(name,StringType,true)),true),true), StructField(othertype,
>>> ArrayType(StructType(StructField(company,StringType,true),
>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>>
>>
>> It's not a great error message, but as the schema above shows, stuff is
>> an array, not a struct.  So you need to pick a particular element (using
>> []) before you can pull out a specific field.  It would be easier to see
>> this if you ran sqlContext.read.json(s1Rdd).printSchema(), which gives
>> you a tree view.  Try the following.
>>
>>
>> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")
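>>
>> For reference, a sketch of the tree view printSchema would give for the
>> schema above (reconstructed from that StructType, so roughly):
>>
>> root
>>  |-- data: array (nullable = true)
>>  |    |-- element: struct (containsNull = true)
>>  |    |    |-- stuff: array (nullable = true)
>>  |    |    |    |-- element: struct (containsNull = true)
>>  |    |    |    |    |-- onetype: array (nullable = true)
>>  |    |    |    |    |    |-- element: struct (containsNull = true)
>>  |    |    |    |    |    |    |-- id: long (nullable = true)
>>  |    |    |    |    |    |    |-- name: string (nullable = true)
>>  |    |    |    |    |-- othertype: array (nullable = true)
>>  |    |    |    |    |    |-- element: struct (containsNull = true)
>>  |    |    |    |    |    |    |-- company: string (nullable = true)
>>  |    |    |    |    |    |    |-- id: long (nullable = true)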
>>
>> On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith <ewan.le...@realitymine.com>
>> wrote:
>>
>>> When you create a dataframe using the *sqlContext.read.schema()* API,
>>> if you pass in a schema that’s compatible with some of the records but
>>> incompatible with others, it seems you can’t do a .select on the
>>> problematic columns; instead you get an AnalysisException.
>>>
>>>
>>>
>>> I know loading the wrong data isn’t good behaviour, but if you’re
>>> reading data from (for example) JSON files, there are going to be
>>> malformed files along the way. I think it would be nice to handle this
>>> error more gracefully, though I don’t know the best way to approach it.
>>>
>>>
>>>
>>> Before I raise a JIRA ticket about it, would people consider this to be
>>> a bug or expected behaviour?
>>>
>>>
>>>
>>> I’ve attached a couple of sample JSON files and the steps below to
>>> reproduce it: take the inferred schema from the simple1.json file and
>>> apply it to a union of simple1.json and simple2.json. You can see the
>>> data has been parsed as I think you’d want if you do a .select on the
>>> parent column and print out the output, but when you do a select on the
>>> problem column you instead get an exception.
>>>
>>>
>>>
>>> *scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x =>
>>> x._2)*
>>>
>>> s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map
>>> at <console>:27
>>>
>>>
>>>
>>> *scala> val s1schema = sqlContext.read.json(s1Rdd).schema*
>>>
>>> s1schema: org.apache.spark.sql.types.StructType =
>>> StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>> StructField(name,StringType,true)),true),true),
>>> StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
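>>>
>>> (s2Rdd isn’t shown in this transcript; presumably it was built the same
>>> way as s1Rdd but over the union of both sample files, something along
>>> the lines of the hypothetical line below.)
>>>
>>> val s2Rdd = sc.wholeTextFiles("/tmp/simple1.json").union(sc.wholeTextFiles("/tmp/simple2.json")).map(x => x._2)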
>>>
>>>
>>>
>>> *scala>
>>> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)*
>>>
>>> [WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don
>>> Joeh]),null], [null,WrappedArray([ACME,2])]))]
>>>
>>> [WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])],
>>> [WrappedArray([2,null]),null]))]
>>>
>>>
>>>
>>> *scala>
>>> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")*
>>>
>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>> 'data.stuff[onetype]' due to data type mismatch: argument 2 requires
>>> integral type, however, 'onetype' is of string type.;
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>
>>>
>>>
>>> (The full exception is attached too).
>>>
>>>
>>>
>>> What do people think, is this a bug?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Ewan
>>>
>>>
>>>
>>
>>
>
