Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

Reynold Xin Wed, 02 Mar 2016 11:55:47 -0800

I don't think that exists right now, but it's definitely a good option to
have. I myself have run into this issue a few times.


Can you create a JIRA ticket so we can track it? Would be even better if
you are interested in working on a patch! Thanks.


On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith <ewan.le...@realitymine.com>
wrote:

> Hi Reynold, yes that would be perfect for our use case.
>
> I assume it doesn't exist though, otherwise I really need to go re-read the 
> docs!
>
> Thanks to both of you for replying by the way, I know you must be hugely busy.
>
> Ewan
>
> Are you looking for a "relaxed" mode that simply returns nulls for fields
> that don't exist or have an incompatible schema?
>
>
> On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith <ewan.le...@realitymine.com>
> wrote:
>
>> Thanks Michael, it's not a great example really, as the data I'm working 
>> with has some source files that do fit the schema, and some that don't (out 
>> of millions that do work, perhaps 10 might not).
>>
>> In an ideal world for us the select would probably return the valid records 
>> only.
>>
>> We're trying out the new dataset APIs to see if we can do some pre-filtering 
>> that way.
>>
>> Thanks,
>> Ewan
>>
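
One way to approximate "return the valid records only" is sketched below. It assumes that a record counts as valid when both `id` and `name` under `onetype` are non-null, which is an assumption and not something stated in the thread. `validOnetypeRows` is a hypothetical helper name, not a Spark API. The sketch flattens the nested arrays with `explode` and then drops rows that contain nulls.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper (not part of Spark): flattens data -> stuff -> onetype
// into one row per onetype entry, then keeps only rows with no null column.
// Assumes the schema shown in this thread (data, stuff and onetype are arrays).
def validOnetypeRows(df: DataFrame): DataFrame =
  df.select(explode(col("data")).as("d"))         // one row per element of data
    .select(explode(col("d.stuff")).as("s"))      // one row per element of stuff
    .select(explode(col("s.onetype")).as("o"))    // one row per onetype struct
    .select(col("o.id").as("id"), col("o.name").as("name"))
    .na.drop()                                    // drop rows where id or name is null
```

With the names used elsewhere in this thread, this could be called as `validOnetypeRows(sqlContext.read.schema(s1schema).json(s2Rdd))`. It filters after the read, so it does not change how the reader itself treats mismatched fields.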
>> -dev +user
>>
>>> StructType(StructField(data,ArrayType(StructType(StructField(
>>> *stuff,ArrayType(*StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>> StructField(name,StringType,true)),true),true), StructField(othertype,
>>> ArrayType(StructType(StructField(company,StringType,true),
>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>>
>>
>> It's not a great error message, but as the schema above shows, stuff is
>> an array, not a struct.  So, you need to pick a particular element (using
>> []) before you can pull out a specific field.  It would be easier to see
>> this if you ran sqlContext.read.json(s1Rdd).printSchema(), which gives
>> you a tree view.  Try the following.
>>
>>
>> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")
>>
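
The indexing idea above can also be written with the Column API. This is a minimal sketch, assuming the schema printed in this thread. Unlike the string form suggested above, it indexes the outer `data` array as well as `stuff`, so it returns the `onetype` array of the first `stuff` entry of the first `data` entry. `selectFirstOnetype` is a hypothetical helper name.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Equivalent of data[0].stuff[0].onetype: pick an array element with getItem
// before pulling a struct field out with getField.
val firstOnetype: Column =
  col("data").getItem(0)
    .getField("stuff").getItem(0)
    .getField("onetype")

// Hypothetical helper (not part of Spark): selects that single nested value.
def selectFirstOnetype(df: DataFrame): DataFrame =
  df.select(firstOnetype.as("onetype"))
```

Selecting every element instead of a fixed index needs a different approach, such as `explode`, because an index only ever addresses one position in the array.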
>> On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith <ewan.le...@realitymine.com>
>> wrote:
>>
>>> When you create a dataframe using the *sqlContext.read.schema()* API,
>>> if you pass in a schema that’s compatible with some of the records but
>>> incompatible with others, it seems you can’t do a .select on the
>>> problematic columns; instead you get an AnalysisException.
>>>
>>>
>>>
>>> I know loading the wrong data isn’t good behaviour, but if you’re
>>> reading data from (for example) JSON files, there’s going to be malformed
>>> files along the way. I think it would be nice to handle this error in a
>>> nicer way, though I don’t know the best way to approach it.
>>>
>>>
>>>
>>> Before I raise a JIRA ticket about it, would people consider this to be
>>> a bug or expected behaviour?
>>>
>>>
>>>
>>> I’ve attached a couple of sample JSON files and the steps below to
>>> reproduce it, by taking the inferred schema from the simple1.json file, and
>>> applying it to a union of simple1.json and simple2.json. You can visually
>>> see the data has been parsed as I think you’d want if you do a .select on
>>> the parent column and print out the output, but when you do a select on the
>>> problem column you instead get an exception.
>>>
>>>
>>>
>>> *scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x =>
>>> x._2)*
>>>
>>> s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map
>>> at <console>:27
>>>
>>>
>>>
>>> *scala> val s1schema = sqlContext.read.json(s1Rdd).schema*
>>>
>>> s1schema: org.apache.spark.sql.types.StructType =
>>> StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>> StructField(name,StringType,true)),true),true),
>>> StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>>>
>>>
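
The transcript uses `s2Rdd` without showing how it was built. Going by the description of "a union of simple1.json and simple2.json", it may have looked something like the sketch below. This is a guess: `wholeFiles` is a hypothetical helper, and the path `/tmp/simple2.json` is assumed by analogy with `/tmp/simple1.json`.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): reads each path as one whole-file
// string, as the s1Rdd step does, and unions the results into a single RDD.
def wholeFiles(sc: SparkContext, paths: Seq[String]): RDD[String] =
  sc.union(paths.map(p => sc.wholeTextFiles(p).map(_._2)))

// Assumed usage in the shell; the second path is not given in the thread:
// val s2Rdd = wholeFiles(sc, Seq("/tmp/simple1.json", "/tmp/simple2.json"))
```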
>>>
>>> *scala>
>>> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)*
>>>
>>> [WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don
>>> Joeh]),null], [null,WrappedArray([ACME,2])]))]
>>>
>>> [WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])],
>>> [WrappedArray([2,null]),null]))]
>>>
>>>
>>>
>>> *scala>
>>> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")*
>>>
>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>> 'data.stuff[onetype]' due to data type mismatch: argument 2 requires
>>> integral type, however, 'onetype' is of string type.;
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>
>>>                 at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>
>>>
>>>
>>> (The full exception is attached too).
>>>
>>>
>>>
>>> What do people think, is this a bug?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Ewan
>>>
>>>
>>>
>>
>>
>
