Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

Michael Armbrust Wed, 02 Mar 2016 10:08:56 -0800

-dev +user

StructType(StructField(data,ArrayType(StructType(StructField(
> *stuff,ArrayType(*StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
> StructField(name,StringType,true)),true),true), StructField(othertype,
> ArrayType(StructType(StructField(company,StringType,true),
> StructField(id,LongType,true)),true),true)),true),true)),true),true))



Its not a great error message, but as the schema above shows, stuff is an
array, not a struct.  So, you need to pick a particular element (using [])
before you can pull out a specific field.  It would be easier to see this
if you ran sqlContext.read.json(s1Rdd).printSchema(), which gives you a
tree view.  Try the following.

sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")

On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith <ewan.le...@realitymine.com>
wrote:

> When you create a dataframe using the *sqlContext.read.schema()* API, if
> you pass in a schema that’s compatible with some of the records, but
> incompatible with others, it seems you can’t do a .select on the
> problematic columns, instead you get an AnalysisException error.
>
>
>
> I know loading the wrong data isn’t good behaviour, but if you’re reading
> data from (for example) JSON files, there’s going to be malformed files
> along the way. I think it would be nice to handle this error in a nicer
> way, though I don’t know the best way to approach it.
>
>
>
> Before I raise a JIRA ticket about it, would people consider this to be a
> bug or expected behaviour?
>
>
>
> I’ve attached a couple of sample JSON files and the steps below to
> reproduce it, by taking the inferred schema from the simple1.json file, and
> applying it to a union of simple1.json and simple2.json. You can visually
> see the data has been parsed as I think you’d want if you do a .select on
> the parent column and print out the output, but when you do a select on the
> problem column you instead get an exception.
>
>
>
> *scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)*
>
> s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at
> <console>:27
>
>
>
> *scala> val s1schema = sqlContext.read.json(s1Rdd).schema*
>
> s1schema: org.apache.spark.sql.types.StructType =
> StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
> StructField(name,StringType,true)),true),true),
> StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>
>
>
> *scala>
> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)*
>
> [WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don
> Joeh]),null], [null,WrappedArray([ACME,2])]))]
>
> [WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])],
> [WrappedArray([2,null]),null]))]
>
>
>
> *scala>
> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")*
>
> org.apache.spark.sql.AnalysisException: cannot resolve
> 'data.stuff[onetype]' due to data type mismatch: argument 2 requires
> integral type, however, 'onetype' is of string type.;
>
>                 at
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>
>                 at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>
>                 at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>
>                 at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>
>                 at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>
>
>
> (The full exception is attached too).
>
>
>
> What do people think, is this a bug?
>
>
>
> Thanks,
>
> Ewan
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

Reply via email to