I don't think that exists right now, but it's definitely a good option to have. I myself have run into this issue a few times.
Can you create a JIRA ticket so we can track it? Would be even better if you are interested in working on a patch! Thanks.

On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:

> Hi Reynold, yes that would be perfect for our use case.
>
> I assume it doesn't exist though, otherwise I really need to go re-read the
> docs!
>
> Thanks to both of you for replying by the way, I know you must be hugely busy.
>
> Ewan
>
> Are you looking for a "relaxed" mode that simply returns nulls for fields
> that don't exist or have an incompatible schema?
>
> On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith <ewan.le...@realitymine.com>
> wrote:
>
>> Thanks Michael, it's not a great example really, as the data I'm working
>> with has some source files that do fit the schema, and some that don't (out
>> of millions that do work, perhaps 10 might not).
>>
>> In an ideal world for us the select would probably return the valid records
>> only.
>>
>> We're trying out the new dataset APIs to see if we can do some pre-filtering
>> that way.
>>
>> Thanks,
>> Ewan
>>
>> -dev +user
>>
>>> StructType(StructField(data,ArrayType(StructType(StructField(
>>> *stuff,ArrayType(*StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>> StructField(name,StringType,true)),true),true), StructField(othertype,
>>> ArrayType(StructType(StructField(company,StringType,true),
>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>>
>> It's not a great error message, but as the schema above shows, stuff is
>> an array, not a struct. So, you need to pick a particular element (using
>> []) before you can pull out a specific field. It would be easier to see
>> this if you ran sqlContext.read.json(s1Rdd).printSchema(), which gives
>> you a tree view. Try the following.
>>
>> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")
>>
>> On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith <ewan.le...@realitymine.com>
>> wrote:
>>
>>> When you create a dataframe using the *sqlContext.read.schema()* API,
>>> if you pass in a schema that’s compatible with some of the records, but
>>> incompatible with others, it seems you can’t do a .select on the
>>> problematic columns; instead you get an AnalysisException error.
>>>
>>> I know loading the wrong data isn’t good behaviour, but if you’re
>>> reading data from (for example) JSON files, there are going to be malformed
>>> files along the way. I think it would be nice to handle this error in a
>>> nicer way, though I don’t know the best way to approach it.
>>>
>>> Before I raise a JIRA ticket about it, would people consider this to be
>>> a bug or expected behaviour?
>>>
>>> I’ve attached a couple of sample JSON files and the steps below to
>>> reproduce it, by taking the inferred schema from the simple1.json file, and
>>> applying it to a union of simple1.json and simple2.json. You can visually
>>> see the data has been parsed as I think you’d want if you do a .select on
>>> the parent column and print out the output, but when you do a select on the
>>> problem column you instead get an exception.
>>>
>>> *scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)*
>>>
>>> s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at <console>:27
>>>
>>> *scala> val s1schema = sqlContext.read.json(s1Rdd).schema*
>>>
>>> s1schema: org.apache.spark.sql.types.StructType =
>>> StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>> StructField(name,StringType,true)),true),true),
>>> StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>>>
>>> *scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)*
>>>
>>> [WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don Joeh]),null], [null,WrappedArray([ACME,2])]))]
>>>
>>> [WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], [WrappedArray([2,null]),null]))]
>>>
>>> *scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")*
>>>
>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>> 'data.stuff[onetype]' due to data type mismatch: argument 2 requires
>>> integral type, however, 'onetype' is of string type.;
>>>
>>> at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>>
>>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>>>
>>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>>>
>>> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>
>>> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>
>>> (The full exception is attached too).
>>>
>>> What do people think, is this a bug?
>>>
>>> Thanks,
>>>
>>> Ewan
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
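
For reference, the point made in the thread (that both data and stuff are arrays, so a field under them cannot be reached by a plain dotted path) can be sketched in a Spark 1.6-era spark-shell, where sc and sqlContext are predefined. The JSON string below is a hypothetical stand-in for the attached simple1.json, not the real file, and using explode to unnest one array level at a time is just one possible approach, not something proposed in the thread. It does not provide the "relaxed" mode discussed above.

```scala
// Assumes a Spark 1.6-era spark-shell, where `sc` and `sqlContext` already exist.
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical sample record, shaped like the schema quoted in the thread.
val sample = """{"data":[{"stuff":[{"onetype":[{"id":1,"name":"John Doe"},{"id":2,"name":"Don Joeh"}]},{"othertype":[{"company":"ACME","id":2}]}]}]}"""
val sampleRdd = sc.parallelize(Seq(sample))
val df = sqlContext.read.json(sampleRdd)

// Tree view of the inferred schema: `data` and `stuff` both show up as arrays.
df.printSchema()

// Unnest one array level per step, so each later step addresses a plain struct.
val dataRows  = df.select(explode(col("data")).as("d"))
val stuffRows = dataRows.select(explode(col("d.stuff")).as("s"))

// `s` is now a struct, so its fields resolve by name.
val onetype = stuffRows.select(col("s.onetype"))
onetype.show()
```

Each row of stuffRows corresponds to one element of a stuff array, so elements that lack onetype simply show null in that column.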