Thanks. Once you create the JIRA, just reply to this email with the link.

On Wednesday, March 2, 2016, Ewan Leith <ewan.le...@realitymine.com> wrote:
> Thanks, I'll create the JIRA for it. Happy to help contribute to a patch if
> we can, not sure if my own Scala skills will be up to it but perhaps one of
> my colleagues' will :)
>
> Ewan
>
> I don't think that exists right now, but it's definitely a good option to
> have. I myself have run into this issue a few times.
>
> Can you create a JIRA ticket so we can track it? Would be even better if
> you are interested in working on a patch! Thanks.
>
>
> On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
>
>> Hi Reynold, yes that would be perfect for our use case.
>>
>> I assume it doesn't exist though, otherwise I really need to go re-read the
>> docs!
>>
>> Thanks to both of you for replying by the way, I know you must be hugely
>> busy.
>>
>> Ewan
>>
>> Are you looking for a "relaxed" mode that simply returns nulls for fields
>> that don't exist or have an incompatible schema?
>>
>>
>> On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
>>
>>> Thanks Michael, it's not a great example really, as the data I'm working
>>> with has some source files that do fit the schema, and some that don't (out
>>> of millions that do work, perhaps 10 might not).
>>>
>>> In an ideal world for us the select would probably return the valid records
>>> only.
>>>
>>> We're trying out the new dataset APIs to see if we can do some
>>> pre-filtering that way.
>>>
>>> Thanks,
>>> Ewan
>>>
>>> -dev +user
>>>
>>> StructType(StructField(data,ArrayType(StructType(StructField(
>>>> *stuff,ArrayType(*StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>>> StructField(name,StringType,true)),true),true), StructField(othertype,
>>>> ArrayType(StructType(StructField(company,StringType,true),
>>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>>>
>>>
>>> It's not a great error message, but as the schema above shows, stuff is
>>> an array, not a struct. So, you need to pick a particular element (using
>>> []) before you can pull out a specific field. It would be easier to see
>>> this if you ran sqlContext.read.json(s1Rdd).printSchema(), which gives
>>> you a tree view. Try the following.
>>>
>>>
>>> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")
>>>
>>> On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
>>>
>>>> When you create a dataframe using the *sqlContext.read.schema()* API,
>>>> if you pass in a schema that’s compatible with some of the records, but
>>>> incompatible with others, it seems you can’t do a .select on the
>>>> problematic columns, instead you get an AnalysisException error.
>>>>
>>>> I know loading the wrong data isn’t good behaviour, but if you’re
>>>> reading data from (for example) JSON files, there’s going to be malformed
>>>> files along the way. I think it would be nice to handle this error in a
>>>> nicer way, though I don’t know the best way to approach it.
>>>>
>>>> Before I raise a JIRA ticket about it, would people consider this to be
>>>> a bug or expected behaviour?
>>>>
>>>> I’ve attached a couple of sample JSON files and the steps below to
>>>> reproduce it, by taking the inferred schema from the simple1.json file, and
>>>> applying it to a union of simple1.json and simple2.json.
>>>> You can visually see the data has been parsed as I think you’d want if
>>>> you do a .select on the parent column and print out the output, but when
>>>> you do a select on the problem column you instead get an exception.
>>>>
>>>> scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)
>>>>
>>>> s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at <console>:27
>>>>
>>>> scala> val s1schema = sqlContext.read.json(s1Rdd).schema
>>>>
>>>> s1schema: org.apache.spark.sql.types.StructType =
>>>> StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>>> StructField(name,StringType,true)),true),true),
>>>> StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
>>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>>>>
>>>> scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)
>>>>
>>>> [WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don Joeh]),null], [null,WrappedArray([ACME,2])]))]
>>>>
>>>> [WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], [WrappedArray([2,null]),null]))]
>>>>
>>>> scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")
>>>>
>>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>>> 'data.stuff[onetype]' due to data type mismatch: argument 2 requires
>>>> integral type, however, 'onetype' is of string type.;
>>>>
>>>> at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>>>
>>>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>>>>
>>>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>>>>
>>>> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>>
>>>> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>>
>>>> (The full exception is attached too).
>>>>
>>>> What do people think, is this a bug?
>>>>
>>>> Thanks,
>>>>
>>>> Ewan
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
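
A note on the workaround discussed in the thread: the sketch below is not from the original messages. It shows one possible way to reach `onetype` when both `data` and `stuff` are arrays. It swaps in `explode` for the `[0]` index suggested in the thread, so that every array element is visited and not only the first. The object name `NestedArraySelect`, the inline sample record, and the `local[*]` master are assumptions made for illustration; the attached simple1.json and simple2.json are not reproduced here. It also assumes the Spark 1.6-era `SQLContext` API used in the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical driver object; not part of the original thread.
object NestedArraySelect {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nested-array-select").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Assumed sample record, shaped like the schema printed in the thread.
    val sample = Seq(
      """{"data":[{"stuff":[{"onetype":[{"id":1,"name":"John Doe"}]},{"othertype":[{"company":"ACME","id":2}]}]}]}"""
    )
    val df = sqlContext.read.json(sc.parallelize(sample))

    // Tree view of the inferred schema: data and stuff both show up as arrays.
    df.printSchema()

    // explode yields one row per array element, so each array level becomes
    // a struct column whose fields can then be selected by name.
    val oneType = df
      .select(explode(col("data")).as("d"))
      .select(explode(col("d.stuff")).as("s"))
      .select(col("s.onetype"))

    oneType.show()
    sc.stop()
  }
}
```

With this approach, an element of `stuff` that has no `onetype` field should come back as a null row, which can be filtered out afterwards. This only addresses the array-versus-struct access; it does not provide the "relaxed" mode for schema-incompatible records that the thread proposes tracking in JIRA.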