All right, I looked at the schemas. There is one nullability mismatch, on a scala.Boolean field. In an empty Dataset, that field apparently *cannot* be nullable. However, when I run my code to generate the Dataset, the schema comes back with nullable = true. Effectively:

scala> val empty = spark.createDataset[SomeClass]

scala> empty.printSchema
root
 |-- aCaseClass: struct (nullable = true)
 |    |-- aBool: boolean (nullable = false)

scala> val data = // Dataset#flatMap that returns a Dataset[SomeClass]

scala> data.printSchema
root
 |-- aCaseClass: struct (nullable = true)
 |    |-- aBool: boolean (nullable = true)

scala> empty.union(data)
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

If I switch the Boolean to a java.lang.Boolean, I get nullable = true in the empty schema and the union starts working.
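For what it's worth, the difference reproduces with a one-field toy class. This is just a sketch (the class names are invented, and it assumes spark.implicits._ is in scope):

// scala.Boolean is a primitive, so the encoder marks it non-nullable;
// java.lang.Boolean is a boxed reference type, so it comes out nullable.
case class WithPrimitive(aBool: Boolean)
case class WithBoxed(aBool: java.lang.Boolean)

spark.emptyDataset[WithPrimitive].printSchema()
// root
//  |-- aBool: boolean (nullable = false)

spark.emptyDataset[WithBoxed].printSchema()
// root
//  |-- aBool: boolean (nullable = true)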
1) Is there a fix for this that I can apply without jumping through hoops? I don't know the implications of switching to java.lang.Boolean.

2) It looks like this is probably the issue that these PRs fix:
https://github.com/apache/spark/pull/15595
https://github.com/apache/spark/pull/15602

Is there a timeline for 2.0.2? I'm in a situation where I can't easily build from source.
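In the meantime, the workaround I'm tempted to try, rather than changing the type, is to relax the nullability of both schemas before the union by round-tripping through the RDD API. This is an untested sketch (withAllNullable and relax are names I made up, and it only descends into nested structs, not arrays or maps):

import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.types.StructType

// Rebuild `ds` with every field (and nested struct field) marked nullable,
// so two Datasets of the same case class agree on nullability.
def withAllNullable[T: Encoder](ds: Dataset[T]): Dataset[T] = {
  def relax(schema: StructType): StructType =
    StructType(schema.fields.map { f =>
      f.dataType match {
        case struct: StructType => f.copy(dataType = relax(struct), nullable = true)
        case _                  => f.copy(nullable = true)
      }
    })
  // Going through the RDD side-steps the analyzer check; the relaxed schema
  // is then reapplied and the result mapped back to the case class.
  ds.sparkSession.createDataFrame(ds.toDF().rdd, relax(ds.schema)).as[T]
}

With that, withAllNullable(empty).union(withAllNullable(data)) should analyze, since both sides would report nullable = true everywhere.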
On Mon, Oct 24, 2016 at 12:29 PM Cheng Lian <lian.cs....@gmail.com> wrote:

> On 10/22/16 1:42 PM, Efe Selcuk wrote:
>
> Ah, looks similar. Next opportunity I get, I'm going to do a printSchema
> on the two datasets and see if they don't match up.
>
> I assume that unioning the underlying RDDs doesn't run into this problem
> because of less type checking or something along those lines?
>
> Exactly.
>
> On Fri, Oct 21, 2016 at 3:39 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>
> Efe - You probably hit this bug:
> https://issues.apache.org/jira/browse/SPARK-18058
>
> On 10/21/16 2:03 AM, Agraj Mangal wrote:
>
> I have seen this error sometimes when the elements in the schema have
> different nullabilities. Could you print the schemas for data and for
> someCode.thatReturnsADataset() and see if there is any difference between
> the two?
>
> On Fri, Oct 21, 2016 at 9:14 AM, Efe Selcuk <efema...@gmail.com> wrote:
>
> Thanks for the response. What do you mean by "semantically" the same?
> They're both Datasets of the same type, which is a case class, so I would
> expect compile-time integrity of the data. Is there a situation where this
> wouldn't be the case?
>
> Interestingly enough, if I instead create an empty RDD with
> sparkContext.emptyRDD of the same case class type, it works!
>
> So something like:
>
> var data = spark.sparkContext.emptyRDD[SomeData]
>
> // loop
> data = data.union(someCode.thatReturnsADataset().rdd)
> // end loop
>
> data.toDS // so I can union it to the actual Dataset I have elsewhere
>
> On Thu, Oct 20, 2016 at 8:34 PM Agraj Mangal <agraj....@gmail.com> wrote:
>
> I believe this normally comes up when Spark is unable to perform a union
> due to a "difference" in the schemas of the operands. Can you check whether
> the schemas of the two datasets are semantically the same?
>
> On Tue, Oct 18, 2016 at 9:06 AM, Efe Selcuk <efema...@gmail.com> wrote:
>
> Bump!
>
> On Thu, Oct 13, 2016 at 8:25 PM Efe Selcuk <efema...@gmail.com> wrote:
>
> I have a use case where I want to build a dataset based on conditionally
> available data. I thought I'd do something like this:
>
> case class SomeData( ... ) // parameters are basic encodable types like
> strings and BigDecimals
>
> var data = spark.emptyDataset[SomeData]
>
> // loop, determining what data to ingest and process into datasets
> data = data.union(someCode.thatReturnsADataset)
> // end loop
>
> However, I get a runtime exception:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> unresolved operator 'Union;
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
>   at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>   at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
>
> Granted, I'm new to Spark, so this might be an anti-pattern, and I'm open
> to suggestions. However, it doesn't seem like I'm doing anything incorrect
> here; the types are correct. Searching for this error online returns
> results seemingly about working with dataframes that have mismatched
> schemas or a different order of fields, and it seems like bugfixes have
> gone in for those cases.
>
> Thanks in advance.
> Efe
>
> --
> Thanks & Regards,
> Agraj Mangal
>
> --
> Thanks & Regards,
> Agraj Mangal