All right, I looked at the schemas. There is one nullability mismatch, on a
scala.Boolean. In an empty Dataset, that field *cannot* be nullable. However,
when I run my code to generate the Dataset, the schema comes back with
nullable = true. Effectively:

scala> val empty = spark.emptyDataset[SomeClass]
scala> empty.printSchema
root
 |-- aCaseClass: struct (nullable = true)
 |    |-- aBool: boolean (nullable = false)


scala> val data = // Dataset#flatMap that returns a Dataset[SomeClass]
scala> data.printSchema
root
 |-- aCaseClass: struct (nullable = true)
 |    |-- aBool: boolean (nullable = true)

scala> empty.union(data)
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

If I switch the Boolean to a java.lang.Boolean, I get nullable = true in
the empty schema and the union starts working.
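
For reference, a minimal sketch of the difference (assuming
spark.implicits._ is in scope; Primitive and Boxed are hypothetical
stand-ins for my real types):

// scala.Boolean is a JVM primitive, so Spark's encoder marks the field
// non-nullable:
case class Primitive(aBool: Boolean)

// java.lang.Boolean is a boxed reference type, so the encoder marks the
// field nullable:
case class Boxed(aBool: java.lang.Boolean)

spark.emptyDataset[Primitive].printSchema // aBool: nullable = false
spark.emptyDataset[Boxed].printSchema     // aBool: nullable = true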

1) Is there a fix for this that I can apply without jumping through hoops? I
don't know the implications of switching to java.lang.Boolean.
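
One workaround I can think of (untested sketch; it assumes the two schemas
differ only in nullability) is to rebuild the generated Dataset against the
empty Dataset's schema before the union:

// Round-trip through a DataFrame so the empty Dataset's schema (with its
// non-nullable aBool) can be re-applied, then recover the typed Dataset.
val aligned = spark.createDataFrame(data.toDF().rdd, empty.schema).as[SomeClass]
empty.union(aligned) // analyzes cleanly once the nullability flags agree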

2) It looks like this is probably the issue that these PRs fix:
https://github.com/apache/spark/pull/15595 and
https://github.com/apache/spark/pull/15602. Is there a timeline for 2.0.2?
I'm in a situation where I can't easily build from source.

On Mon, Oct 24, 2016 at 12:29 PM Cheng Lian <lian.cs....@gmail.com> wrote:

>
>
> On 10/22/16 1:42 PM, Efe Selcuk wrote:
>
> Ah, looks similar. Next opportunity I get, I'm going to run printSchema
> on the two datasets and see whether they match up.
>
> I assume that unioning the underlying RDDs doesn't run into this problem
> because of less type checking or something along those lines?
>
> Exactly.
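>
> Concretely, a minimal sketch of that escape hatch (untested; assumes both
> sides are Dataset[SomeClass] and spark.implicits._ is in scope for .toDS):
>
> // RDD#union is a purely physical concatenation: it never goes through
> // the analyzer, so the nullability mismatch is never checked.
> val combined = empty.rdd.union(data.rdd).toDS()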
>
>
> On Fri, Oct 21, 2016 at 3:39 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>
> Efe - You probably hit this bug:
> https://issues.apache.org/jira/browse/SPARK-18058
>
> On 10/21/16 2:03 AM, Agraj Mangal wrote:
>
> I have seen this error sometimes when the elements in the schema have
> different nullabilities. Could you print the schema for data and for
> someCode.thatReturnsADataset() and see if there is any difference between
> the two?
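>
> Something along these lines (sketch; StructType equality also compares
> the nullable flags, so it catches nullability-only differences):
>
> data.printSchema()
> someCode.thatReturnsADataset().printSchema()
> println(data.schema == someCode.thatReturnsADataset().schema)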
>
> On Fri, Oct 21, 2016 at 9:14 AM, Efe Selcuk <efema...@gmail.com> wrote:
>
> Thanks for the response. What do you mean by "semantically" the same?
> They're both Datasets of the same type, which is a case class, so I would
> expect compile-time integrity of the data. Is there a situation where this
> wouldn't be the case?
>
> Interestingly enough, if I instead create an empty rdd with
> sparkContext.emptyRDD of the same case class type, it works!
>
> So something like:
>
> import spark.implicits._ // needed for the .toDS conversion at the end
>
> var data = spark.sparkContext.emptyRDD[SomeData]
>
> // loop, accumulating at the RDD level
>   data = data.union(someCode.thatReturnsADataset().rdd)
> // end loop
>
> data.toDS // so I can union it with the actual Dataset I have elsewhere
>
> On Thu, Oct 20, 2016 at 8:34 PM Agraj Mangal <agraj....@gmail.com> wrote:
>
> I believe this normally happens when Spark is unable to perform the union
> due to a difference in the schemas of the operands. Can you check whether
> the schemas of the two datasets are semantically the same?
>
> On Tue, Oct 18, 2016 at 9:06 AM, Efe Selcuk <efema...@gmail.com> wrote:
>
> Bump!
>
> On Thu, Oct 13, 2016 at 8:25 PM Efe Selcuk <efema...@gmail.com> wrote:
>
> I have a use case where I want to build a dataset based off of
> conditionally available data. I thought I'd do something like this:
>
> case class SomeData( ... ) // parameters are basic encodable types like
> strings and BigDecimals
>
> import spark.implicits._ // provides the implicit Encoder[SomeData]
>
> var data = spark.emptyDataset[SomeData]
>
> // loop, determining what data to ingest and process into datasets
>   data = data.union(someCode.thatReturnsADataset)
> // end loop
>
> However I get a runtime exception:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> unresolved operator 'Union;
>         at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>         at
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
>         at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
>         at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>         at
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>         at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>         at
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>         at
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>         at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
>         at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>         at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
>         at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>         at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
>
> Granted, I'm new to Spark, so this might be an anti-pattern, and I'm open
> to suggestions. However, it doesn't seem like I'm doing anything incorrect
> here; the types are correct. Searching for this error online returns
> results that are mostly about DataFrames with mismatched schemas or a
> different field order, and it seems like bug fixes have gone in for those
> cases.
>
> Thanks in advance.
> Efe
>
>
>
>
> --
> Thanks & Regards,
> Agraj Mangal
>
>
>
>
> --
> Thanks & Regards,
> Agraj Mangal