Re: [Spark 2.0.0] error when unioning to an empty dataset

Cheng Lian Mon, 24 Oct 2016 12:30:06 -0700


On 10/22/16 1:42 PM, Efe Selcuk wrote:

Ah, looks similar. Next opportunity I get, I'm going to do a printSchema on the two datasets and see if they don't match up.
I assume that unioning the underlying RDDs doesn't run into this problem because of less type checking or something along those lines?

Exactly.

On Fri, Oct 21, 2016 at 3:39 PM Cheng Lian <lian.cs....@gmail.com> wrote:


    Efe - You probably hit this bug:
    https://issues.apache.org/jira/browse/SPARK-18058


    On 10/21/16 2:03 AM, Agraj Mangal wrote:

    I have seen this error sometimes when the elements in the schema
    have different nullabilities. Could you print the schema for
    data and for someCode.thatReturnsADataset() and see if there is
    any difference between the two?
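
A minimal sketch of that schema comparison, assuming a local SparkSession. The fields of SomeData are hypothetical (the thread elides them), schemaDiffs is an illustrative helper rather than a Spark API, and the one-row Dataset merely stands in for someCode.thatReturnsADataset():

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Hypothetical fields; the thread elides the real parameters of SomeData.
case class SomeData(name: String, amount: BigDecimal)

object SchemaCheck {
  // Illustrative helper (not a Spark API): compares two schemas position by
  // position and describes each field whose name, type, or nullability differs.
  def schemaDiffs(left: StructType, right: StructType): Seq[String] = {
    val sizeDiff =
      if (left.length != right.length)
        Seq(s"field count differs: ${left.length} vs ${right.length}")
      else Seq.empty[String]
    val fieldDiffs = left.fields.zip(right.fields).toSeq.collect {
      case (l, r) if l.name != r.name || l.dataType != r.dataType || l.nullable != r.nullable =>
        s"${l.name}: ${l.dataType.simpleString} (nullable=${l.nullable}) vs " +
          s"${r.name}: ${r.dataType.simpleString} (nullable=${r.nullable})"
    }
    sizeDiff ++ fieldDiffs
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-check")
      .getOrCreate()
    import spark.implicits._

    val data = spark.emptyDataset[SomeData]
    // Stand-in for someCode.thatReturnsADataset()
    val other = Seq(SomeData("a", BigDecimal(1))).toDS()

    // Print both schemas, then list any field-level differences.
    data.printSchema()
    other.printSchema()
    schemaDiffs(data.schema, other.schema).foreach(println)

    spark.stop()
  }
}
```

An empty result from schemaDiffs only means the two schemas agree on these attributes; it does not by itself rule out other causes of the exception.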

    On Fri, Oct 21, 2016 at 9:14 AM, Efe Selcuk <efema...@gmail.com> wrote:

        Thanks for the response. What do you mean by "semantically"
        the same? They're both Datasets of the same type, which is a
        case class, so I would expect compile-time integrity of the
        data. Is there a situation where this wouldn't be the case?

        Interestingly enough, if I instead create an empty rdd with
        sparkContext.emptyRDD of the same case class type, it works!

        So something like:
        var data = spark.sparkContext.emptyRDD[SomeData]

        // loop
        data = data.union(someCode.thatReturnsADataset().rdd)
        // end loop

        data.toDS // so I can union it to the actual Dataset I have
                  // elsewhere
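
The workaround above can be sketched as a self-contained program, assuming a local SparkSession. The fields of SomeData are hypothetical, unionViaRdd is an illustrative helper name, and the small in-memory batches stand in for the results of someCode.thatReturnsADataset():

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical fields; the thread elides the real parameters of SomeData.
case class SomeData(name: String, amount: BigDecimal)

object RddUnionWorkaround {
  // Illustrative helper: unions the underlying RDDs, starting from an empty
  // RDD, and converts back to a Dataset only once at the end.
  def unionViaRdd(spark: SparkSession, batches: Seq[Dataset[SomeData]]): Dataset[SomeData] = {
    import spark.implicits._
    val empty: RDD[SomeData] = spark.sparkContext.emptyRDD[SomeData]
    val combined = batches.foldLeft(empty)((acc, ds) => acc.union(ds.rdd))
    combined.toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-union-workaround")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for the Datasets produced inside the loop.
    val batches = Seq(
      Seq(SomeData("a", BigDecimal(1))).toDS(),
      Seq(SomeData("b", BigDecimal(2))).toDS()
    )

    unionViaRdd(spark, batches).show()
    spark.stop()
  }
}
```

The fold replaces the mutable var and loop from the snippet above; the union itself is the same RDD-level operation.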

        On Thu, Oct 20, 2016 at 8:34 PM Agraj Mangal <agraj....@gmail.com> wrote:

            I believe this normally occurs when Spark is unable to
            perform the union due to a "difference" in the schemas of
            the operands. Can you check whether the schemas of both
            datasets are semantically the same?

            On Tue, Oct 18, 2016 at 9:06 AM, Efe Selcuk <efema...@gmail.com> wrote:

                Bump!

                On Thu, Oct 13, 2016 at 8:25 PM Efe Selcuk <efema...@gmail.com> wrote:

                    I have a use case where I want to build a dataset
                    based off of conditionally available data. I
                    thought I'd do something like this:

                    case class SomeData( ... ) // parameters are basic
                    // encodable types like strings and BigDecimals

                    var data = spark.emptyDataset[SomeData]

                    // loop, determining what data to ingest and
                    // process into datasets
                    data = data.union(someCode.thatReturnsADataset)
                    // end loop

                    However I get a runtime exception:

                    Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
                        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
                        at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
                        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
                        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
                        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
                        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
                        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
                        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
                        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
                        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
                        at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
                        at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
                        at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)

                    Granted, I'm new to Spark, so this might be an
                    anti-pattern, and I'm open to suggestions. However,
                    it doesn't seem like I'm doing anything incorrect
                    here; the types are correct. Searching for this
                    error online returns results that seem to be about
                    working with DataFrames and having mismatched
                    schemas or a different order of fields, and it
                    seems like bugfixes have gone into place for
                    those cases.

                    Thanks in advance.
                    Efe

            --
            Thanks & Regards,

            Agraj Mangal

    --
    Thanks & Regards,

    Agraj Mangal
