Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-24 Thread Efe Selcuk
All right, I looked at the schemas. There is one nullability mismatch, on a
scala.Boolean field: in the empty Dataset that field *cannot* be nullable,
yet when I run my code to generate the Dataset, the schema comes back with
nullable = true. Effectively:

scala> val empty = spark.emptyDataset[SomeClass]
scala> empty.printSchema
root
 |-- aCaseClass: struct (nullable = true)
 |    |-- aBool: boolean (nullable = false)


scala> val data = // Dataset#flatMap that returns a Dataset[SomeClass]
scala> data.printSchema
root
 |-- aCaseClass: struct (nullable = true)
 |    |-- aBool: boolean (nullable = true)

scala> empty.union(data)
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

If I switch the Boolean to a java.lang.Boolean, I get nullable = true in
the empty schema and the union starts working.
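For anyone following along, the behavior above falls out of the JVM type system; here is a minimal plain-Scala sketch (no Spark needed, and `NullabilityDemo` is just an illustrative name). scala.Boolean is the primitive boolean, which has no null representation, so the encoder marks it nullable = false; java.lang.Boolean is a boxed reference type, so null is a legal value and it encodes as nullable = true.

```scala
// Why a scala.Boolean field encodes as nullable = false: the primitive
// boolean cannot hold null, while the boxed java.lang.Boolean can.
object NullabilityDemo {
  def main(args: Array[String]): Unit = {
    val boxed: java.lang.Boolean = null // legal: reference type
    // val prim: scala.Boolean = null   // would not compile: primitive
    assert(boxed == null)
    val fromPrim: java.lang.Boolean = true // boxing never yields null
    assert(fromPrim != null && fromPrim.booleanValue)
  }
}
```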

1) Is there a fix for this that I can do without jumping through hoops? I
don't know the implications of switching to java.lang.Boolean.

2) It looks like this is probably the issue that these PRs fix:
https://github.com/apache/spark/pull/15595 and
https://github.com/apache/spark/pull/15602. Is there a timeline for 2.0.2?
I'm in a situation where I can't easily build from source.
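In the meantime, one way to dodge the empty seed entirely (a sketch, not something from this thread's codebase) is to collect the real Datasets and fold them with reduceOption, so an empty accumulator with mismatched nullability never enters a union. The combining logic is shown with List standing in for Dataset so it runs anywhere; SomeData's fields here are invented for the example:

```scala
// Pattern sketch with List standing in for Dataset: reduceOption only
// combines real, non-empty pieces, so there is no empty seed whose
// schema could disagree with the data's schema.
case class SomeData(id: String, amount: BigDecimal)

object AccumulateWithoutEmptySeed {
  def combine(parts: Seq[List[SomeData]]): List[SomeData] =
    parts.reduceOption(_ ++ _).getOrElse(List.empty)
}
```

With actual Datasets the fold body would be `_ union _`, with `spark.emptyDataset[SomeData]` as the fallback taken only when there is nothing to union.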

On Mon, Oct 24, 2016 at 12:29 PM Cheng Lian  wrote:

>
>
> On 10/22/16 1:42 PM, Efe Selcuk wrote:
>
> Ah, looks similar. Next opportunity I get, I'm going to do a printSchema
> on the two datasets and see if they don't match up.
>
> I assume that unioning the underlying RDDs doesn't run into this problem
> because of less type checking or something along those lines?
>
> Exactly.
>
>
> On Fri, Oct 21, 2016 at 3:39 PM Cheng Lian  wrote:
>
> Efe - You probably hit this bug:
> https://issues.apache.org/jira/browse/SPARK-18058
>
> On 10/21/16 2:03 AM, Agraj Mangal wrote:
>
> I have seen this error sometimes when the elements in the schema have
> different nullabilities. Could you print the schema for data and for
> someCode.thatReturnsADataset() and see if there is any difference between
> the two ?
>
> On Fri, Oct 21, 2016 at 9:14 AM, Efe Selcuk  wrote:
>
> Thanks for the response. What do you mean by "semantically" the same?
> They're both Datasets of the same type, which is a case class, so I would
> expect compile-time integrity of the data. Is there a situation where this
> wouldn't be the case?
>
> Interestingly enough, if I instead create an empty rdd with
> sparkContext.emptyRDD of the same case class type, it works!
>
> So something like:
> var data = spark.sparkContext.emptyRDD[SomeData]
>
> // loop
>   data = data.union(someCode.thatReturnsADataset().rdd)
> // end loop
>
> data.toDS //so I can union it to the actual Dataset I have elsewhere
>
> On Thu, Oct 20, 2016 at 8:34 PM Agraj Mangal  wrote:
>
> I believe this normally comes when Spark is unable to perform union due to
> "difference" in schema of the operands. Can you check if the schema of both
> the datasets are semantically same ?
>
> On Tue, Oct 18, 2016 at 9:06 AM, Efe Selcuk  wrote:
>
> Bump!
>
> On Thu, Oct 13, 2016 at 8:25 PM Efe Selcuk  wrote:
>
> I have a use case where I want to build a dataset based off of
> conditionally available data. I thought I'd do something like this:
>
> case class SomeData( ... ) // parameters are basic encodable types like
> strings and BigDecimals
>
> var data = spark.emptyDataset[SomeData]
>
> // loop, determining what data to ingest and process into datasets
>   data = data.union(someCode.thatReturnsADataset)
> // end loop
>
> However I get a runtime exception:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> unresolved operator 'Union;
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
> at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
> at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
> at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
> at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
>
> Granted, I'm new at Spark so this might be an 

[Spark 2.0.0] error when unioning to an empty dataset

2016-10-13 Thread Efe Selcuk
I have a use case where I want to build a dataset based off of
conditionally available data. I thought I'd do something like this:

case class SomeData( ... ) // parameters are basic encodable types like
strings and BigDecimals

var data = spark.emptyDataset[SomeData]

// loop, determining what data to ingest and process into datasets
  data = data.union(someCode.thatReturnsADataset)
// end loop

However I get a runtime exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
unresolved operator 'Union;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)

Granted, I'm new at Spark so this might be an anti-pattern, so I'm open to
suggestions. However it doesn't seem like I'm doing anything incorrect
here, the types are correct. Searching for this error online returns
results seemingly about working in dataframes and having mismatching
schemas or a different order of fields, and it seems like bugfixes have
gone into place for those cases.

Thanks in advance.
Efe
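A closing note for readers who land on this thread: the analyzer rejects 'Union whenever the two plans' schemas are not exactly equal, and nullability is part of that equality. A toy model of such a check in plain Scala (Field and exactMatch are illustrative names, not Spark's implementation):

```scala
// Toy model of a schema-compatibility check like the one union performs.
// It shows how two identical-looking types can differ only in the
// nullable flag and still fail an exact comparison.
case class Field(name: String, dataType: String, nullable: Boolean)

object SchemaCheck {
  def exactMatch(a: Seq[Field], b: Seq[Field]): Boolean = a == b

  def main(args: Array[String]): Unit = {
    val fromEmpty   = Seq(Field("aBool", "boolean", nullable = false))
    val fromFlatMap = Seq(Field("aBool", "boolean", nullable = true))
    assert(!exactMatch(fromEmpty, fromFlatMap)) // union would be rejected
    assert(exactMatch(fromEmpty, fromEmpty))
  }
}
```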