how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Hi folks,

I have a need to "append" two dataframes -- I was hoping to use unionAll,
but it seems that this operation treats the underlying dataframes as a
sequence of columns rather than as a map.

In particular, my problem is that the columns in the two DFs are not in the
same order -- notice that my customer_id somehow comes out as a string:

This is Spark 1.4.1

case class Test(epoch: Long, browser: String, customer_id: Int, uri: String)
val test = Test(1234L, "firefox", 999, "http://foobar")

case class Test1(customer_id: Int, uri: String, browser: String, epoch: Long)
val test1 = Test1(888, "http://foobar", "ie", 12343)
val df = sc.parallelize(Seq(test)).toDF
val df1 = sc.parallelize(Seq(test1)).toDF
df.unionAll(df1)

// res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string, customer_id: string, uri: string]


Is unionAll the wrong operation? Any special incantations? Or advice on how
to otherwise get this to succeed?
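
One way around this, assuming both DFs keep the same column names: since
unionAll matches columns by position rather than by name, reordering one side
with select lines the schemas up first. A minimal sketch against the case
classes above:

// Reorder df1's columns to match df's schema before the positional union.
df.unionAll(df1.select("epoch", "browser", "customer_id", "uri"))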


Re: how to merge two dataframes

2015-10-30 Thread Ted Yu
How about the following?

scala> df.registerTempTable("df")
scala> df1.registerTempTable("df1")
scala> sql("select customer_id, uri, browser, epoch from df union select
customer_id, uri, browser, epoch from df1").show()
+-----------+-------------+-------+-----+
|customer_id|          uri|browser|epoch|
+-----------+-------------+-------+-----+
|        999|http://foobar|firefox| 1234|
|        888|http://foobar|     ie|12343|
+-----------+-------------+-------+-----+
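
One caveat, assuming standard SQL semantics here: plain union should also
remove duplicate rows, while DataFrame.unionAll keeps them, so union all may
be the safer form if the goal is to match unionAll exactly:

scala> sql("select customer_id, uri, browser, epoch from df union all select customer_id, uri, browser, epoch from df1").show()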

Cheers

On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska wrote:

> Hi folks,
>
> I have a need to "append" two dataframes -- I was hoping to use unionAll,
> but it seems that this operation treats the underlying dataframes as a
> sequence of columns rather than as a map.
>
> In particular, my problem is that the columns in the two DFs are not in
> the same order -- notice that my customer_id somehow comes out as a string:
>
> This is Spark 1.4.1
>
> case class Test(epoch: Long, browser: String, customer_id: Int, uri: String)
> val test = Test(1234L, "firefox", 999, "http://foobar")
>
> case class Test1(customer_id: Int, uri: String, browser: String, epoch: Long)
> val test1 = Test1(888, "http://foobar", "ie", 12343)
> val df = sc.parallelize(Seq(test)).toDF
> val df1 = sc.parallelize(Seq(test1)).toDF
> df.unionAll(df1)
>
> // res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string,
> // customer_id: string, uri: string]
>
>
> Is unionAll the wrong operation? Any special incantations? Or advice on
> how to otherwise get this to succeed?
>


Re: how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Not a bad idea, I suspect, but it doesn't help me. I dumbed down the repro to
ask for help. In reality one of my dataframes is a Cassandra DF, so
cassDF.registerTempTable("df1") registers the temp table in a different SQL
context (new CassandraSQLContext(sc)).


scala> sql("select customer_id, uri, browser, epoch from df union all
select customer_id, uri, browser, epoch from df1").show()
org.apache.spark.sql.AnalysisException: no such table df1; line 1 pos 103
at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:225)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
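
One possible workaround, assuming both contexts wrap the same SparkContext:
rebuild the Cassandra DF inside the plain SQLContext from its RDD and schema,
so both temp tables resolve in a single context. A rough sketch (cassDF as
above; sqlContext is the non-Cassandra context):

// Recreate the Cassandra-backed DF in the regular SQLContext...
val cassLocal = sqlContext.createDataFrame(cassDF.rdd, cassDF.schema)
// ...and register it where the other temp table lives.
cassLocal.registerTempTable("df1")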


On Fri, Oct 30, 2015 at 3:34 PM, Ted Yu  wrote:

> How about the following?
>
> scala> df.registerTempTable("df")
> scala> df1.registerTempTable("df1")
> scala> sql("select customer_id, uri, browser, epoch from df union select
> customer_id, uri, browser, epoch from df1").show()
> +-----------+-------------+-------+-----+
> |customer_id|          uri|browser|epoch|
> +-----------+-------------+-------+-----+
> |        999|http://foobar|firefox| 1234|
> |        888|http://foobar|     ie|12343|
> +-----------+-------------+-------+-----+
>
> Cheers
>
> On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska wrote:
>
>> Hi folks,
>>
>> I have a need to "append" two dataframes -- I was hoping to use unionAll,
>> but it seems that this operation treats the underlying dataframes as a
>> sequence of columns rather than as a map.
>>
>> In particular, my problem is that the columns in the two DFs are not in
>> the same order -- notice that my customer_id somehow comes out as a string:
>>
>> This is Spark 1.4.1
>>
>> case class Test(epoch: Long, browser: String, customer_id: Int, uri: String)
>> val test = Test(1234L, "firefox", 999, "http://foobar")
>>
>> case class Test1(customer_id: Int, uri: String, browser: String, epoch: Long)
>> val test1 = Test1(888, "http://foobar", "ie", 12343)
>> val df = sc.parallelize(Seq(test)).toDF
>> val df1 = sc.parallelize(Seq(test1)).toDF
>> df.unionAll(df1)
>>
>> // res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string,
>> // customer_id: string, uri: string]
>>
>>
>> Is unionAll the wrong operation? Any special incantations? Or advice on
>> how to otherwise get this to succeed?
>>
>
>


Re: how to merge two dataframes

2015-10-30 Thread Ted Yu
I see - you were trying to union a non-Cassandra DF with a Cassandra DF :-(

On Fri, Oct 30, 2015 at 12:57 PM, Yana Kadiyska wrote:

> Not a bad idea, I suspect, but it doesn't help me. I dumbed down the repro
> to ask for help. In reality one of my dataframes is a Cassandra DF, so
> cassDF.registerTempTable("df1") registers the temp table in a different SQL
> context (new CassandraSQLContext(sc)).
>
>
> scala> sql("select customer_id, uri, browser, epoch from df union all
> select customer_id, uri, browser, epoch from df1").show()
> org.apache.spark.sql.AnalysisException: no such table df1; line 1 pos 103
> at
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:225)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
>
>
> On Fri, Oct 30, 2015 at 3:34 PM, Ted Yu  wrote:
>
>> How about the following?
>>
>> scala> df.registerTempTable("df")
>> scala> df1.registerTempTable("df1")
>> scala> sql("select customer_id, uri, browser, epoch from df union select
>> customer_id, uri, browser, epoch from df1").show()
>> +-----------+-------------+-------+-----+
>> |customer_id|          uri|browser|epoch|
>> +-----------+-------------+-------+-----+
>> |        999|http://foobar|firefox| 1234|
>> |        888|http://foobar|     ie|12343|
>> +-----------+-------------+-------+-----+
>>
>> Cheers
>>
>> On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska wrote:
>>
>>> Hi folks,
>>>
>>> I have a need to "append" two dataframes -- I was hoping to use unionAll,
>>> but it seems that this operation treats the underlying dataframes as a
>>> sequence of columns rather than as a map.
>>>
>>> In particular, my problem is that the columns in the two DFs are not in
>>> the same order -- notice that my customer_id somehow comes out as a string:
>>>
>>> This is Spark 1.4.1
>>>
>>> case class Test(epoch: Long, browser: String, customer_id: Int, uri: String)
>>> val test = Test(1234L, "firefox", 999, "http://foobar")
>>>
>>> case class Test1(customer_id: Int, uri: String, browser: String, epoch: Long)
>>> val test1 = Test1(888, "http://foobar", "ie", 12343)
>>> val df = sc.parallelize(Seq(test)).toDF
>>> val df1 = sc.parallelize(Seq(test1)).toDF
>>> df.unionAll(df1)
>>>
>>> // res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string,
>>> // customer_id: string, uri: string]
>>>
>>>
>>> Is unionAll the wrong operation? Any special incantations? Or advice on
>>> how to otherwise get this to succeed?
>>>
>>
>>
>


Re: how to merge two dataframes

2015-10-30 Thread Silvio Fiorito
Are you able to upgrade to Spark 1.5.1 and the Cassandra connector to the
latest version? It no longer requires a separate CassandraSQLContext.
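
For reference, a rough sketch of what that looks like with connector 1.5,
where a Cassandra table is read through the ordinary SQLContext (the keyspace
and table names below are placeholders):

// Load the Cassandra table as a DataFrame in the same SQLContext as df,
// so temp tables and unionAll work without a CassandraSQLContext.
val cassDF = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .load()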

From: Yana Kadiyska <yana.kadiy...@gmail.com>
Reply-To: "yana.kadiy...@gmail.com" <yana.kadiy...@gmail.com>
Date: Friday, October 30, 2015 at 3:57 PM
To: Ted Yu <yuzhih...@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: how to merge two dataframes

Not a bad idea, I suspect, but it doesn't help me. I dumbed down the repro to
ask for help. In reality one of my dataframes is a Cassandra DF, so
cassDF.registerTempTable("df1") registers the temp table in a different SQL
context (new CassandraSQLContext(sc)).


scala> sql("select customer_id, uri, browser, epoch from df union all select 
customer_id, uri, browser, epoch from df1").show()
org.apache.spark.sql.AnalysisException: no such table df1; line 1 pos 103
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:225)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)


On Fri, Oct 30, 2015 at 3:34 PM, Ted Yu <yuzhih...@gmail.com> wrote:
How about the following?

scala> df.registerTempTable("df")
scala> df1.registerTempTable("df1")
scala> sql("select customer_id, uri, browser, epoch from df union select 
customer_id, uri, browser, epoch from df1").show()
+-----------+-------------+-------+-----+
|customer_id|          uri|browser|epoch|
+-----------+-------------+-------+-----+
|        999|http://foobar|firefox| 1234|
|        888|http://foobar|     ie|12343|
+-----------+-------------+-------+-----+

Cheers

On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
Hi folks,

I have a need to "append" two dataframes -- I was hoping to use unionAll, but
it seems that this operation treats the underlying dataframes as a sequence
of columns rather than as a map.

In particular, my problem is that the columns in the two DFs are not in the
same order -- notice that my customer_id somehow comes out as a string:

This is Spark 1.4.1

case class Test(epoch: Long, browser: String, customer_id: Int, uri: String)
val test = Test(1234L, "firefox", 999, "http://foobar")

case class Test1(customer_id: Int, uri: String, browser: String, epoch: Long)
val test1 = Test1(888, "http://foobar", "ie", 12343)
val df = sc.parallelize(Seq(test)).toDF
val df1 = sc.parallelize(Seq(test1)).toDF
df.unionAll(df1)

// res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string, customer_id: string, uri: string]



Is unionAll the wrong operation? Any special incantations? Or advice on how
to otherwise get this to succeed?