how to merge two dataframes
Hi folks,

I have a need to "append" two dataframes -- I was hoping to use unionAll, but it seems that this operation treats the underlying dataframes as a sequence of columns, rather than a map. In particular, my problem is that the columns in the two DFs are not in the same order -- notice that my customer_id somehow comes out a string.

This is Spark 1.4.1:

case class Test(epoch: Long, browser: String, customer_id: Int, uri: String)
val test = Test(1234L, "firefox", 999, "http://foobar")

case class Test1(customer_id: Int, uri: String, browser: String, epoch: Long)
val test1 = Test1(888, "http://foobar", "ie", 12343L)

val df = sc.parallelize(Seq(test)).toDF
val df1 = sc.parallelize(Seq(test1)).toDF
df.unionAll(df1)
// res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string, customer_id: string, uri: string]

Is unionAll the wrong operation? Any special incantations? Or advice on how to otherwise get this to succeed?
Re: how to merge two dataframes
How about the following?

scala> df.registerTempTable("df")
scala> df1.registerTempTable("df1")
scala> sql("select customer_id, uri, browser, epoch from df union select customer_id, uri, browser, epoch from df1").show()
+-----------+-------------+-------+-----+
|customer_id|          uri|browser|epoch|
+-----------+-------------+-------+-----+
|        999|http://foobar|firefox| 1234|
|        888|http://foobar|     ie|12343|
+-----------+-------------+-------+-----+

Cheers

On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska wrote:
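The same reordering can also be done without SQL, by selecting df's column list against df1 before the union. This is a sketch against the Spark 1.4 DataFrame API and assumes both frames share the same column names; note also that SQL `union` removes duplicate rows, while `unionAll` keeps them.

```scala
// Align df1's columns to df's positional order, then append.
// unionAll matches columns by position, not by name, so reordering
// first keeps customer_id an int instead of silently changing type.
import org.apache.spark.sql.functions.col

val merged = df.unionAll(df1.select(df.columns.map(col): _*))
// merged: [epoch: bigint, browser: string, customer_id: int, uri: string]
```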
Re: how to merge two dataframes
Not a bad idea I suspect, but it doesn't help me. I dumbed down the repro to ask for help. In reality one of my dataframes is a Cassandra DF, so cassDF.registerTempTable("df1") registers the temp table in a different SQLContext (new CassandraSQLContext(sc)).

scala> sql("select customer_id, uri, browser, epoch from df union all select customer_id, uri, browser, epoch from df1").show()
org.apache.spark.sql.AnalysisException: no such table df1; line 1 pos 103
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:225)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)

On Fri, Oct 30, 2015 at 3:34 PM, Ted Yu wrote:
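One possible workaround for the two-context problem, sketched here and untested: since CassandraSQLContext extends SQLContext, the non-Cassandra DF could be recreated inside the Cassandra context from its rows and schema, so both temp tables live in one context. The table names are taken from the repro above.

```scala
// Hypothetical sketch: move the regular DF into the
// CassandraSQLContext so both temp tables resolve in one context.
import org.apache.spark.sql.cassandra.CassandraSQLContext

val csc = new CassandraSQLContext(sc)
cassDF.registerTempTable("df1")

// Rebuild df inside csc from its underlying RDD[Row] and schema.
csc.createDataFrame(df.rdd, df.schema).registerTempTable("df")

csc.sql("select customer_id, uri, browser, epoch from df union all " +
        "select customer_id, uri, browser, epoch from df1").show()
```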
Re: how to merge two dataframes
I see - you were trying to union a non-Cassandra DF with Cassandra DF :-(

On Fri, Oct 30, 2015 at 12:57 PM, Yana Kadiyska wrote:
Re: how to merge two dataframes
Are you able to upgrade to Spark 1.5.1 and the Cassandra connector to the latest version? It no longer requires a separate CassandraSQLContext.

From: Yana Kadiyska <yana.kadiy...@gmail.com>
Reply-To: "yana.kadiy...@gmail.com" <yana.kadiy...@gmail.com>
Date: Friday, October 30, 2015 at 3:57 PM
To: Ted Yu <yuzhih...@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: how to merge two dataframes
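With Spark 1.5.x and spark-cassandra-connector 1.5.x, the Cassandra table can be read through the regular SQLContext via the data source API, so both frames share one context and the position-aligned union works directly. A sketch follows; the keyspace and table names are placeholders, and column names are assumed to match the repro above.

```scala
// Sketch for Spark 1.5.x + spark-cassandra-connector 1.5.x:
// load the Cassandra table as a data source, then union with
// columns reordered to match df's positional schema.
import org.apache.spark.sql.functions.col

val cassDF = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table")) // placeholders
  .load()

val merged = df.unionAll(cassDF.select(df.columns.map(col): _*))
```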