spark sql union all is slow

2014-10-14 Thread shuluster
I have many tables with the same schema, partitioned by time. For example,
one id can appear in several of those tables. I would like to compute an
aggregate over such ids. The tables are originally stored on HDFS as files.
Once each table's SchemaRDD is loaded, I call cacheTable on it. Each table is
around 30m - 100m of serialized data.

The SQL I composed looks like the following:

select id, sum(cost) as cost from (
  ((select id, sum(cost) as cost from table_1
    where id = 1 group by id)
   union all
   (select id, sum(cost) as cost from table_2
    where id = 1 group by id))
  union all
  (select id, sum(cost) as cost from table_3
   where id = 1 group by id)
) as temp_table
group by id


The call to sparkSqlContext.sql() takes a long time to return a SchemaRDD,
while the subsequent collect() on that RDD is not too slow.

Is there something I am doing wrong here? Any tips on how to debug this?
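
One way to confirm where the time goes is to time the sql() call and the collect() call separately. The following is a minimal sketch in Python, not code from this thread: the helper name timed and the two stand-in functions are hypothetical, and they exist only so the snippet runs without a cluster. With Spark, the stand-ins would be replaced by the real calls.

```python
import time

def timed(label, fn):
    """Run fn(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

# Hypothetical stand-ins so the sketch runs on its own. With Spark, pass
# something like `lambda: sqlContext.sql(query)` and `lambda: rdd.collect()`.
def fake_sql():
    time.sleep(0.05)  # pretend that building the query plan is the slow part
    return "schemaRDD placeholder"

def fake_collect():
    return [(1, 20.0)]

rdd, plan_seconds = timed("sql()", fake_sql)
rows, exec_seconds = timed("collect()", fake_collect)
```

If the first number dominates, the cost is in building the query rather than in running it, which matches the behaviour described above.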




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-union-all-is-slow-tp16407.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql union all is slow

2014-10-14 Thread Pei-Lun Lee
Hi,

You can merge them into one table with:

sqlContext.table("table_1")
  .unionAll(sqlContext.table("table_2"))
  .unionAll(sqlContext.table("table_3"))
  .registerTempTable("table_all")

Or load them in one call by:

sqlContext.parquetFile("table_1.parquet,table_2.parquet,table_3.parquet").registerTempTable("table_all")
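
Once the tables are merged into table_all, the nested query from the original post can be written as a single aggregation. The sketch below checks that the two forms return the same result. It makes several assumptions: it uses Python's built-in sqlite3 as a stand-in for Spark SQL, the rows are made up for illustration, and the per-table subqueries are written without the extra parentheses because SQLite does not accept them. It says nothing about Spark's performance, only about the equivalence of the two queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Made-up sample rows (id, cost); the real tables live on HDFS.
sample = {
    "table_1": [(1, 10.0), (2, 5.0)],
    "table_2": [(1, 2.5), (1, 1.5)],
    "table_3": [(3, 7.0), (1, 6.0)],
}
for name, rows in sample.items():
    cur.execute(f"CREATE TABLE {name} (id INTEGER, cost REAL)")
    cur.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)

# Stand-in for the merged table_all suggested above.
union_all = " UNION ALL ".join(f"SELECT id, cost FROM {name}" for name in sample)
cur.execute(f"CREATE VIEW table_all AS {union_all}")

# Original shape: aggregate each table, union the partial sums, aggregate again.
partials = " UNION ALL ".join(
    f"SELECT id, SUM(cost) AS cost FROM {name} WHERE id = 1 GROUP BY id"
    for name in sample
)
nested_query = (
    f"SELECT id, SUM(cost) AS cost FROM ({partials}) AS temp_table GROUP BY id"
)

# Simplified shape: one filter and one aggregation over the merged table.
flat_query = "SELECT id, SUM(cost) AS cost FROM table_all WHERE id = 1 GROUP BY id"

nested_result = cur.execute(nested_query).fetchall()
flat_result = cur.execute(flat_query).fetchall()
print(nested_result, flat_result)
```

The flat query is the one to pass to sqlContext.sql() after registering table_all.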
