Hi,
You can merge them into one table like this (note: unionAll is a method on the SchemaRDD, and it's registerTempTable):
  sqlContext.table("table_1")
    .unionAll(sqlContext.table("table_2"))
    .unionAll(sqlContext.table("table_3"))
    .registerTempTable("table_all")
Or load them in one call by:
  sqlContext.parquetFile("table_1.parquet", "table_2.parquet", "table_3.parquet")
    .registerTempTable("table_all")
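(For readers without a Spark shell handy, the merge-then-register pattern can be sketched with Python's sqlite3 from the standard library; the toy data and the view name below are illustrative, not from the original job. A view plays the role of the registered temp table.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Three partitions with the same schema, standing in for table_1..table_3.
for i in (1, 2, 3):
    cur.execute(f"CREATE TABLE table_{i} (id INTEGER, cost REAL)")
    cur.execute(f"INSERT INTO table_{i} VALUES (1, {i * 10.0}), (2, 5.0)")

# Analogue of registerTempTable("table_all") after the unionAll calls:
# expose the three partitions as one logical table.
cur.execute("""
    CREATE VIEW table_all AS
    SELECT id, cost FROM table_1
    UNION ALL
    SELECT id, cost FROM table_2
    UNION ALL
    SELECT id, cost FROM table_3
""")

rows = cur.execute(
    "SELECT id, SUM(cost) FROM table_all WHERE id = 1 GROUP BY id"
).fetchall()
print(rows)  # [(1, 60.0)]
```

Once the partitions are behind one name, the query side no longer needs the hand-written union-all nesting at all.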
On Wed, Oct 15, 2014 at 2:51 AM, shuluster s...@turn.com wrote:
I have many tables with the same schema, partitioned by time. For
example, one id could appear in many of those tables, and I would like to
compute an aggregation over such ids. These tables are originally stored as
files on HDFS. Once a table's SchemaRDD is loaded, I call cacheTable on it.
Each table is around 30m - 100m of serialized data.
The SQL I composed looks like the following:
select id, sum(cost) as cost from (
    (select id, sum(cost) as cost from table_1
     where id = 1 group by id)
    union all
    (select id, sum(cost) as cost from table_2
     where id = 1 group by id)
    union all
    (select id, sum(cost) as cost from table_3
     where id = 1 group by id)
) as temp_table
group by id
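(The nested shape above can be sanity-checked outside Spark with Python's sqlite3, on toy data of my own invention; sqlite3 does not accept parentheses around the union-all arms, so they are dropped here, but the logic is the same: per-partition pre-aggregation followed by a final group-by over the union.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy partitions: each holds two rows for id 1 and one row for id 2.
for i in (1, 2, 3):
    cur.execute(f"CREATE TABLE table_{i} (id INTEGER, cost REAL)")
    cur.execute(f"INSERT INTO table_{i} VALUES (1, 1.0), (1, 2.0), (2, 9.0)")

# Per-partition SUM (3.0 each for id 1), then a final GROUP BY over the union.
nested = cur.execute("""
    SELECT id, SUM(cost) AS cost FROM (
        SELECT id, SUM(cost) AS cost FROM table_1 WHERE id = 1 GROUP BY id
        UNION ALL
        SELECT id, SUM(cost) AS cost FROM table_2 WHERE id = 1 GROUP BY id
        UNION ALL
        SELECT id, SUM(cost) AS cost FROM table_3 WHERE id = 1 GROUP BY id
    ) AS temp_table
    GROUP BY id
""").fetchall()
print(nested)  # [(1, 9.0)]
```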
The call to sparkSqlContext.sql() takes a long time to return a SchemaRDD,
but executing collect() on that RDD is not too slow.
Is there something I am doing wrong here? Any tips on how to debug?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-union-all-is-slow-tp16407.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org