Hi, unfortunately it is not so straightforward.
xxx_parquet.db is the folder of a managed database created by Hive/Impala, so every sub-element in it is a Hive/Impala table. Each table is a folder in HDFS, each table has a different schema, and each table folder contains one or more parquet files.

That means xxxxxx001_suffix and xxxxxx002_suffix are folders, and they contain parquet files like:

xxxxxx001_suffix/parquet_file1_with_schema1
xxxxxx002_suffix/parquet_file1_with_schema2
xxxxxx002_suffix/parquet_file2_with_schema2

It seems only union can do this job.

Nonetheless, thank you very much. Maybe the only reason is that Spark is eating up too much memory...

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SQL-on-160-G-parquet-file-snappy-compressed-made-by-cloudera-impala-23-core-and-60G-mem-d-tp10254p10335.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
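To make the layout described in the message concrete, here is a small plain-Python sketch (no Spark involved) that groups parquet file paths by their table folder. The function name group_by_table is hypothetical, and the paths are just the placeholder names from the message, not real files.

```python
from collections import defaultdict
import posixpath


def group_by_table(file_paths):
    """Map each table folder name to the sorted list of files inside it."""
    tables = defaultdict(list)
    for path in file_paths:
        # The parent directory of a parquet file is its table folder.
        table_dir, file_name = posixpath.split(path)
        tables[posixpath.basename(table_dir)].append(file_name)
    return {name: sorted(files) for name, files in tables.items()}


# Placeholder paths copied from the layout described in the message.
paths = [
    "xxx_parquet.db/xxxxxx001_suffix/parquet_file1_with_schema1",
    "xxx_parquet.db/xxxxxx002_suffix/parquet_file1_with_schema2",
    "xxx_parquet.db/xxxxxx002_suffix/parquet_file2_with_schema2",
]

layout = group_by_table(paths)
for table, files in sorted(layout.items()):
    print(table, files)
```

Each key is one table with its own schema, so files from different keys cannot simply be read together as a single table; each group would have to be loaded on its own first.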