Also, do you mean two partitions or two partition columns? If there are many partitions it can be much slower. In Spark 1.5 I'd consider setting spark.sql.hive.metastorePartitionPruning=true if you have predicates over the partition columns.
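To make that concrete, here is a minimal sketch of how the setting could be applied in a Spark 1.5 shell session before querying the table. The table name and the `dt` filter value are taken from the thread for illustration; the timing pattern mirrors the one in the original question.

```scala
// Enable metastore-side partition pruning (Spark 1.5+ setting) so that only
// partitions matching predicates on the partition columns are fetched from
// the Hive metastore, instead of listing all partitions up front.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

// Time DataFrame creation with a predicate on a partition column.
// ('2015-09-03' is an illustrative value for the dt partition column.)
val start = scala.compat.Platform.currentTime
val logs = sqlContext.sql("select * from temp.log where dt = '2015-09-03'")
val execution = scala.compat.Platform.currentTime - start
```

Note this only helps when the query actually contains predicates over the partition columns; an unfiltered `select * from temp.log` still has to enumerate every partition.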
On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com> wrote:

> What format is this table? For Parquet and other optimized formats we
> cache a bunch of file metadata on first access to make interactive
> queries faster.
>
> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>
>> Hello,
>>
>> I am using SparkSQL to query some Hive tables. Most of the time, when I
>> create a DataFrame using the sqlContext.sql("select * from table")
>> command, DataFrame creation takes less than 0.5 seconds.
>> But I have this one table with which it takes almost 12 seconds!
>>
>> scala> val start = scala.compat.Platform.currentTime; val logs =
>> sqlContext.sql("select * from temp.log"); val execution =
>> scala.compat.Platform.currentTime - start
>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>> temp.log
>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>> start: Long = 1441336022731
>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>> log_time: string, tag: string, dt: string, test_id: int]
>> execution: Long = *11567*
>>
>> This table has 3.6 B rows and 2 partitions (on the dt and test_id
>> columns).
>> I have created DataFrames on even larger tables and do not see such a
>> delay.
>> So my questions are:
>> - What can impact DataFrame creation time?
>> - Is it related to the table partitions?
>>
>> Thanks much for your help!
>>
>> Isabelle