Hi Michael,

We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5, since we are on Cloudera) and to Parquet-formatted tables. I turned on spark.sql.hive.metastorePartitionPruning=true, but DataFrame creation still takes a long time. Is there any other configuration to consider?
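(For reference, a minimal sketch of how I am setting the flag, assuming the spark-shell's pre-built sqlContext, which is a HiveContext on our build:)

    // Turn on metastore-side partition pruning; it should only help when
    // queries carry predicates on the partition columns (dt, test_id here).
    sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

    // Equivalent at launch time:
    //   spark-shell --conf spark.sql.hive.metastorePartitionPruning=true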
Thanks a lot for your help,
Isabelle

On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote:

> If you run sqlContext.table("...").registerTempTable("...") that temp
> table will cache the lookup of partitions.
>
> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>
>> Hi Michael,
>>
>> Thanks a lot for your reply.
>>
>> This table is stored as a text file with tab-delimited columns.
>>
>> You are correct, the problem is that my table has too many partitions
>> (1825 in total). Since I am on Spark 1.4, I think I am hitting bug
>> SPARK-6984 <https://issues.apache.org/jira/browse/SPARK-6984>.
>>
>> Not sure when my company can move to 1.5. Would you know of a
>> workaround for this bug? If I cannot find one, we will have to change
>> our schema design to reduce the number of partitions.
>>
>> Thanks,
>>
>> Isabelle
>>
>> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> Also, do you mean two partitions or two partition columns? If there
>>> are many partitions it can be much slower. In Spark 1.5 I'd consider
>>> setting spark.sql.hive.metastorePartitionPruning=true if you have
>>> predicates over the partition columns.
>>>
>>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>>> What format is this table? For Parquet and other optimized formats
>>>> we cache a bunch of file metadata on first access to make
>>>> interactive queries faster.
>>>>
>>>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am using Spark SQL to query some Hive tables. Most of the time,
>>>>> when I create a DataFrame with the sqlContext.sql("select * from
>>>>> table") command, DataFrame creation takes less than 0.5 seconds.
>>>>> But I have one table for which it takes almost 12 seconds!
>>>>>
>>>>> scala> val start = scala.compat.Platform.currentTime; val logs =
>>>>> sqlContext.sql("select * from temp.log"); val execution =
>>>>> scala.compat.Platform.currentTime - start
>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from temp.log
>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>>>> start: Long = 1441336022731
>>>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>>>>> log_time: string, tag: string, dt: string, test_id: int]
>>>>> execution: Long = 11567
>>>>>
>>>>> This table has 3.6 B rows and 2 partitions (on the dt and test_id
>>>>> columns). I have created DataFrames on even larger tables and do
>>>>> not see such a delay. So my questions are:
>>>>> - What can impact DataFrame creation time?
>>>>> - Is it related to the table partitions?
>>>>>
>>>>> Thanks much for your help!
>>>>>
>>>>> Isabelle
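(A minimal sketch of the temp-table workaround Michael suggests above, assuming the spark-shell's pre-built sqlContext and the temp.log table from this thread; the temp table name log_tmp and the dt value are hypothetical:)

    // Resolve the table once: the expensive metastore partition lookup
    // happens here and is cached with the temp table registration.
    sqlContext.table("temp.log").registerTempTable("log_tmp")

    // Later queries go through the temp table and skip the repeated
    // partition lookup. The dt value is just an illustrative filter.
    val logs = sqlContext.sql("select * from log_tmp where dt = '2015-09-03'")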