Also, do you mean two partitions or two partition columns?  If there are
many partitions it can be much slower.  In Spark 1.5 I'd consider
setting spark.sql.hive.metastorePartitionPruning=true
if you have predicates over the partition columns.
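For reference, a minimal sketch of what that looks like in a Spark 1.5 shell (this assumes the usual spark-shell `sqlContext` backed by a HiveContext; the date and test_id values are made up for illustration):

```scala
// Enable metastore-level partition pruning (off by default in Spark 1.5)
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

// With predicates on the partition columns (dt, test_id here), Spark can
// ask the metastore for only the matching partitions instead of listing
// all of them up front:
val pruned = sqlContext.sql(
  "select * from temp.log where dt = '2015-09-01' and test_id = 1")
```

The same flag can also be passed on the command line with --conf when starting the shell.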

On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> What format is this table?  For parquet and other optimized formats we
> cache a bunch of file metadata on first access to make interactive queries
> faster.
>
> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>
>> Hello,
>>
>> I am using SparkSQL to query some Hive tables. Most of the time, when I
>> create a DataFrame using sqlContext.sql("select * from table") command,
>> DataFrame creation is less than 0.5 second.
>> But I have this one table with which it takes almost 12 seconds!
>>
>> scala>  val start = scala.compat.Platform.currentTime; val logs =
>> sqlContext.sql("select * from temp.log"); val execution =
>> scala.compat.Platform.currentTime - start
>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>> temp.log
>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>> start: Long = 1441336022731
>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>> log_time: string, tag: string, dt: string, test_id: int]
>> execution: Long = *11567*
>>
>> This table has 3.6 B rows, and 2 partitions (on dt and test_id columns).
>> I have created DataFrames on even larger tables and do not see such
>> delay.
>> So my questions are:
>> - What can impact DataFrame creation time?
>> - Is it related to the table partitions?
>>
>>
>> Thanks much for your help!
>>
>> Isabelle
>>
>
>