Hi Michael,

We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5, since we are on Cloudera) and to Parquet-formatted tables. I turned on spark.sql.hive.metastorePartitionPruning=true, but DataFrame creation still takes a long time. Is there any other configuration to consider?
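(For reference, a minimal sketch of how I am setting the flag, assuming the spark-shell's pre-built sqlContext, which is a HiveContext on our build:)

    // Turn on metastore-side partition pruning; it should only help when
    // queries carry predicates on the partition columns (dt, test_id here).
    sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

    // Equivalent at launch time:
    //   spark-shell --conf spark.sql.hive.metastorePartitionPruning=true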
Thanks a lot for your help,
Isabelle

On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote:

> If you run sqlContext.table("...").registerTempTable("...") that temp
> table will cache the lookup of partitions.
>
> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>
>> Hi Michael,
>>
>> Thanks a lot for your reply.
>>
>> This table is stored as a text file with tab-delimited columns.
>>
>> You are correct, the problem is that my table has too many partitions
>> (1825 in total). Since I am on Spark 1.4, I think I am hitting bug
>> SPARK-6984 <https://issues.apache.org/jira/browse/SPARK-6984>.
>>
>> Not sure when my company can move to 1.5. Would you know of a
>> workaround for this bug? If I cannot find one, we will have to change
>> our schema design to reduce the number of partitions.
>>
>> Thanks,
>>
>> Isabelle
>>
>> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> Also, do you mean two partitions or two partition columns? If there
>>> are many partitions it can be much slower. In Spark 1.5 I'd consider
>>> setting spark.sql.hive.metastorePartitionPruning=true if you have
>>> predicates over the partition columns.
>>>
>>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>>> What format is this table? For Parquet and other optimized formats
>>>> we cache a bunch of file metadata on first access to make
>>>> interactive queries faster.
>>>>
>>>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am using Spark SQL to query some Hive tables. Most of the time,
>>>>> when I create a DataFrame with the sqlContext.sql("select * from
>>>>> table") command, DataFrame creation takes less than 0.5 seconds.
>>>>> But I have one table for which it takes almost 12 seconds!
>>>>>
>>>>> scala> val start = scala.compat.Platform.currentTime; val logs =
>>>>> sqlContext.sql("select * from temp.log"); val execution =
>>>>> scala.compat.Platform.currentTime - start
>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from temp.log
>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>>>> start: Long = 1441336022731
>>>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>>>>> log_time: string, tag: string, dt: string, test_id: int]
>>>>> execution: Long = 11567
>>>>>
>>>>> This table has 3.6 B rows and 2 partitions (on the dt and test_id
>>>>> columns). I have created DataFrames on even larger tables and do
>>>>> not see such a delay. So my questions are:
>>>>> - What can impact DataFrame creation time?
>>>>> - Is it related to the table partitions?
>>>>>
>>>>> Thanks much for your help!
>>>>>
>>>>> Isabelle
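(A minimal sketch of the temp-table workaround Michael suggests above, assuming the spark-shell's pre-built sqlContext and the temp.log table from this thread; the temp table name log_tmp and the dt value are hypothetical:)

    // Resolve the table once: the expensive metastore partition lookup
    // happens here and is cached with the temp table registration.
    sqlContext.table("temp.log").registerTempTable("log_tmp")

    // Later queries go through the temp table and skip the repeated
    // partition lookup. The dt value is just an illustrative filter.
    val logs = sqlContext.sql("select * from log_tmp where dt = '2015-09-03'")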