Re: Query Optimization

Padma Penumarthy Mon, 21 Aug 2017 13:28:46 -0700

That is definitely not the design strategy. Also, I don’t think what you are
seeing is the same as DRILL-3846. In DRILL-3846, the difference between running
with and without metadata caching is a factor of 2-4, whereas what you see is
different by orders of magnitude.
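
As a rough check of that gap, the figures reported earlier in this thread can
be compared directly. This is only a sketch of the arithmetic; the variable
and function names are made up for illustration, and the numbers are the ones
quoted below from the query profile.

```python
# Figures quoted in this thread, before vs. after REFRESH TABLE METADATA.
def to_seconds(minutes, seconds):
    """Convert a 'M min S sec' profile value to seconds."""
    return minutes * 60 + seconds

execution_before = to_seconds(0, 0.415)    # EXECUTION before refresh
execution_after = to_seconds(17, 27.657)   # EXECUTION after refresh
duration_before = to_seconds(1, 0.233)     # DURATION before refresh
duration_after = to_seconds(18, 0.744)     # DURATION after refresh
files_before = 4                           # files read before refresh
files_after = 102290                       # files read after refresh

execution_ratio = execution_after / execution_before
duration_ratio = duration_after / duration_before
files_ratio = files_after / files_before

print(f"execution time: {execution_ratio:.0f}x slower")
print(f"total duration: {duration_ratio:.1f}x slower")
print(f"files scanned:  {files_ratio:.1f}x more")
```

Execution comes out roughly 2,500 times slower and the scan touches roughly
25,000 times more files, which is far beyond the 2-4 times seen in DRILL-3846.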


You should file a JIRA and include details that will help us reproduce the
problem. Please add as much information as possible.
A sample dataset, how you are creating the table (i.e. partition info),
logs, and query profiles will be very helpful.
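
It may also help to note in the JIRA whether the metadata cache file
(.drill.parquet.metadata, mentioned earlier in this thread) parses cleanly and
how many file entries it lists. The following is a minimal sketch, assuming
the cache file is JSON with a top-level "files" list and a "metadata_version"
field; that layout and the function names are assumptions for illustration,
so adjust them to what your file actually contains.

```python
import json
from pathlib import Path


def summarize_cache_text(text):
    """Parse metadata-cache JSON text and return a few sanity-check facts.

    Assumes (not confirmed in this thread) a top-level JSON object with a
    "files" list and optional "metadata_version" and "directories" fields.
    """
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        # A truncated or corrupted file ends up here.
        return {"valid_json": False, "error": str(exc)}
    if not isinstance(doc, dict):
        return {"valid_json": True, "error": "top-level value is not an object"}
    files = doc.get("files")
    directories = doc.get("directories")
    return {
        "valid_json": True,
        "metadata_version": doc.get("metadata_version"),
        "file_entries": len(files) if isinstance(files, list) else None,
        "directory_entries": len(directories) if isinstance(directories, list) else None,
    }


def summarize_cache_file(path):
    """Read the cache file at `path` (a hypothetical local copy) and summarize it."""
    return summarize_cache_text(Path(path).read_text(encoding="utf-8"))
```

For example, calling summarize_cache_file on a local copy of the cache file
and comparing file_entries with the number of parquet files in the table
directory would show whether the cache looks complete.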

Thanks,
Padma


> On Aug 20, 2017, at 7:03 PM, Divya Gehlot <[email protected]> wrote:
> 
> Hi,
> Yes, as Rahul mentioned, I am running into a bug
> https://issues.apache.org/jira/browse/DRILL-3846 ?
> 
> As asked, usedMetadataFile is true once I run the metadata cache query.
> Is there any tentative fix or workaround for the bug?
> 
> Now my question is: if the metadata cache is enabled, does Drill read all
> the files instead of only the intended ones?
> Is that Drill's design strategy?
> 
> Thanks,
> divya
> 
> On 18 August 2017 at 12:13, Padma Penumarthy <[email protected]> wrote:
> 
>> It is supposed to work like you expected. Maybe you are running into a
>> bug.
>> Why is it reading all files after metadata refresh? That is difficult to
>> answer without looking at the logs and query profile. If you look at the
>> query profile, you can check what the usedMetadataFile flag says for the
>> scan.
>> Also, if you created so many files, your metadata cache file could be
>> big. Maybe you can manually sanity check that it looks OK and is not
>> corrupted (look for the .drill.parquet.metadata file in the root
>> directory)?
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <[email protected]> wrote:
>> 
>> Please share your SQL query and the query plan.
>> 
>> To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;
>> 
>> 
>> Thanks,
>> 
>> Khurram
>> 
>> ________________________________
>> From: Divya Gehlot <[email protected]>
>> Sent: Friday, August 18, 2017 7:15:18 AM
>> To: [email protected]
>> Subject: Re: Query Optimization
>> 
>> Hi,
>> Yes, it's the same query; I just ran the metadata refresh command.
>> My understanding is that the metadata refresh command saves reading the
>> metadata. How about column values... Why is it reading all the files after
>> the metadata refresh?
>> Partitioning helps to retrieve data faster.
>> In Hive, when you mention the partition column in the WHERE condition,
>> it reads only those partitions, which improves query performance.
>> In my query too, the WHERE condition has the partitioning column, so it
>> should read only those partitioned files, right?
>> Why is it taking more time?
>> Does Drill work in a different way compared to Hive?
>> 
>> 
>> Thanks,
>> Divya
>> 
>> On 18 August 2017 at 07:37, Padma Penumarthy <[email protected]> wrote:
>> 
>> It might read all those files if some new data gets added after running
>> refresh metadata cache.
>> If everything is the same before and after the metadata refresh, i.e. no
>> new data was added and the query is exactly the same, then it should not
>> do that.
>> Also, check if you can partition in a way that will not create so many
>> files in the first place.
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <[email protected]> wrote:
>> 
>> Hi,
>> Another observation:
>> My query had WHERE conditions based on the partition values.
>> 
>> Total number of parquet files in directory - 102290
>> Before metadata refresh - it's reading only 4 files
>> After metadata refresh - it's reading 102290 files
>> 
>> 
>> Is this how refresh metadata works? I mean, does it scan each and every
>> file to get the results?
>> 
>> I don't have access to the logs now.
>> 
>> Thanks,
>> Divya
>> 
>> On 17 August 2017 at 13:48, Divya Gehlot <[email protected]> wrote:
>> 
>> Hi,
>> Another observation:
>> My query had WHERE conditions based on the partition values.
>> Before metadata refresh - it's reading only 4 files
>> After metadata refresh - it's reading 102290 files
>> 
>> Thanks,
>> Divya
>> 
>> On 17 August 2017 at 13:03, Padma Penumarthy <[email protected]> wrote:
>> 
>> Does your query have a partition filter?
>> Execution time has most likely increased because partition pruning is
>> not happening.
>> Did you get a chance to look at the logs? That might give some clues.
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <[email protected]> wrote:
>> 
>> Hi,
>> Even I am surprised.
>> I am running Drill version 1.10 on MapR enterprise version.
>> *Query* - Selecting all the columns from a partitioned parquet table
>> 
>> I observed a few things from the query statistics:
>> 
>> Value       Before Refresh Metadata   After Refresh Metadata
>> Fragments   1                         13
>> DURATION    01 min 0.233 sec          18 min 0.744 sec
>> PLANNING    59.818 sec                33.087 sec
>> QUEUED      Not Available             Not Available
>> EXECUTION   0.415 sec                 17 min 27.657 sec
>> 
>> The planning time is reduced by approx 45% (59.818 sec to 33.087 sec),
>> but the execution time increased drastically.
>> I would like to understand why the execution time increases after the
>> metadata refresh.
>> 
>> 
>> Appreciate the help.
>> 
>> Thanks,
>> divya
>> 
>> 
>> On 17 August 2017 at 11:54, Padma Penumarthy <[email protected]> wrote:
>> 
>> Refresh table metadata should help reduce query planning time.
>> It is odd that it went up after you did refresh table metadata.
>> Did you check the logs to see what is happening? You might have to
>> turn on some debug logging if needed.
>> BTW, what version of Drill are you running?
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <[email protected]> wrote:
>> 
>> Hi,
>> I have data in parquet file format.
>> When I run the query and look at the execution plan, I see the
>> following statistics:
>> 
>> TOTAL FRAGMENTS: 1
>> DURATION: 01 min 0.233 sec
>> PLANNING: 59.818 sec
>> QUEUED: Not Available
>> EXECUTION: 0.415 sec
>> 
>> 
>> 
>> As it's a parquet file format, I tried enabling metadata refresh
>> and ran the command below:
>> REFRESH TABLE METADATA <path to table> ;
>> I then ran the same query again on the same table with the same data (no
>> changes in data) and got the statistics shown below:
>> 
>> TOTAL FRAGMENTS: 13
>> DURATION: 14 min 14.604 sec
>> PLANNING: 33.087 sec
>> QUEUED: Not Available
>> EXECUTION: Not Available
>> 
>> 
>> The query is still running.
>> 
>> Can somebody help me understand why the query is taking so long once I
>> issue the refresh metadata command?
>> 
>> Appreciate the help!
>> 
>> Thanks,
>> Divya
>> 