Re: Query Optimization

Divya Gehlot Wed, 23 Aug 2017 19:53:59 -0700

Hi,
Here is the JIRA link for the issue: DRILL-5739
<https://issues.apache.org/jira/browse/DRILL-5739>


Thanks,
Divya

On 22 August 2017 at 04:28, Padma Penumarthy <[email protected]> wrote:

> That is definitely not the design strategy. Also, I don't think what you
> are seeing is the same as DRILL-3846. The difference between running with
> and without metadata caching is a factor of 2-4 in DRILL-3846, whereas
> what you are seeing differs by orders of magnitude.
>
> You should file a JIRA and include details that will help us reproduce the
> problem.
> Please add as much information as possible.
> A sample dataset, how you are creating the table (i.e. partition info),
> logs, and query profiles will be very helpful.
>
> Thanks,
> Padma
>
>
> > On Aug 20, 2017, at 7:03 PM, Divya Gehlot <[email protected]> wrote:
> >
> > Hi,
> > Yes, as Rahul mentioned, I seem to be running into this bug:
> > https://issues.apache.org/jira/browse/DRILL-3846
> >
> > As asked, usedMetadataFile is true after I run the metadata cache
> > query.
> > Is there any tentative fix or workaround for the bug?
> >
> > My question now is: if the metadata cache is enabled, does Drill read
> > all the files instead of only the intended ones?
> > Is that Drill's design strategy?
> >
> > Thanks,
> > divya
> >
> > On 18 August 2017 at 12:13, Padma Penumarthy <[email protected]> wrote:
> >
> >> It is supposed to work the way you expected. Maybe you are running into
> >> a bug.
> >> Why is it reading all files after the metadata refresh? That is difficult
> >> to answer without looking at the logs and the query profile. If you look
> >> at the query profile, you can check what the usedMetadataFile flag says
> >> for the scan.
> >> Also, if you created so many files, your metadata cache file could be
> >> big. Maybe you can manually sanity check that it looks OK and is not
> >> corrupted (look for the .drill.parquet.metadata file in the root
> >> directory)?
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <[email protected]> wrote:
> >>
> >> Please share your SQL query and the query plan.
> >>
> >> To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;
> >>
> >>
> >> Thanks,
> >>
> >> Khurram
> >>
> >> ________________________________
> >> From: Divya Gehlot <[email protected]>
> >> Sent: Friday, August 18, 2017 7:15:18 AM
> >> To: [email protected]
> >> Subject: Re: Query Optimization
> >>
> >> Hi,
> >> Yes, it is the same query; I just ran the metadata refresh command.
> >> My understanding is that the metadata refresh command saves reading the
> >> metadata. How about column values? Why is it reading all the files after
> >> the metadata refresh?
> >> Partitioning helps to retrieve data faster.
> >> In Hive, when you mention the partition column in the WHERE condition,
> >> it reads only the matching partitions, which improves query performance.
> >> My query also has the partitioning column in its WHERE condition, so it
> >> should read only those partitioned files, right?
> >> Why is it taking more time?
> >> Does Drill work differently from Hive?
> >>
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 18 August 2017 at 07:37, Padma Penumarthy <[email protected]> wrote:
> >>
> >> It might read all those files if some new data gets added after running
> >> refresh metadata cache.
> >> If everything is the same before and after the metadata refresh, i.e. no
> >> new data was added and the query is exactly the same, then it should not
> >> do that.
> >> Also, check if you can partition in a way that will not create so many
> >> files in the first place.
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <[email protected]> wrote:
> >>
> >> Hi,
> >> Another observation:
> >> My query has WHERE conditions based on the partition values.
> >>
> >> Total number of parquet files in directory - 102290
> >> Before metadata refresh - it reads only 4 files
> >> After metadata refresh - it reads 102290 files
> >>
> >>
> >> Is this how refresh metadata works? I mean, does it scan each and every
> >> file to get the results?
> >>
> >> I don't have access to the logs right now.
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 17 August 2017 at 13:48, Divya Gehlot <[email protected]> wrote:
> >>
> >> Hi,
> >> Another observation:
> >> My query has WHERE conditions based on the partition values.
> >> Before metadata refresh - it reads only 4 files
> >> After metadata refresh - it reads 102290 files
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 17 August 2017 at 13:03, Padma Penumarthy <[email protected]> wrote:
> >>
> >> Does your query have a partition filter?
> >> Execution time has most likely increased because partition pruning is
> >> not happening.
> >> Did you get a chance to look at the logs? That might give some clues.
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <[email protected]> wrote:
> >>
> >> Hi,
> >> Even I am surprised.
> >> I am running Drill version 1.10 on MapR enterprise version.
> >> *Query* - Selecting all the columns of a partitioned Parquet table
> >>
> >> I observed a few things from the query statistics:
> >>
> >> Value      Before Refresh Metadata   After Refresh Metadata
> >> Fragments  1                         13
> >> DURATION   01 min 0.233 sec          18 min 0.744 sec
> >> PLANNING   59.818 sec                33.087 sec
> >> QUEUED     Not Available             Not Available
> >> EXECUTION  0.415 sec                 17 min 27.657 sec
> >>
> >> The planning time is reduced by approx 45% (59.818 sec to 33.087 sec),
> >> but the execution time increased drastically.
> >> I would like to understand why the execution time increases after the
> >> metadata refresh.
> >>
> >>
> >> Appreciate the help.
> >>
> >> Thanks,
> >> divya
> >>
> >>
> >> On 17 August 2017 at 11:54, Padma Penumarthy <[email protected]> wrote:
> >>
> >> Refresh table metadata should help reduce query planning time.
> >> It is odd that it went up after you did refresh table metadata.
> >> Did you check the logs to see what is happening? You might have to
> >> turn on debug logging if needed.
> >> BTW, what version of Drill are you running?
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <[email protected]> wrote:
> >>
> >> Hi,
> >> I have data in Parquet file format.
> >> When I run the query and look at the execution plan, I see the
> >> following statistics:
> >>
> >> TOTAL FRAGMENTS: 1
> >> DURATION: 01 min 0.233 sec
> >> PLANNING: 59.818 sec
> >> QUEUED: Not Available
> >> EXECUTION: 0.415 sec
> >>
> >>
> >>
> >> As it is Parquet file format, I tried enabling the metadata cache
> >> by running the command below:
> >> REFRESH TABLE METADATA <path to table> ;
> >> I then ran the same query again on the same table with the same data
> >> (no changes in data) and got the statistics shown below:
> >>
> >> TOTAL FRAGMENTS: 13
> >> DURATION: 14 min 14.604 sec
> >> PLANNING: 33.087 sec
> >> QUEUED: Not Available
> >> EXECUTION: Not Available
> >>
> >>
> >> The query is still running.
> >>
> >> Can somebody help me understand why the query is taking so long once I
> >> issue the refresh metadata command?
> >>
> >> Appreciate the help!
> >>
> >> Thanks,
> >> Divya
> >>
>
>
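To put numbers on the statistics reported in this thread, the profile durations can be converted to seconds and compared. This is a minimal Python sketch, not part of the original exchange: the helper name `to_seconds` and the dictionary layout are illustrative assumptions, and the values are copied from the query statistics quoted above.

```python
import re

def to_seconds(text):
    """Convert a duration such as '18 min 0.744 sec' or '59.818 sec' to seconds."""
    match = re.fullmatch(r"(?:(\d+)\s*min\s+)?(\d+(?:\.\d+)?)\s*sec", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))

# Values copied from the statistics quoted in the thread.
before = {"DURATION": "01 min 0.233 sec", "PLANNING": "59.818 sec", "EXECUTION": "0.415 sec"}
after = {"DURATION": "18 min 0.744 sec", "PLANNING": "33.087 sec", "EXECUTION": "17 min 27.657 sec"}

for phase in before:
    b, a = to_seconds(before[phase]), to_seconds(after[phase])
    change = (a - b) / b * 100  # relative change in percent
    print(f"{phase:<10} {b:10.3f} s -> {a:10.3f} s ({change:+.1f}%)")
```

With these figures, planning time falls by roughly 45% while the total duration grows about 18-fold, which is why the saving in planning is swamped by the extra execution time.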

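The thread suggests capturing the plan with EXPLAIN PLAN FOR and checking the usedMetadataFile flag for the scan. The Python sketch below pulls the file count and that flag out of plan text so the before and after plans can be compared. It assumes the Parquet scan entry in the plan contains `numFiles=<n>` and `usedMetadataFile=<true|false>` attributes; the function name `scan_summary` and the sample string are hypothetical and heavily abbreviated, not real output from the cluster discussed here.

```python
import re

def scan_summary(plan_text):
    """Return (numFiles, usedMetadataFile) pairs for each scan found in plan text."""
    # Assumption: both attributes appear on the same line, numFiles first.
    pattern = re.compile(r"numFiles=(\d+).*?usedMetadataFile=(true|false)")
    return [(int(n), flag == "true") for n, flag in pattern.findall(plan_text)]

# Hypothetical, abbreviated plan fragment for illustration only.
sample_plan = "Scan(groupscan=[ParquetGroupScan [..., numFiles=102290, usedMetadataFile=true, ...]])"

for num_files, used_cache in scan_summary(sample_plan):
    print(f"files to scan: {num_files}, metadata cache used: {used_cache}")
```

If partition pruning is working, the file count in the plan should be close to the number of files in the selected partitions (4 in the case described above) rather than the full 102290.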