That is definitely not the design strategy. Also, I don’t think what you are seeing is same as DRILL-3846. The difference between with and without metadata caching is a factor of 2-4 times in DRILL-3846 where as what you see is huge order of magnitude different.
You should file a JIRA and include details that will help us reproduce the problem. Please add as much information as possible. A sample dataset, how you are creating the table (i.e. partition info), logs, query profiles will be very helpful. Thanks, Padma > On Aug 20, 2017, at 7:03 PM, Divya Gehlot <[email protected]> wrote: > > Hi , > Yes As Rahul mentioned I am running into a bug > https://issues.apache.org/jira/browse/DRILL-3846 ? > > As asked the usedMetadataFile is true once I run the Metadata cache query . > Any tentative or workaorund for the bug? > > Now my ask is if metadata cache is enabled the does Drill reads all the > files instead of intended ones ? > Is it Drill design strategy ? > > Thanks, > divya > > On 18 August 2017 at 12:13, Padma Penumarthy <[email protected]> wrote: > >> It is supposed to work like you expected. May be you are running into a >> bug. >> Why is it reading all files after metadata refresh ? That is difficult to >> answer without >> looking at the logs and query profile. If you look at the query profile, >> you can may >> be check what usedMetadataFile flag says for scan. >> Also, I am thinking if you created so many files, your metadata >> cache file could be big. May be you can manually sanity >> check if it looks ok (look for .drill.parquet.metadata file in the root >> directory) and not >> corrupted ? >> >> Thanks, >> Padma >> >> >> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <[email protected]<mailto:kfara >> [email protected]>> wrote: >> >> Please share your SQL query and the query plan. >> >> To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>; >> >> >> Thanks, >> >> Khurram >> >> ________________________________ >> From: Divya Gehlot <[email protected]<mailto:[email protected] >>>> >> Sent: Friday, August 18, 2017 7:15:18 AM >> To: [email protected]<mailto:[email protected]> >> Subject: Re: Query Optimization >> >> Hi , >> Yes its the same query its just the ran the metadata refresh command . >> My understanding is metadata refresh command saves reading the metadata. >> How about column values ... Why is it reading all the files after metedata >> refresh ? >> Partition helps to retrieve data faster . >> Like in hive how it happens when you mention the partition column in where >> condition >> it just goes and read and improves the query performace . >> In my query also I where conidtion has partioning column it should go and >> read those partitioned files right ? >> Why is it taking more time ? >> Does the Drill works in different way compare to hive ? >> >> >> Thanks, >> Divya >> >> On 18 August 2017 at 07:37, Padma Penumarthy <[email protected]<mailto: >> [email protected]>> wrote: >> >> It might read all those files if some new data gets added after running >> refresh metadata cache. >> If everything is same before and after metadata refresh i.e. no >> new data added and query is exactly the same, then it should not do that. >> Also, check if you can partition in a way that will not create so many >> files in the >> first place. >> >> Thanks, >> Padma >> >> >> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <[email protected]< >> mailto:[email protected]>> >> wrote: >> >> Hi, >> Another observation is >> My query had where conditions based on the partition values >> >> Total number of parquet files in directory - 102290 >> Before Metadata refresh - Its reading only 4 files >> After metadata refresh - its reading 102290 files >> >> >> This is how the refresh metadata works I mean it scans each and every >> files >> and get the results ? >> >> I dont have access to logs now . >> >> Thanks, >> Divya >> >> On 17 August 2017 at 13:48, Divya Gehlot <[email protected]<mailto: >> [email protected]>> >> wrote: >> >> Hi, >> Another observation is >> My query had where conditions based on the partition values >> Before Metadata refresh - Its reading only 4 files >> After metadata refresh - its reading 102290 files >> >> Thanks, >> Divya >> >> On 17 August 2017 at 13:03, Padma Penumarthy <[email protected]<mailto: >> [email protected]>> >> wrote: >> >> Does your query have partition filter ? >> Execution time is increased most likely because partition pruning is >> not >> happening. >> Did you get a chance to look at the logs ? That might give some clues. >> >> Thanks, >> Padma >> >> >> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <[email protected]<mailto: >> [email protected]>> >> wrote: >> >> Hi, >> Even I am surprised . >> I am running Drill version 1.10 on MapR enterprise version. >> *Query *- Selecting all the columns on partitioned parquet table >> >> I observed few things from Query statistics : >> >> Value >> >> Before Refresh Metadata >> >> After Refresh Metadata >> >> Fragments >> >> 1 >> >> 13 >> >> DURATION >> >> 01 min 0.233 sec >> >> 18 min 0.744 sec >> >> PLANNING >> >> 59.818 sec >> >> 33.087 sec >> >> QUEUED >> >> Not Available >> >> Not Available >> >> EXECUTION >> >> 0.415 sec >> >> 17 min 27.657 sec >> >> The planning time is being reduced by approx 60% but the execution >> time >> increased drastically. >> I would like to understand why the exceution time increases after the >> metadata refresh . >> >> >> Appreciate the help. >> >> Thanks, >> divya >> >> >> On 17 August 2017 at 11:54, Padma Penumarthy <[email protected]<mailto: >> [email protected]>> >> wrote: >> >> Refresh table metadata should help reduce query planning time. >> It is odd that it went up after you did refresh table metadata. >> Did you check the logs to see what is happening ? You might have to >> turn on some debugs if needed. >> BTW, what version of Drill are you running ? >> >> Thanks, >> Padma >> >> >> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <[email protected]<mailto: >> [email protected]>> >> wrote: >> >> Hi, >> I have data in parquet file format . >> when I run the query the data and see the execution plan I could see >> following >> statistics >> >> TOTAL FRAGMENTS: 1 >> DURATION: 01 min 0.233 sec >> PLANNING: 59.818 sec >> QUEUED: Not Available >> EXECUTION: 0.415 sec >> >> >> >> As its a paquet file format I tried enabling refresh meta data >> and run below command >> REFRESH TABLE METADATA <path to table> ; >> then run the same query again on the same table same data (no >> changes >> in >> data) and could find the statistics as show below : >> >> TOTAL FRAGMENTS: 13 >> DURATION: 14 min 14.604 sec >> PLANNING: 33.087 sec >> QUEUED: Not Available >> EXECUTION: Not Available >> >> >> The query is still running . >> >> Can somebody help me understand why the query taking so long once I >> issue >> the refresh metadata command. >> >> Aprreciate the help ! >> >> Thanks, >> Divya >> >> >> >> >> >> >> >> >>
