Hi,

Here is the link to the JIRA issue, DRILL-5739:
<https://issues.apache.org/jira/browse/DRILL-5739>

Thanks,
Divya

On 22 August 2017 at 04:28, Padma Penumarthy <[email protected]> wrote:

> That is definitely not the design strategy. Also, I don’t think what you
> are seeing is the same as DRILL-3846. The difference between with and
> without metadata caching is a factor of 2-4 times in DRILL-3846, whereas
> what you see is a huge order-of-magnitude difference.
>
> You should file a JIRA and include details that will help us reproduce the
> problem. Please add as much information as possible.
> A sample dataset, how you are creating the table (i.e. partition info),
> logs, and query profiles will be very helpful.
>
> Thanks,
> Padma
>
> > On Aug 20, 2017, at 7:03 PM, Divya Gehlot <[email protected]> wrote:
> >
> > Hi,
> > Yes, as Rahul mentioned, I am running into a bug:
> > https://issues.apache.org/jira/browse/DRILL-3846 ?
> >
> > As asked: usedMetadataFile is true once I run the metadata cache query.
> > Is there any tentative fix or workaround for the bug?
> >
> > Now my question is: if the metadata cache is enabled, does Drill read all
> > the files instead of only the intended ones?
> > Is that Drill's design strategy?
> >
> > Thanks,
> > divya
> >
> > On 18 August 2017 at 12:13, Padma Penumarthy <[email protected]> wrote:
> >
> >> It is supposed to work like you expected. Maybe you are running into a
> >> bug.
> >> Why is it reading all files after metadata refresh? That is difficult to
> >> answer without looking at the logs and query profile. If you look at the
> >> query profile, you can check what the usedMetadataFile flag says for the
> >> scan.
> >> Also, I am thinking that if you created so many files, your metadata
> >> cache file could be big. Maybe you can manually sanity check that it
> >> looks OK (look for the .drill.parquet.metadata file in the root
> >> directory) and is not corrupted?
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <[email protected]> wrote:
> >>
> >> Please share your SQL query and the query plan.
> >>
> >> To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;
> >>
> >> Thanks,
> >> Khurram
> >>
> >> ________________________________
> >> From: Divya Gehlot <[email protected]>
> >> Sent: Friday, August 18, 2017 7:15:18 AM
> >> To: [email protected]
> >> Subject: Re: Query Optimization
> >>
> >> Hi,
> >> Yes, it is the same query; I just ran the metadata refresh command.
> >> My understanding is that the metadata refresh command saves reading the
> >> metadata.
> >> How about column values? Why is it reading all the files after the
> >> metadata refresh?
> >> Partitioning helps to retrieve data faster.
> >> In Hive, when you mention the partition column in the where condition,
> >> it reads only those partitions, which improves query performance.
> >> In my query, too, the where condition has the partitioning column, so it
> >> should read only those partitioned files, right?
> >> Why is it taking more time?
> >> Does Drill work differently compared to Hive?
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 18 August 2017 at 07:37, Padma Penumarthy <[email protected]> wrote:
> >>
> >> It might read all those files if some new data gets added after running
> >> refresh metadata cache.
> >> If everything is the same before and after the metadata refresh, i.e. no
> >> new data added and the query is exactly the same, then it should not do
> >> that.
> >> Also, check if you can partition in a way that will not create so many
> >> files in the first place.
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <[email protected]> wrote:
> >>
> >> Hi,
> >> Another observation:
> >> My query had where conditions based on the partition values.
> >>
> >> Total number of parquet files in directory - 102290
> >> Before metadata refresh - it is reading only 4 files
> >> After metadata refresh - it is reading 102290 files
> >>
> >> Is this how refresh metadata works? I mean, does it scan each and every
> >> file to get the results?
> >>
> >> I don't have access to the logs now.
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 17 August 2017 at 13:48, Divya Gehlot <[email protected]> wrote:
> >>
> >> Hi,
> >> Another observation:
> >> My query had where conditions based on the partition values.
> >> Before metadata refresh - it is reading only 4 files
> >> After metadata refresh - it is reading 102290 files
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 17 August 2017 at 13:03, Padma Penumarthy <[email protected]> wrote:
> >>
> >> Does your query have a partition filter?
> >> Execution time increased most likely because partition pruning is not
> >> happening.
> >> Did you get a chance to look at the logs? That might give some clues.
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <[email protected]> wrote:
> >>
> >> Hi,
> >> Even I am surprised.
> >> I am running Drill version 1.10 on MapR enterprise version.
> >> *Query* - Selecting all the columns on a partitioned parquet table
> >>
> >> I observed a few things from the query statistics:
> >>
> >> Value      | Before Refresh Metadata | After Refresh Metadata
> >> -----------+-------------------------+-----------------------
> >> Fragments  | 1                       | 13
> >> DURATION   | 01 min 0.233 sec        | 18 min 0.744 sec
> >> PLANNING   | 59.818 sec              | 33.087 sec
> >> QUEUED     | Not Available           | Not Available
> >> EXECUTION  | 0.415 sec               | 17 min 27.657 sec
> >>
> >> The planning time is reduced by approx 45% (from 59.818 sec to
> >> 33.087 sec), but the execution time increased drastically.
> >> I would like to understand why the execution time increases after the
> >> metadata refresh.
> >>
> >> Appreciate the help.
> >>
> >> Thanks,
> >> divya
> >>
> >>
> >> On 17 August 2017 at 11:54, Padma Penumarthy <[email protected]> wrote:
> >>
> >> Refresh table metadata should help reduce query planning time.
> >> It is odd that it went up after you did refresh table metadata.
> >> Did you check the logs to see what is happening? You might have to
> >> turn on some debug logging if needed.
> >> BTW, what version of Drill are you running?
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <[email protected]> wrote:
> >>
> >> Hi,
> >> I have data in parquet file format.
> >> When I run the query on the data and look at the execution plan, I see
> >> the following statistics:
> >>
> >> TOTAL FRAGMENTS: 1
> >> DURATION: 01 min 0.233 sec
> >> PLANNING: 59.818 sec
> >> QUEUED: Not Available
> >> EXECUTION: 0.415 sec
> >>
> >> As it is a parquet file format, I tried enabling refresh metadata
> >> and ran the command below:
> >> REFRESH TABLE METADATA <path to table> ;
> >> I then ran the same query again on the same table with the same data
> >> (no changes in data) and found the statistics shown below:
> >>
> >> TOTAL FRAGMENTS: 13
> >> DURATION: 14 min 14.604 sec
> >> PLANNING: 33.087 sec
> >> QUEUED: Not Available
> >> EXECUTION: Not Available
> >>
> >> The query is still running.
> >>
> >> Can somebody help me understand why the query is taking so long once I
> >> issue the refresh metadata command?
> >>
> >> Appreciate the help!
> >>
> >> Thanks,
> >> Divya
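
For the manual sanity check of the metadata cache file suggested earlier in the thread, here is a minimal sketch in Python. It rests on assumptions that are not confirmed anywhere in this thread: that the cache file is plain JSON, that it has been copied to (or is readable from) a local path, and that it carries top-level "files" and "directories" lists. The function name summarize_metadata_cache is made up for illustration. It only confirms that the file parses and reports a few counts; it does not validate the contents against what Drill expects.

```python
import json
from pathlib import Path


def summarize_metadata_cache(cache_path):
    """Parse a metadata cache file and return a few basic counts.

    Assumption (not confirmed in the thread): the cache file is JSON with
    optional top-level "files" and "directories" lists.
    Raises json.JSONDecodeError if the file is not valid JSON, which is
    what a truncated or corrupted file would typically trigger.
    """
    path = Path(cache_path)
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {
        # On-disk size, to judge whether the cache has grown very large.
        "size_bytes": path.stat().st_size,
        # Missing keys count as zero entries rather than as an error.
        "file_entries": len(data.get("files", [])),
        "directory_entries": len(data.get("directories", [])),
    }
```

Calling summarize_metadata_cache on the cache file in the table's root directory and comparing "file_entries" with the number of parquet files actually in the table (102290 in the case reported above) gives a quick indication of whether the cache is complete; a JSON parse error would point to a corrupted file.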
