It might read all those files if new data was added after you ran REFRESH TABLE METADATA. If everything is the same before and after the metadata refresh, i.e. no new data was added and the query is exactly the same, then it should not do that. Also, check whether you can partition the data in a way that does not create so many files in the first place.
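One way to check whether pruning is still happening, without access to the logs, is to compare the query plan before and after the refresh. This is only a sketch: the `dfs` workspace, the `/path/to/table` location, and the `dir0 = '2017'` filter are placeholders I am assuming, not your actual table or partition values.

```sql
-- Hypothetical workspace and path; substitute your own table location.
-- Rebuild the Parquet metadata cache for the table.
REFRESH TABLE METADATA dfs.`/path/to/table`;

-- Print the plan for the same query instead of running it.
-- dir0 is Drill's implicit column for the first directory level under the
-- table root; the value '2017' is only an illustration.
EXPLAIN PLAN FOR
SELECT *
FROM dfs.`/path/to/table`
WHERE dir0 = '2017';
```

The Parquet scan entry in the plan text should report how many files it intends to read (a `numFiles` value, if I remember the plan format correctly). If that number is close to 102290 rather than 4, the partition filter is not being applied at planning time.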
Thanks,
Padma

> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <[email protected]> wrote:
>
> Hi,
> Another observation:
> My query had WHERE conditions based on the partition values.
>
> Total number of parquet files in directory - 102290
>> Before metadata refresh - it's reading only 4 files
>> After metadata refresh - it's reading 102290 files
>
> Is this how refresh metadata works? I mean, does it scan each and every
> file to get the results?
>
> I don't have access to the logs now.
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:48, Divya Gehlot <[email protected]> wrote:
>
>> Hi,
>> Another observation:
>> My query had WHERE conditions based on the partition values.
>> Before metadata refresh - it's reading only 4 files
>> After metadata refresh - it's reading 102290 files
>>
>> Thanks,
>> Divya
>>
>> On 17 August 2017 at 13:03, Padma Penumarthy <[email protected]> wrote:
>>
>>> Does your query have a partition filter?
>>> Execution time most likely increased because partition pruning is not
>>> happening.
>>> Did you get a chance to look at the logs? That might give some clues.
>>>
>>> Thanks,
>>> Padma
>>>
>>>
>>>> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <[email protected]> wrote:
>>>>
>>>> Hi,
>>>> Even I am surprised.
>>>> I am running Drill version 1.10 on MapR enterprise version.
>>>> *Query* - Selecting all the columns on a partitioned parquet table
>>>>
>>>> I observed a few things from the query statistics:
>>>>
>>>> Value       Before Refresh Metadata   After Refresh Metadata
>>>> Fragments   1                         13
>>>> DURATION    01 min 0.233 sec          18 min 0.744 sec
>>>> PLANNING    59.818 sec                33.087 sec
>>>> QUEUED      Not Available             Not Available
>>>> EXECUTION   0.415 sec                 17 min 27.657 sec
>>>>
>>>> The planning time is reduced by approx 45% but the execution time
>>>> increased drastically.
>>>> I would like to understand why the execution time increases after the
>>>> metadata refresh.
>>>>
>>>> Appreciate the help.
>>>>
>>>> Thanks,
>>>> divya
>>>>
>>>>
>>>> On 17 August 2017 at 11:54, Padma Penumarthy <[email protected]> wrote:
>>>>
>>>>> Refresh table metadata should help reduce query planning time.
>>>>> It is odd that it went up after you did refresh table metadata.
>>>>> Did you check the logs to see what is happening? You might have to
>>>>> turn on some debugs if needed.
>>>>> BTW, what version of Drill are you running?
>>>>>
>>>>> Thanks,
>>>>> Padma
>>>>>
>>>>>
>>>>>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I have data in parquet file format.
>>>>>> When I run the query on the data and look at the execution plan, I see
>>>>>> the following statistics:
>>>>>>
>>>>>>   TOTAL FRAGMENTS: 1
>>>>>>   DURATION: 01 min 0.233 sec
>>>>>>   PLANNING: 59.818 sec
>>>>>>   QUEUED: Not Available
>>>>>>   EXECUTION: 0.415 sec
>>>>>>
>>>>>> As it's a parquet file format, I tried enabling refresh metadata
>>>>>> and ran the command below:
>>>>>> REFRESH TABLE METADATA <path to table> ;
>>>>>> Then I ran the same query again on the same table and same data (no
>>>>>> changes in data) and found the statistics shown below:
>>>>>>
>>>>>>   TOTAL FRAGMENTS: 13
>>>>>>   DURATION: 14 min 14.604 sec
>>>>>>   PLANNING: 33.087 sec
>>>>>>   QUEUED: Not Available
>>>>>>   EXECUTION: Not Available
>>>>>>
>>>>>> The query is still running.
>>>>>>
>>>>>> Can somebody help me understand why the query is taking so long once I
>>>>>> issue the refresh metadata command?
>>>>>>
>>>>>> Appreciate the help!
>>>>>>
>>>>>> Thanks,
>>>>>> Divya
