Actually, I slightly misunderstood your 2nd question: so you made some
changes to a subfolder, then ran query A, which caused the cache to refresh;
then you ran another query B that also caused the cache to refresh; then
finally query C actually seemed to use the cache as-is.
Is my understanding now correct? Are queries A and B exactly the same, or
different?

On Tue, Jul 5, 2016 at 10:13 AM, rahul challapalli <[email protected]> wrote:

> John,
>
> Once you add/update data in one of your sub-folders, the immediate next
> query should update the metadata cache automatically, and all subsequent
> queries should fetch metadata from the cache. If this is not the case,
> it's a bug. Can you confirm your findings?
>
> - Rahul
>
> On Tue, Jul 5, 2016 at 9:53 AM, John Omernik <[email protected]> wrote:
>
> > Hey Abdel, thanks for the response. On questions 1 and 2, from what I
> > understood, nothing was changed, but then I had to make the third query
> > for it to take. I'll keep observing to determine what that may be.
> >
> > On 3, a logical place to implement, or start implementing, incremental
> > refresh may be allowing a directory refresh to automatically update the
> > parent's data without causing a cascading (update everything) refresh.
> > So if I have a structure like this:
> >
> > mytable
> > ...dir0=2016-06-06
> > .......dir1=23
> >
> > (basically table, days, hours)
> >
> > then if I update data in hour 23, it would update 2016-06-06 with the
> > new timestamps and update mytable with the new timestamps. The only
> > issue would be figuring out a way to take a lock. (Say you had multiple
> > loads happening; you want to ensure that one day's updates don't
> > clobber another day's.)
> >
> > Just a thought on that.
> >
> > Yep, the incremental issue would come into play here. Are there any
> > design docs or JIRAs on the incremental updates to metadata?
> >
> > Thanks for your reply. I am looking forward to other devs' thoughts on
> > your answer to 3 as well.
> >
> > Thanks!
> >
> > John
> >
> > On Tue, Jul 5, 2016 at 11:05 AM, Abdel Hakim Deneche
> > <[email protected]> wrote:
> >
> > > answers inline.
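John's parent-update proposal could be sketched roughly as below. This is a hypothetical illustration, not Drill code: the function name, the lock-file scheme, and the idea of propagating a leaf directory's mtime up to the table root are all invented for the sketch, under the assumption that parent caches could be marked current without a cascading full refresh.

```python
import os
import tempfile

def touch_parents(leaf_dir, table_root):
    """Propagate leaf_dir's mtime up to table_root, so parent metadata
    caches could be marked current without a full cascading refresh.
    Hypothetical sketch of the proposal in the mail, not Drill's code."""
    # A simple exclusive lock file, kept NEXT TO the table root (not inside
    # it, so creating/removing it never bumps the table's own mtimes).
    # O_EXCL makes os.open fail if another load already holds the lock,
    # which addresses the "one day's updates clobber another's" concern.
    lock = table_root + ".refresh.lock"
    fd = os.open(lock, os.O_CREAT | os.O_EXCL)
    try:
        leaf_mtime = os.path.getmtime(leaf_dir)
        current = os.path.dirname(leaf_dir)
        while True:
            # Stamp each ancestor with the leaf's timestamp.
            os.utime(current, (leaf_mtime, leaf_mtime))
            if os.path.samefile(current, table_root):
                break
            current = os.path.dirname(current)
    finally:
        os.close(fd)
        os.remove(lock)

# Demo with the layout from the mail: mytable / 2016-06-06 / 23
root = tempfile.mkdtemp()
table = os.path.join(root, "mytable")
hour = os.path.join(table, "2016-06-06", "23")
os.makedirs(hour)
os.utime(hour, (2_000_000_000, 2_000_000_000))  # pretend hour 23 got new data
touch_parents(hour, table)
print(os.path.getmtime(table) == 2_000_000_000)  # True: root stamped too
```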
> > >
> > > On Tue, Jul 5, 2016 at 8:39 AM, John Omernik <[email protected]> wrote:
> > >
> > > > Working with 1.7.0, the feature that I was very interested in was
> > > > the fix for metadata caching while using user impersonation.
> > > >
> > > > I have a large table, with day directories that can each contain up
> > > > to 1000 Parquet files.
> > > >
> > > > Planning was getting terrible on this table as I added new data,
> > > > and the metadata cache wasn't an option for me because of
> > > > impersonation.
> > > >
> > > > Well, now with 1.7.0 that's working, and it makes a HUGE
> > > > difference. A query that would take 120 seconds now takes 20
> > > > seconds, etc.
> > > >
> > > > Overall, this is a great feature and folks should look into it for
> > > > the performance of large Parquet tables.
> > > >
> > > > Some observations that I would love some help with:
> > > >
> > > > 1. Drill "seems" to know when a new subdirectory was added, and it
> > > > generates the metadata for that directory with the missing data.
> > > > This is without another REFRESH TABLE METADATA command. That works
> > > > great for new directories; however, what happens if you just copy
> > > > new files into an existing directory? Will it use the metadata
> > > > cache that only lists the old files, or will things get updated? I
> > > > guess, how does it know things are in sync?
> > >
> > > When you query folder A that contains a metadata cache, Drill will
> > > check all its sub-directories' last modification times to figure out
> > > if anything changed since the last time the metadata cache was
> > > refreshed. If data was added/removed, Drill will refresh the metadata
> > > cache for folder A.
> > >
> > > > 2. Pertaining to point 1, when new data was added, the first query
> > > > that used that directory partition seemed to write the metadata
> > > > file.
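The modification-time check described in the answer to question 1 above can be sketched in miniature. This is only an illustration of the idea (walk the table's directories and compare their mtimes against the cache's timestamp), not Drill's actual implementation:

```python
import os
import tempfile
import time

def cache_is_stale(table_dir, cache_mtime):
    """True if any directory under table_dir was modified after the
    metadata cache was written. A sketch of the staleness check described
    in the mail, not Drill's actual code."""
    for dirpath, _dirnames, _filenames in os.walk(table_dir):
        if os.path.getmtime(dirpath) > cache_mtime:
            return True
    return False

table = tempfile.mkdtemp()
day = os.path.join(table, "2016-06-06")
os.makedirs(day)
cache_mtime = time.time() + 1               # pretend the cache was just refreshed
print(cache_is_stale(table, cache_mtime))   # False: nothing changed since

# Simulate a later data load into the day folder.
with open(os.path.join(day, "part-0.parquet"), "w") as f:
    f.write("new data")
os.utime(day, (cache_mtime + 60, cache_mtime + 60))  # dir now newer than cache
print(cache_is_stale(table, cache_mtime))   # True: would trigger a refresh
```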
> > > > However, the second query ALSO rewrote the file (and it ran with
> > > > the speed of an uncached directory). However, the third query was
> > > > now running at cached speeds (the 20 seconds vs. 120 seconds). This
> > > > seems odd, but maybe there is a reason?
> > >
> > > Unfortunately, the current implementation of the metadata cache
> > > doesn't support incremental refresh, so each time Drill detects a
> > > change inside the folder, it will run a "full" metadata cache refresh
> > > before running the query. That's what explains why your second query
> > > took so long to finish.
> > >
> > > > 3. Is Drill OK with me running REFRESH TABLE METADATA for only a
> > > > subdirectory? So if I load a day, can I issue REFRESH TABLE
> > > > METADATA `mytable/2016-07-04` and have everything be in a state
> > > > where Drill is happy? I.e., does the mytable metadata need to be
> > > > updated as well, or is that wasted cycles?
> > >
> > > Drill keeps a metadata cache file for every subdirectory of your
> > > table, so you'll end up with a cache file in "mytable" and another
> > > one in "mytable/2016-07-04".
> > > I'm not sure about the following, and other developers will correct
> > > me soon enough, but my understanding is that you can run a refresh
> > > command on the subfolder and it will only cause that particular cache
> > > (and those of its subfolders) to be updated; it won't cause the cache
> > > file in "mytable" or in any other of its subfolders to be updated.
> > > Also, as long as you only query this particular day, Drill won't
> > > detect the change and won't try to update any other metadata cache,
> > > but as soon as you query "mytable" Drill will figure out things have
> > > changed and it will cause a full refresh of the table.
> > >
> > > > 4. Discussion: perhaps we could compress the metadata file?
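For reference, the two refresh variants discussed in question 3 would look something like this in Drill SQL. The `dfs` workspace and `/data/mytable` path are placeholders; whether the subdirectory form really leaves the parent's cache file untouched is exactly the open question in the thread:

```sql
-- Full refresh of the table's metadata cache:
REFRESH TABLE METADATA dfs.`/data/mytable`;

-- Refresh only one day's subdirectory, as proposed above:
REFRESH TABLE METADATA dfs.`/data/mytable/2016-07-04`;
```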
> > > > Each day (for me) has 8.2 MB of data, and the file at the root of
> > > > my table has 332 MB of data. Just using standard gzip/gunzip I got
> > > > the 332 MB file down to 11 MB. That seems like an improvement;
> > > > however, not knowing how this file is used/updated, compression may
> > > > add lag.
> > >
> > > There are definitely other ways we can store the metadata cache
> > > files; compression is one of them, but we also want the alternative
> > > to make it easier to run incremental metadata refreshes.
> > >
> > > > 5. Any other thoughts/suggestions?
> > >
> > > --
> > > Abdelhakim Deneche
> > > Software Engineer
> > > <http://www.mapr.com/>
> > >
> > > Now Available - Free Hadoop On-Demand Training
> > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

--
Abdelhakim Deneche
Software Engineer
<http://www.mapr.com/>

Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
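The gzip experiment from question 4 can be reproduced in miniature with Python's standard library. The JSON blob below is a toy stand-in: the real metadata cache file stores much richer Parquet metadata, but it shares the repetitive per-file structure that makes gzip effective:

```python
import gzip
import json

# Build a toy "metadata cache"-like blob: many near-identical file entries,
# the kind of repetitive structure that compresses very well.
entries = [
    {"path": f"/mytable/2016-06-06/23/part-{i}.parquet",
     "rowCount": 1000, "columns": ["ts", "user", "value"]}
    for i in range(1000)
]
raw = json.dumps(entries).encode("utf-8")
packed = gzip.compress(raw)           # stdlib gzip, same format as gzip(1)

print(f"raw={len(raw)} bytes -> gzip={len(packed)} bytes")
```

The trade-off John raises still applies: the cache is rewritten on every refresh, so decompress-modify-recompress could add lag on the write path even though reads get much smaller files.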
