Answers inline.

On Tue, Jul 5, 2016 at 8:39 AM, John Omernik <[email protected]> wrote:

> Working with 1.7.0, the feature I was most interested in was the fix for
> metadata caching while using user impersonation.
>
> I have a large table with day directories that can each contain up to
> 1000 Parquet files.
>
> Planning on this table was getting terrible as I added new data, and the
> metadata cache wasn't an option for me because of impersonation.
>
> Well, now with 1.7.0 it's working, and it makes a HUGE difference. A
> query that took 120 seconds now takes 20 seconds, etc.
>
> Overall, this is a great feature, and folks should look into it for the
> performance of large Parquet tables.
>
> Some observations that I would love some help with:
>
> 1. Drill "seems" to know when a new subdirectory was added, and it
> generates the metadata for that directory with the missing data, without
> another REFRESH TABLE METADATA command. That works great for new
> directories. However, what happens if you just copy new files into an
> existing directory? Will it use the metadata cache that only lists the
> old files, or will things get updated? I guess, how does it know things
> are in sync?

When you query folder A that contains a metadata cache, Drill will check
all its subdirectories' last modification times to figure out whether
anything has changed since the last time the metadata cache was
refreshed. If data was added or removed, Drill will refresh the metadata
cache for folder A.
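For example, the workflow looks something like this (sketch only; the
path and the dfs workspace are hypothetical, adjust for your setup):

    -- Build the metadata cache once at the table root:
    REFRESH TABLE METADATA dfs.`/data/mytable`;

    -- Later queries trigger the modification-time check at planning time;
    -- if a subdirectory changed, Drill rebuilds the cache before running:
    SELECT COUNT(*) FROM dfs.`/data/mytable`;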
> 2. Pertaining to point 1, when new data was added, the first query that
> used that directory partition seemed to write the metadata file.
> However, the second query ALSO rewrote the file (and it ran at the speed
> of an uncached directory). The third query, though, ran at cached speeds
> (the 20 seconds vs. 120 seconds). This seems odd, but maybe there is a
> reason?

Unfortunately, the current implementation of the metadata cache doesn't
support incremental refresh, so each time Drill detects a change inside
the folder, it will run a "full" metadata cache refresh before running
the query. That explains why your second query took so long to finish.

> 3. Is Drill OK with me running REFRESH TABLE METADATA for only a
> subdirectory? So if I load a day, can I issue REFRESH TABLE METADATA
> `mytable/2016-07-04` and have things be all where Drill is happy? I.e.,
> does the mytable metadata need to be updated as well, or is that wasted
> cycles?

Drill keeps a metadata cache file for every subdirectory of your table,
so you'll end up with a cache file in "mytable" and another one in
"mytable/2016-07-04". I'm not sure about the following, and other
developers will correct me soon enough, but my understanding is that you
can run the refresh command on a subfolder and it will only cause that
particular cache (and those of its subfolders) to be updated; it won't
touch the cache file in "mytable" or in any of its other subfolders.
Also, as long as you only query this particular day, Drill won't detect
the change and won't try to update any other metadata cache, but as soon
as you query "mytable", Drill will figure out that things have changed
and trigger a full refresh of the table.
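Concretely, something like the following (again a sketch reflecting my
understanding above; I'm reusing your "mytable" layout under a
hypothetical dfs path):

    -- Refresh only the newly loaded day; per the above, this should
    -- rewrite the cache in that subfolder (and its subfolders) only:
    REFRESH TABLE METADATA dfs.`/data/mytable/2016-07-04`;

    -- Queries restricted to that day plan from the fresh cache:
    SELECT COUNT(*) FROM dfs.`/data/mytable/2016-07-04`;

    -- The first query against the table root will still detect the
    -- change and trigger a full refresh of the root cache:
    SELECT COUNT(*) FROM dfs.`/data/mytable`;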
> 4. Discussion: perhaps we could compress the metadata file? Each day
> (for me) has 8.2 MB of metadata, and the file at the root of my table is
> 332 MB. Just using standard gzip/gunzip I got the 332 MB file down to
> 11 MB. That seems like an improvement; however, not knowing how this
> file is used/updated, compression may add lag.

There are definitely other ways we could store the metadata cache files;
compression is one of them, but we also want the alternative to make it
easier to run incremental metadata refresh.

> 5. Any other thoughts/suggestions?

--
Abdelhakim Deneche
Software Engineer
<http://www.mapr.com/>
