Working with 1.7.0, the feature I was most interested in was the fix for metadata caching when user impersonation is enabled.
I have a large table partitioned into day directories, each of which can contain up to 1000 Parquet files. Planning was getting terrible on this table as I added new data, and the metadata cache wasn't an option for me because of impersonation. Well, now with 1.7.0 that's working, and it makes a HUGE difference: a query that used to take 120 seconds now takes 20 seconds. Overall, this is a great feature, and folks should look into it for performance on large Parquet tables.

Some observations that I would love some help with:

1. Drill "seems" to know when a new subdirectory has been added, and it generates the metadata for that directory to cover the missing data. This happens without another REFRESH TABLE METADATA command. That works great for new directories, but what happens if you just copy new files into an existing directory? Will it use the metadata cache that only lists the old files, or will things get updated? In short, how does Drill know the cache and the directory are in sync?

2. Related to point 1: when new data was added, the first query that used that directory partition seemed to write the metadata file. However, the second query ALSO rewrote the file (and ran at the speed of an uncached directory). Only the third query ran at cached speed (the 20 seconds vs. 120 seconds). This seems odd, but maybe there is a reason?

3. Is Drill OK with me running REFRESH TABLE METADATA on a subdirectory only? So if I load a day, can I issue REFRESH TABLE METADATA `mytable/2016-07-04` and leave things in a state Drill is happy with? I.e., does the metadata for mytable itself need to be updated as well, or is that wasted cycles?

4. Discussion: perhaps we could compress the metadata file? Each day (for me) has 8.2 MB of metadata, and the file at the root of my table is 332 MB. Using standard gzip/gunzip I got the 332 MB file down to 11 MB. That seems like an improvement; however, since I don't know how this file is used and updated, compression may add lag.

5. Any other thoughts/suggestions?
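For anyone who wants to repeat the measurement from point 4 on their own cache files, here is a rough Python sketch of what I mean. It is not anything Drill does internally; the function name `gzip_stats` is made up for this post, and it only reports the raw and gzipped sizes plus how long a gunzip pass takes, which is one crude way to estimate the lag compression might add on read.

```python
import gzip
import io
import time


def gzip_stats(path, chunk_size=1 << 20):
    """Return (original_bytes, gzipped_bytes, gunzip_seconds) for one file.

    Hypothetical helper for this post: it gzips the file into memory,
    then times a streaming decompression of the result.
    """
    buf = io.BytesIO()
    original = 0
    # Compress in chunks so the uncompressed file is never fully in memory.
    with open(path, "rb") as src, gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            original += len(chunk)
            gz.write(chunk)
    compressed = buf.getvalue()

    # Time a streaming gunzip and check that the size round-trips.
    restored = 0
    start = time.perf_counter()
    with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            restored += len(chunk)
    elapsed = time.perf_counter() - start
    if restored != original:
        raise ValueError("gzip round trip changed the file size")
    return original, len(compressed), elapsed
```

Calling `gzip_stats` on a local copy of the metadata file at the table root and again on one day's file would show whether the ratio I saw holds for both, and whether the gunzip time is small next to the planning time saved.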
