Working with 1.7.0, the feature I was most interested in was the fix for
metadata caching when using user impersonation.

I have a large table split into day directories, each of which can contain
up to 1000 Parquet files.


Planning was getting terrible on this table as I added new data, and the
metadata cache wasn't an option for me because of impersonation.

Well, now with 1.7.0 that's working, and it makes a HUGE difference. A query
that would take 120 seconds now takes 20 seconds.

Overall, this is a great feature and folks should look into it for
performance of large Parquet tables.

Here are some observations that I would love some help with:

1. Drill seems to know when a new subdirectory is added, and it generates
the metadata for that directory, filling in the missing data. This happens
without another REFRESH TABLE METADATA command. That works great for new
directories. However, what happens if you just copy new files into an
existing directory? Will it use the metadata cache that only lists the old
files, or will things get updated? In short, how does it know things are in
sync?

2. Pertaining to point 1, when new data was added, the first query that
used that directory partition seemed to write the metadata file. However,
the second query I ran ALSO rewrote the file (and it ran at the speed of an
uncached directory). Only the third query ran at cached speeds (20 seconds
vs. 120 seconds). This seems odd, but maybe there is a reason?

3. Is Drill OK with me running REFRESH TABLE METADATA for only a
subdirectory? So if I load a day, can I issue REFRESH TABLE METADATA
`mytable/2016-07-04` and leave everything in a state Drill is happy with?
I.e., does the mytable metadata need to be updated as well, or is that
wasted cycles?

4. Discussion: perhaps we could compress the metadata file? Each day (for
me) has 8.2 MB of metadata, and the file at the root of my table has 332 MB.
Just using standard gzip/gunzip I got the 332 MB file down to 11 MB. That
seems like an improvement. However, since I don't know how this file is
used/updated, compression may add lag.

5. Any other thoughts/suggestions?
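
For anyone who wants to repeat the size check from point 4 on their own
table, here is a minimal sketch in Python. It is not what I ran (I used the
gzip/gunzip command-line tools), and the function name gzip_report is made up
for illustration. It gzips a copy of a file into a temporary location and
reports the original size, the compressed size, and how long a full
decompression pass takes, which gives a rough feel for the lag that
compression might add. It says nothing about how Drill itself reads or
updates the file.

```python
import gzip
import os
import shutil
import tempfile
import time


def gzip_report(path):
    """Gzip `path` to a temp file; return (original_bytes, gzipped_bytes, gunzip_seconds).

    Hypothetical helper for illustration only; the source file is not modified.
    """
    original = os.path.getsize(path)
    fd, tmp_name = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        # Compress a copy of the file with default gzip settings.
        with open(path, "rb") as src, gzip.open(tmp_name, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gzipped = os.path.getsize(tmp_name)

        # Time one full read of the compressed copy, 1 MiB at a time.
        start = time.perf_counter()
        with gzip.open(tmp_name, "rb") as src:
            while src.read(1 << 20):
                pass
        gunzip_seconds = time.perf_counter() - start
    finally:
        os.remove(tmp_name)
    return original, gzipped, gunzip_seconds
```

Call it with the path to a local copy of the metadata cache file and compare
the two sizes. The timing only covers decompression on the machine where the
script runs, so treat it as a rough lower bound on any added read cost.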
