Hey Abdel, thanks for the response. On questions 1 and 2, from what I
understood nothing should have changed, but I had to run a third query for
the refresh to take. I'll keep observing to pin down what's happening.

On 3, a logical place to implement (or start implementing) incremental
refresh may be to let a directory's refresh automatically update the
parent's metadata without triggering a cascading (update-everything)
refresh. So if I have a structure like this:

mytable
...dir0=2016-06-06
.......dir1=23

(basically table, days, hours)

then if I update data in hour 23, it would update 2016-06-06 with the new
timestamps and update mytable with the new timestamps. The only issue
would be figuring out a way to take a lock (say you had multiple loads
happening, you'd want to ensure that one day's updates don't clobber
another day's).

Just a thought on that.

Yep, the incremental issue would come into play here.  Are there any design
docs or JIRAs on the incremental updates to metadata?

Thanks for your reply. I'm looking forward to other devs' thoughts on your
answer to 3 as well.

Thanks!

John


On Tue, Jul 5, 2016 at 11:05 AM, Abdel Hakim Deneche <[email protected]>
wrote:

> answers inline.
>
> On Tue, Jul 5, 2016 at 8:39 AM, John Omernik <[email protected]> wrote:
>
> > Working with 1.7.0, the feature that I was most interested in was the
> > fix for metadata caching while using user impersonation.
> >
> > I have a large table, with a day directory that can contain up to 1000
> > parquet files each.
> >
> >
> > Planning was getting terrible on this table as I added new data, and the
> > metadata cache wasn't an option for me because of impersonation.
> >
> > Well, now with 1.7.0 that's working, and it makes a HUGE difference. A
> > query that would take 120 seconds now takes 20 seconds, etc.
> >
> > Overall, this is a great feature and folks should look into it for
> > performance of large Parquet tables.
> >
> > Some observations that I would love some help with.
> >
> > 1. Drill "seems" to know when a new subdirectory was added, and it
> > generates the metadata for that directory with the missing data. This
> > is without another REFRESH TABLE METADATA command. That works great
> > for new directories; however, what happens if you just copy new files
> > into an existing directory? Will it use the metadata cache that only
> > lists the old files, or will things get updated? I guess, how does it
> > know things are in sync?
> >
>
> When you query folder A that contains a metadata cache, Drill will check
> all of its subdirectories' last modification times to figure out whether
> anything changed since the last time the metadata cache was refreshed. If
> data was added/removed, Drill will refresh the metadata cache for folder A.
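[The check Abdel describes is, as I understand it, conceptually something like this toy sketch; this is not actual Drill code, and the cache-file name is made up:]

```python
import os

def cache_is_stale(folder, cache_name=".parquet_metadata_cache"):
    """Toy model: the cache is stale if any immediate subdirectory was
    modified after the cache file was last written."""
    cache_path = os.path.join(folder, cache_name)
    if not os.path.exists(cache_path):
        return True
    cache_mtime = os.path.getmtime(cache_path)
    subdirs = (os.path.join(folder, name) for name in os.listdir(folder)
               if os.path.isdir(os.path.join(folder, name)))
    return any(os.path.getmtime(d) > cache_mtime for d in subdirs)
```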
>
>
> > 2. Pertaining to point 1, when new data was added, the first query that
> > used that directory partition seemed to write the metadata file.
> > However, the second query that ran ALSO rewrote the file (and it ran
> > with the speed of an uncached directory). However, the third query was
> > running at cached speeds (the 20 seconds vs. 120 seconds). This seems
> > odd, but maybe there is a reason?
> >
>
> Unfortunately, the current implementation of the metadata cache doesn't
> support incremental refresh, so each time Drill detects a change inside
> the folder, it will run a "full" metadata cache refresh before running the
> query. That's why your second query took so long to finish.
>
>
> > 3. Is Drill OK with me running REFRESH TABLE METADATA for only a
> > subdirectory? So if I load a day, can I issue REFRESH TABLE METADATA
> > `mytable/2016-07-04` and have everything be in a state where Drill is
> > happy? I.e., does the mytable metadata need to be updated as well, or
> > is that wasted cycles?
> >
>
> Drill keeps a metadata cache file for every subdirectory of your table. So
> you'll end up with a cache file in "mytable" and another one in
> "mytable/2016-07-04".
> I'm not sure about the following, and other developers will correct me
> soon enough, but my understanding is that you can run a refresh command on
> the subfolder and it will only cause that particular cache (and those of
> its subfolders) to be updated; it won't touch the cache file in "mytable"
> or in any of its other subfolders.
> Also, as long as you only query this particular day, Drill won't detect
> the change and won't try to update any other metadata cache, but as soon
> as you query "mytable", Drill will figure out things have changed and will
> trigger a full refresh of the table.
>
>
> > 4. Discussion: perhaps we could compress the metadata file? Each day
> > (for me) has 8.2 MB of metadata, and the file at the root of my table
> > has 332 MB. Just using standard gzip/gunzip I got the 332 MB file down
> > to 11 MB. That seems like an improvement; however, not knowing how this
> > file is used/updated, compression may add lag.
> >
>
> There are definitely other ways we can store the metadata cache files;
> compression is one of them, but we also want the alternative to make it
> easier to run incremental metadata refreshes.
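[For a rough feel of the kind of win gzip gives on this sort of repetitive JSON, here is a synthetic experiment; the metadata shape below is made up and is not Drill's actual cache format:]

```python
import gzip
import json

# Synthetic stand-in for a metadata cache: highly repetitive JSON entries,
# one per Parquet file, much like a day directory with many files.
entries = [{"path": f"dir0=2016-06-06/dir1=23/part-{i:05d}.parquet",
            "rowCount": 1000, "columns": ["ts", "value"]}
           for i in range(5000)]
raw = json.dumps(entries).encode("utf-8")
compressed = gzip.compress(raw)
print(len(raw), len(compressed))  # compressed is far smaller
```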
>
>
> > 5. Any other thoughts/suggestions?
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
