Actually, I slightly misunderstood your 2nd question: so you made some
changes to a subfolder, then ran query A, which caused the cache to refresh;
then you ran another query B that also caused the cache to refresh; then
finally query C actually seemed to use the cache as-is.
Is my understanding now correct? Are queries A and B exactly the same, or
different?

On Tue, Jul 5, 2016 at 10:13 AM, rahul challapalli <[email protected]> wrote:

> John,
>
> Once you add/update data in one of your sub-folders, the immediate next
> query should update the metadata cache automatically, and all subsequent
> queries should fetch metadata from the cache. If this is not the case,
> it's a bug. Can you confirm your findings?
>
> - Rahul
>
> On Tue, Jul 5, 2016 at 9:53 AM, John Omernik <[email protected]> wrote:
>
> > Hey Abdel, thanks for the response. On questions 1 and 2, from what I
> > understood, nothing was changed, but then I had to make the third query
> > for it to take. I'll keep observing to determine what that may be.
> >
> > On 3, a logical place to implement, or start implementing, incremental
> > refresh may be allowing a directory refresh to automatically update the
> > parent's data without causing a cascading (update everything) refresh.
> > So if I have a structure like this:
> >
> > mytable
> > ...dir0=2016-06-06
> > .......dir1=23
> >
> > (basically table, days, hours)
> >
> > then if I update data in hour 23, it would update 2016-06-06 with the
> > new timestamps and update mytable with the new timestamps. The only
> > issue would be figuring out a way to take a lock. (Say you had multiple
> > loads happening; you want to ensure that one day's updates don't
> > clobber another day's.)
> >
> > Just a thought on that.
> >
> > Yep, the incremental issue would come into play here. Are there any
> > design docs or JIRAs on the incremental updates to metadata?
> >
> > Thanks for your reply. I am looking forward to other devs' thoughts on
> > your answer to 3 as well.
> >
> > Thanks!
> >
> > John
> >
> > On Tue, Jul 5, 2016 at 11:05 AM, Abdel Hakim Deneche
> > <[email protected]> wrote:
> >
> > > answers inline.
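John's parent-update proposal could be sketched roughly as below. This is a hypothetical illustration, not Drill code: the function name, the lock-file scheme, and the idea of propagating a leaf directory's mtime up to the table root are all invented for the sketch, under the assumption that parent caches could be marked current without a cascading full refresh.

```python
import os
import tempfile

def touch_parents(leaf_dir, table_root):
    """Propagate leaf_dir's mtime up to table_root, so parent metadata
    caches could be marked current without a full cascading refresh.
    Hypothetical sketch of the proposal in the mail, not Drill's code."""
    # A simple exclusive lock file, kept NEXT TO the table root (not inside
    # it, so creating/removing it never bumps the table's own mtimes).
    # O_EXCL makes os.open fail if another load already holds the lock,
    # which addresses the "one day's updates clobber another's" concern.
    lock = table_root + ".refresh.lock"
    fd = os.open(lock, os.O_CREAT | os.O_EXCL)
    try:
        leaf_mtime = os.path.getmtime(leaf_dir)
        current = os.path.dirname(leaf_dir)
        while True:
            # Stamp each ancestor with the leaf's timestamp.
            os.utime(current, (leaf_mtime, leaf_mtime))
            if os.path.samefile(current, table_root):
                break
            current = os.path.dirname(current)
    finally:
        os.close(fd)
        os.remove(lock)

# Demo with the layout from the mail: mytable / 2016-06-06 / 23
root = tempfile.mkdtemp()
table = os.path.join(root, "mytable")
hour = os.path.join(table, "2016-06-06", "23")
os.makedirs(hour)
os.utime(hour, (2_000_000_000, 2_000_000_000))  # pretend hour 23 got new data
touch_parents(hour, table)
print(os.path.getmtime(table) == 2_000_000_000)  # True: root stamped too
```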
> > >
> > > On Tue, Jul 5, 2016 at 8:39 AM, John Omernik <[email protected]> wrote:
> > >
> > > > Working with 1.7.0, the feature that I was very interested in was
> > > > the fix for metadata caching while using user impersonation.
> > > >
> > > > I have a large table, with day directories that can each contain up
> > > > to 1000 Parquet files.
> > > >
> > > > Planning was getting terrible on this table as I added new data,
> > > > and the metadata cache wasn't an option for me because of
> > > > impersonation.
> > > >
> > > > Well, now with 1.7.0 that's working, and it makes a HUGE
> > > > difference. A query that would take 120 seconds now takes 20
> > > > seconds, etc.
> > > >
> > > > Overall, this is a great feature and folks should look into it for
> > > > the performance of large Parquet tables.
> > > >
> > > > Some observations that I would love some help with:
> > > >
> > > > 1. Drill "seems" to know when a new subdirectory was added, and it
> > > > generates the metadata for that directory with the missing data.
> > > > This is without another REFRESH TABLE METADATA command. That works
> > > > great for new directories; however, what happens if you just copy
> > > > new files into an existing directory? Will it use the metadata
> > > > cache that only lists the old files, or will things get updated? I
> > > > guess, how does it know things are in sync?
> > >
> > > When you query folder A that contains a metadata cache, Drill will
> > > check all its sub-directories' last modification times to figure out
> > > if anything changed since the last time the metadata cache was
> > > refreshed. If data was added/removed, Drill will refresh the metadata
> > > cache for folder A.
> > >
> > > > 2. Pertaining to point 1, when new data was added, the first query
> > > > that used that directory partition seemed to write the metadata
> > > > file.
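The modification-time check described in the answer to question 1 above can be sketched in miniature. This is only an illustration of the idea (walk the table's directories and compare their mtimes against the cache's timestamp), not Drill's actual implementation:

```python
import os
import tempfile
import time

def cache_is_stale(table_dir, cache_mtime):
    """True if any directory under table_dir was modified after the
    metadata cache was written. A sketch of the staleness check described
    in the mail, not Drill's actual code."""
    for dirpath, _dirnames, _filenames in os.walk(table_dir):
        if os.path.getmtime(dirpath) > cache_mtime:
            return True
    return False

table = tempfile.mkdtemp()
day = os.path.join(table, "2016-06-06")
os.makedirs(day)
cache_mtime = time.time() + 1               # pretend the cache was just refreshed
print(cache_is_stale(table, cache_mtime))   # False: nothing changed since

# Simulate a later data load into the day folder.
with open(os.path.join(day, "part-0.parquet"), "w") as f:
    f.write("new data")
os.utime(day, (cache_mtime + 60, cache_mtime + 60))  # dir now newer than cache
print(cache_is_stale(table, cache_mtime))   # True: would trigger a refresh
```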
> > > > However, the second query ALSO rewrote the file (and it ran with
> > > > the speed of an uncached directory). However, the third query was
> > > > now running at cached speeds (the 20 seconds vs. 120 seconds). This
> > > > seems odd, but maybe there is a reason?
> > >
> > > Unfortunately, the current implementation of the metadata cache
> > > doesn't support incremental refresh, so each time Drill detects a
> > > change inside the folder, it will run a "full" metadata cache refresh
> > > before running the query. That's what explains why your second query
> > > took so long to finish.
> > >
> > > > 3. Is Drill OK with me running REFRESH TABLE METADATA for only a
> > > > subdirectory? So if I load a day, can I issue REFRESH TABLE
> > > > METADATA `mytable/2016-07-04` and have everything be in a state
> > > > where Drill is happy? I.e., does the mytable metadata need to be
> > > > updated as well, or is that wasted cycles?
> > >
> > > Drill keeps a metadata cache file for every subdirectory of your
> > > table, so you'll end up with a cache file in "mytable" and another
> > > one in "mytable/2016-07-04".
> > > I'm not sure about the following, and other developers will correct
> > > me soon enough, but my understanding is that you can run a refresh
> > > command on the subfolder and it will only cause that particular cache
> > > (and those of its subfolders) to be updated; it won't cause the cache
> > > file in "mytable" or in any other of its subfolders to be updated.
> > > Also, as long as you only query this particular day, Drill won't
> > > detect the change and won't try to update any other metadata cache,
> > > but as soon as you query "mytable" Drill will figure out things have
> > > changed and it will cause a full refresh of the table.
> > >
> > > > 4. Discussion: perhaps we could compress the metadata file?
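For reference, the two refresh variants discussed in question 3 would look something like this in Drill SQL. The `dfs` workspace and `/data/mytable` path are placeholders; whether the subdirectory form really leaves the parent's cache file untouched is exactly the open question in the thread:

```sql
-- Full refresh of the table's metadata cache:
REFRESH TABLE METADATA dfs.`/data/mytable`;

-- Refresh only one day's subdirectory, as proposed above:
REFRESH TABLE METADATA dfs.`/data/mytable/2016-07-04`;
```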
> > > > Each day (for me) has 8.2 MB of data, and the file at the root of
> > > > my table has 332 MB of data. Just using standard gzip/gunzip I got
> > > > the 332 MB file down to 11 MB. That seems like an improvement;
> > > > however, not knowing how this file is used/updated, compression may
> > > > add lag.
> > >
> > > There are definitely other ways we can store the metadata cache
> > > files; compression is one of them, but we also want the alternative
> > > to make it easier to run incremental metadata refreshes.
> > >
> > > > 5. Any other thoughts/suggestions?
> > >
> > > --
> > > Abdelhakim Deneche
> > > Software Engineer
> > > <http://www.mapr.com/>
> > >
> > > Now Available - Free Hadoop On-Demand Training
> > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

--
Abdelhakim Deneche
Software Engineer
<http://www.mapr.com/>

Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
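The gzip experiment from question 4 can be reproduced in miniature with Python's standard library. The JSON blob below is a toy stand-in: the real metadata cache file stores much richer Parquet metadata, but it shares the repetitive per-file structure that makes gzip effective:

```python
import gzip
import json

# Build a toy "metadata cache"-like blob: many near-identical file entries,
# the kind of repetitive structure that compresses very well.
entries = [
    {"path": f"/mytable/2016-06-06/23/part-{i}.parquet",
     "rowCount": 1000, "columns": ["ts", "user", "value"]}
    for i in range(1000)
]
raw = json.dumps(entries).encode("utf-8")
packed = gzip.compress(raw)           # stdlib gzip, same format as gzip(1)

print(f"raw={len(raw)} bytes -> gzip={len(packed)} bytes")
```

The trade-off John raises still applies: the cache is rewritten on every refresh, so decompress-modify-recompress could add lag on the write path even though reads get much smaller files.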
