Thanks for the replies.
Matteo,
We've been running 0.94.6 since February, so sadly the prod cluster doesn't
have this SKIP_FLUSH option right now. It would be great if there are
options I could use until we upgrade to 0.98.
Ted,
Thanks for the JIRA. That is exactly what we intend to use for running
the MR jobs over snapshots. I just wanted to know how easy/lightweight
snapshotting can be before we set our sights on moving the whole thing over.
Cheers,
-Gautam.
On Tue, Aug 12, 2014 at 3:24 PM, Ted Yu <[email protected]> wrote:
> Gautam:
> Please take a look at this:
> HBASE-8369 MapReduce over snapshot files
>
> Cheers
>
>
> On Tue, Aug 12, 2014 at 3:11 PM, Matteo Bertozzi <[email protected]>
> wrote:
>
> > There is HBASE-10935, included in 0.94.21 where you can specify to skip
> > the memstore flush and the result will be the online version of an
> > "offline snapshot"
> >
> >
> > snapshot 'sourceTable', 'snapshotName', {SKIP_FLUSH => true}
> >
> >
> >
> > On Tue, Aug 12, 2014 at 10:58 PM, Gautam <[email protected]>
> > wrote:
> >
> > > Hello,
> > >
> > > We've been using and loving HBase for a couple of months now. Our
> > > primary use case for HBase is writing events in a stream to an online
> > > time series HBase table. Every so often we run medium to large batch
> > > scan MR jobs on sections (1 hour, 1 day, 1 week) of this same time
> > > series table. This online table is now showing spikes whenever these
> > > large batched read jobs are run. Write throughput goes down while
> > > these sequential scans are running on the table.
> > >
> > > We've been playing around with snapshots and are considering using
> > > them to run these scheduled hourly, daily, and weekly jobs so that the
> > > online table isn't affected. From preliminary tests it looks like
> > > online snapshots take way too long. The snapshot job times out after
> > > 60 secs. The time was spent flushing the memstores on all region
> > > servers (as expected), which seems to take too long. It also seems
> > > from the RS logs that this is done serially.
> > >
> > > Offline snapshots aren't an option since we can't disable this table,
> > > which serves the event writing.
> > >
> > > *We're running HBase 0.94.6. Tried benchmarking snapshotting on a 9 TB
> > > table with 240 regions, 1 column family, 4 region servers.*
> > >
> > > All in all, I'd like to ask if things would improve if we upgraded to
> > > HBase 0.98+. Are there known benchmark numbers on expected snapshot
> > > performance for 0.94+ vs. 0.98+? In an ideal scenario we'd like these
> > > MR jobs to dynamically take a snapshot, run the job, and delete/re-use
> > > the snapshot based on freshness. At the least, we need the snapshot to
> > > be fresh to within the last hour.
> > >
> > > Also, from what I understand, scans in HBase are not consistent at the
> > > table level but are at the row level. Are there other ways I can query
> > > the online table without hurting the write throughput?
> > >
> > > Cheers,
> > > -Gautam.
> > >
> >
>
--
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."