Gautam: Please take a look at this: HBASE-8369 MapReduce over snapshot files
Cheers

On Tue, Aug 12, 2014 at 3:11 PM, Matteo Bertozzi <[email protected]> wrote:

> There is HBASE-10935, included in 0.94.21, where you can specify to skip
> the memstore flush, and the result will be the online version of an
> "offline snapshot":
>
> snapshot 'sourceTable', 'snapshotName', {SKIP_FLUSH => true}
>
> On Tue, Aug 12, 2014 at 10:58 PM, Gautam <[email protected]> wrote:
>
> > Hello,
> >
> > We've been using and loving HBase for a couple of months now. Our primary
> > use case for HBase is writing events in a stream to an online time series
> > HBase table. Every so often we run medium to large batch scan MR jobs on
> > sections (1 hour, 1 day, 1 week) of this same time series table. This
> > online table is now showing spikes whenever these large batched read jobs
> > are run. Write throughput goes down while these sequential scans are
> > running on the table.
> >
> > We've been playing around with snapshots and are considering using
> > snapshots to take over the responsibility for running these scheduled
> > hourly, daily, and weekly jobs so that the online table isn't affected.
> > From preliminary tests it looks like online snapshots take way too long.
> > The snapshot job times out after 60 seconds. The time was spent flushing
> > the memstores on all region servers (as expected), which seems to take too
> > long. Also, it seems from the RS logs that this is done serially.
> >
> > Offline snapshots aren't an option since we can't disable this table,
> > which serves the event writing.
> >
> > *We're running HBase 0.94.6. Tried benchmarking snapshotting on a 9 TB
> > table with 240 regions, 1 column family, 4 region servers.*
> >
> > All in all, I'd like to ask if things would improve if we upgraded to
> > HBase 0.98.+. Are there known benchmark numbers on expected snapshot
> > performance for 0.94.+ vs. 0.98.+?
> >
> > In an ideal scenario we'd like these MR jobs to dynamically take a
> > snapshot, run the job, and delete or re-use the snapshot based on
> > freshness. At the least, we need the snapshot to be fresh to within the
> > last hour.
> >
> > Also, from what I understand, scans in HBase are not consistent at the
> > table level but are at the row level. Are there other ways I can query the
> > online table without hurting the write throughput?
> >
> > Cheers,
> > -Gautam.
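
The "take a snapshot, run the job, delete/re-use the snapshot based on freshness" idea from the original question could be sketched roughly as below. This is only an illustration under stated assumptions, not anything HBase ships: the `<table>-<timestamp>` naming scheme, the one-hour threshold constant, and the helper names (`snapshot_name`, `parse_taken_at`, `plan`, `shell_commands`) are all hypothetical. The only HBase pieces used are the shell's `snapshot` command with the `SKIP_FLUSH` option mentioned above and the shell's `delete_snapshot` command, and the sketch only builds those command strings rather than running them.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Assumption: a snapshot counts as "fresh" if it is at most one hour old.
MAX_SNAPSHOT_AGE = timedelta(hours=1)
# Assumption: snapshots are named "<table>-<YYYYmmddHHMMSS>" (hypothetical scheme).
STAMP_FORMAT = "%Y%m%d%H%M%S"


def snapshot_name(table: str, taken_at: datetime) -> str:
    """Build a snapshot name that encodes when the snapshot was taken."""
    return f"{table}-{taken_at.strftime(STAMP_FORMAT)}"


def parse_taken_at(name: str, table: str) -> Optional[datetime]:
    """Recover the timestamp from a snapshot name, or None if it is not ours."""
    prefix = table + "-"
    if not name.startswith(prefix):
        return None
    try:
        return datetime.strptime(name[len(prefix):], STAMP_FORMAT)
    except ValueError:
        return None


def plan(table: str, existing: List[str], now: datetime,
         max_age: timedelta = MAX_SNAPSHOT_AGE) -> Tuple[str, bool, List[str]]:
    """Return (snapshot to read from, whether it must be created, stale names)."""
    fresh, stale = [], []
    for name in existing:
        taken_at = parse_taken_at(name, table)
        if taken_at is None:
            continue  # Not a snapshot of this table under our naming scheme.
        age = now - taken_at
        if timedelta(0) <= age <= max_age:
            fresh.append((taken_at, name))
        elif age > max_age:
            stale.append(name)
    if fresh:
        # Re-use the newest snapshot that is still fresh enough.
        return max(fresh)[1], False, sorted(stale)
    return snapshot_name(table, now), True, sorted(stale)


def shell_commands(table: str, use: str, create: bool,
                   stale: List[str]) -> List[str]:
    """Render the plan as HBase shell commands (strings only, nothing is run)."""
    cmds = []
    if create:
        cmds.append(f"snapshot '{table}', '{use}', {{SKIP_FLUSH => true}}")
    cmds.extend(f"delete_snapshot '{name}'" for name in stale)
    return cmds
```

A scheduler would call `plan` with the current snapshot list before each hourly, daily, or weekly job, run the emitted commands, and then point the MR job at the chosen snapshot. Note that with `SKIP_FLUSH` the snapshot does not include edits still sitting in the memstore, so "fresh" here means fresh as of the last flush before the snapshot.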
