Unfortunately, anything that moves regions around (under normal operations, that means splitting and the rebalancer) can cause full-table scans to fail in a variety of ways.

Didn't know Mozilla did this! Pretty cool.

Doing the scans manually should work too, though of course having it encapsulated the way Mozilla did it with their MultiScanInputFormat is nicer. It's schema-specific though... not sure how to make that generally applicable. Maybe some sort of an interface you can implement to provide the date mapping?

D

On Tue, Sep 13, 2011 at 8:08 AM, Norbert Burger <[email protected]> wrote:

> We tried using multiple LOADs because we want to minimize the data loaded
> and take advantage of the pushdown filter support for -gte and -lte in
> HBaseStorage. At the same time, a salted key schema forces different key
> prefixes, so we ended up with 14 LOADs, one for each salted region.
>
> Doing some research, it seems like the Mozilla folks solved the issue in
> Socorro by writing a custom LoadFunc:
>
> https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/pig/load/HBaseMultiScanLoader.java
>
> The custom LoadFunc seems cleaner, since we can manipulate the 14 HBase
> scanners directly, but at the cost of writing some Java glue code. Should
> we expect however the 14 Pig LOADs also to work?
>
> I'll check and see why the scanners are timing out. We do have automatic
> splitting turned on, but the region size is high enough (1 GB) that they
> shouldn't be splitting often. The HBase rebalancer is probably turned on -
> would this be enough to cause the timeouts?
>
> Norbert
>
> On Tue, Sep 13, 2011 at 10:43 AM, Dmitriy Ryaboy <[email protected]> wrote:
>
> > Why not just one load?
> >
> > Check why the scanners are timing out. Are the regions splitting under
> > you while you scan? Do you have the hbase rebalancer turned on?
> >
> > On Sep 12, 2011, at 7:51 AM, Norbert Burger <[email protected]> wrote:
> >
> > > Folks -- we have a timeseries-based table we recently converted to a
> > > salted key schema [1] in order to avoid region hotspotting. The rowkey
> > > format is:
> > >
> > > salt-timestamp-sessionid-eventtype, where:
> > >
> > > salt has the form 00..13, and the timestamp is a Unix timestamp (epoch
> > > based).
> > >
> > > With the version 0.10.0 HBaseStorage, what's the recommended way to LOAD
> > > a salted schema from Pig? Initially, I thought we'd just fire off
> > > multiple LOADs, one for each region (in our case, up to 14), but we're
> > > hitting frequently ScannerTimeoutExceptions with this approach, even on
> > > a sample script that does nothing but LOADs.
> > >
> > > Is there a better way?
> > >
> > > Thanks,
> > > Norbert
> > >
> > > [1]
> > > http://ofps.oreilly.com/titles/9781449396107/advanced.html#ch09_id2336987
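
For what it's worth, the "interface you can implement to provide the date mapping" idea could look roughly like the sketch below. Everything in it is hypothetical: KeyRangeMapper and SaltedTimeRangeMapper are made-up names, not part of Pig, HBase, or akela. It assumes the key parts are joined with a literal "-" and that all timestamps have the same number of digits, so that string order matches numeric order. It only computes the per-salt start/stop keys; wiring them into scanners or -gte/-lte options is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical interface: maps a time window to the row-key ranges to scan. */
interface KeyRangeMapper {
    /** Returns one {startKey, stopKey} pair per range that covers the window. */
    List<String[]> rangesFor(long startEpoch, long stopEpoch);
}

/**
 * Sketch for a salt-timestamp-sessionid-eventtype schema.
 * Assumes a "-" separator and two-digit salts 00..(saltCount - 1).
 */
class SaltedTimeRangeMapper implements KeyRangeMapper {
    private final int saltCount;

    SaltedTimeRangeMapper(int saltCount) {
        this.saltCount = saltCount;
    }

    @Override
    public List<String[]> rangesFor(long startEpoch, long stopEpoch) {
        List<String[]> ranges = new ArrayList<>();
        for (int salt = 0; salt < saltCount; salt++) {
            // Zero-pad the salt so "00".."13" sort the same way the keys do.
            String prefix = String.format(Locale.ROOT, "%02d-", salt);
            // Full row keys are longer than these prefixes, so rows stamped
            // exactly stopEpoch sort after the stop key and fall outside it.
            ranges.add(new String[] { prefix + startEpoch, prefix + stopEpoch });
        }
        return ranges;
    }
}
```

With 14 salts this yields 14 key pairs, from "00-<start>" / "00-<stop>" up to "13-<start>" / "13-<stop>". A different schema would supply its own KeyRangeMapper, which is what would keep the loader itself from being schema-specific.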
