These are both good suggestions -- thanks! Varun's suggestion will work if I can do all of my lookups at once. In fact, I wouldn't even need a reduce phase; I could just prepend all my keys with hashes, add them to some kind of sorted array, then pop them off one-by-one. Not sure if I can do all my lookups at once, though, for reasons I left out of my example for the sake of brevity.
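For anyone following along, the prepend-a-hash-then-sort idea can be sketched in a few lines of Python. This is only an illustration of the key-salting technique, not code from the thread — the 4-character md5 prefix and the `_` separator are my own assumptions:

```python
import hashlib

def salted_key(key, prefix_len=4):
    # Prepend a truncated hash of the key so that sort order (and hence
    # which region each lookup lands on) spreads out evenly instead of
    # every worker starting at the same point in the keyspace.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return digest[:prefix_len] + "_" + key

def original_key(salted):
    # Recover the real key by stripping the hash prefix. Splitting on
    # the first "_" only, so keys containing "_" survive intact.
    return salted.split("_", 1)[1]

words = ["apple", "banana", "cherry"]
queue = sorted(salted_key(w) for w in words)

# Pop the salted keys off one-by-one; successive lookups now walk a
# hash-distributed order rather than the natural key order.
while queue:
    w = original_key(queue.pop(0))
    # ... issue the lookup for `w` here ...
```

The same trick is what Varun and Mike Segel are debating further down-thread: a truncated hash of the key itself (rather than an arbitrary salt) keeps the prefix deterministic, so you can always reconstruct it for point gets.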
Michael's suggestion will work if I can find a good way to partition the words such that they're evenly distributed. This may or may not be possible. In any case, I like both suggestions more than my original idea of using 4000 tables, so thank you both for the help!

--Jeremy

On Tue, Sep 24, 2013 at 10:16 PM, Michael Webster <[email protected]> wrote:

> The biggest issue I see with so many tables is the region counts could
> get quite large. With 4000 tables, you will need at least that many
> regions, not even accounting for splitting the regions/growth.
>
> Forgive the speculation, but it almost sounds like you want an inverted
> index. Could you not just store the word as the rowkey and have each
> position be a column? 4000 columns isn't really that many, and if this
> is text it will probably be pretty sparse, i.e. most words won't appear
> in all 4K positions. When you set up the scanner for your MR job, add
> the column for the position that you want.
>
> It seems that the downside of trying to push the reads into the reduce
> phase by prepending a hash/salt to each real rowkey is, assuming the
> hashes are well distributed, each reducer will get a random slice of the
> data and will have to issue individual gets for their rowkeys instead of
> taking advantage of HBase's scan performance.
>
>
> On Tue, Sep 24, 2013 at 8:57 PM, Michael Segel <[email protected]> wrote:
>
> > Since different people use different terms... Salting is BAD. (You
> > need to understand what is implied by the term salt.)
> >
> > What you really want to do is take the hash of the key, and then
> > truncate the hash. Use that instead of a salt.
> >
> > Much better than a salt.
> >
> > Sent from a remote device. Please excuse any typos...
> >
> > Mike Segel
> >
> > > On Sep 24, 2013, at 5:17 PM, "Varun Sharma" <[email protected]> wrote:
> > >
> > > It's better to do some "salting" in your keys for the reduce phase.
> > > Basically, make your key be something like "KeyHash + Key" and then
> > > decode it in your reducer and write to HBase. This way you avoid the
> > > hotspotting problem on HBase due to MapReduce sorting.
> > >
> > >
> > > On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <[email protected]> wrote:
> > >
> > >> Hi Jeremy,
> > >>
> > >> I don't see any issue for HBase to handle 4000 tables. However, I
> > >> don't think it's the best solution for your use case.
> > >>
> > >> JM
> > >>
> > >>
> > >> 2013/9/24 jeremy p <[email protected]>
> > >>
> > >>> Short description: I'd like to have 4000 tables in my HBase
> > >>> cluster. Will this be a problem? In general, what problems do you
> > >>> run into when you try to host thousands of tables in a cluster?
> > >>>
> > >>> Long description: I'd like the performance advantage of pre-split
> > >>> tables, and I'd also like to do filtered range scans. Imagine a
> > >>> keyspace where the key consists of [POSITION]_[WORD], where
> > >>> POSITION is a number from 1 to 4000, and WORD is a string
> > >>> consisting of 96 characters. The value in the cell would be a
> > >>> single integer. My app will examine a 'document', where each
> > >>> 'line' consists of 4000 WORDs. For each WORD, it'll do a filtered
> > >>> regex lookup. Only problem? Say I have 200 mappers and they all
> > >>> start at POSITION 1 -- my region servers would get hotspotted like
> > >>> crazy. So my idea is to break it into 4000 tables (one for each
> > >>> POSITION), and then pre-split the tables such that each region
> > >>> gets an equal amount of the traffic. In this scenario, the key
> > >>> would just be WORD. Dunno if this is a bad idea; I'd be open to
> > >>> suggestions.
> > >>>
> > >>> Thanks!
> > >>>
> > >>> --J
>
> --
> *Michael Webster*, Software Engineer
> Bronto Software
> [email protected]
> bronto.com <http://www.bronto.com/>
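Michael's inverted-index layout is also easy to picture with plain dicts standing in for the table (a sketch only — the `pos:N` column-qualifier naming is my own, and a real scan would of course go through the HBase client rather than a dict):

```python
# rowkey = word, one column per position. Sparse: most words appear
# at only a few of the 4000 positions, so most columns are absent.
table = {
    "hbase":  {"pos:17": 3, "pos:921": 1},
    "region": {"pos:17": 2},
    "salt":   {"pos:3600": 5},
}

def scan_position(table, position):
    # Equivalent of adding a single column to the MR job's scanner:
    # only rows that actually have a value at `position` come back.
    col = "pos:%d" % position
    return {word: cols[col] for word, cols in table.items() if col in cols}

print(scan_position(table, 17))  # -> {'hbase': 3, 'region': 2}
```

The appeal over the 4000-tables scheme is that one table (and its regions) serves every position, and a scan restricted to a single column qualifier skips rows with no data there.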
