So you salt the keys for the reduce phase, but you do not salt the keys in HBase. That means the reducers do not see the keys in sorted order, but they do still see all the values for a given key together.

The hash is essentially a trick that stays within MapReduce and never makes it into HBase. That keeps you from hotspotting a region, and since the hash is not written into HBase, you can still do your prefix/regex scans.
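Roughly, the pattern is a mapper that prepends a small salt to the row key and a reducer that strips it again before the Put. This is only a sketch of that idea, not your actual job: the class names, the 32-bucket salt, the "f"/"count" family and qualifier, and the assumed "POSITION_WORD<TAB>count" input format are all placeholders.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SaltedHBaseLoad {

  // Mapper: prefix the real row key with a small hash so the shuffle spreads
  // consecutive keys across reducers instead of feeding HBase one hot range
  // in strictly sorted order.
  public static class SaltingMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int BUCKETS = 32; // hypothetical salt range

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed input format: "<POSITION>_<WORD>\t<count>"
      String[] parts = line.toString().split("\t");
      String rowKey = parts[0];
      int salt = (rowKey.hashCode() & Integer.MAX_VALUE) % BUCKETS;
      // The salt exists only in the MapReduce key; it never reaches HBase.
      // Same row key -> same salt, so all its values still meet in one reducer.
      context.write(new Text(String.format("%02d_%s", salt, rowKey)),
                    new IntWritable(Integer.parseInt(parts[1])));
    }
  }

  // Reducer: strip the salt and write the original, unsalted row key to
  // HBase, so prefix/regex scans on POSITION_WORD keep working.
  public static class UnsaltingReducer
      extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

    @Override
    protected void reduce(Text saltedKey, Iterable<IntWritable> values,
                          Context context)
        throws IOException, InterruptedException {
      String rowKey = saltedKey.toString().substring(3); // drop the "NN_" salt
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      Put put = new Put(Bytes.toBytes(rowKey));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(sum));
      context.write(new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put);
    }
  }
}

You would wire it up with TableMapReduceUtil.initTableReducerJob() as usual; the only thing that changes versus a normal load job is the salt prefix that lives and dies inside the shuffle.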
On Tue, Sep 24, 2013 at 4:16 PM, Jean-Marc Spaggiari <[email protected]> wrote:

> If you have a fixed length like:
>
> XXXX_AAAAAAA
>
> Where XXXX is a number from 0000 to 4000 and AAAA is your word, then simply
> split by the number?
>
> Then when you insert each line, it will write to 4000 different regions,
> which can be hosted in 4000 different servers if you have that. And there
> will be no hot-spotting?
>
> Then when you run an MR job, you will have one mapper per region. Each
> region will be evenly loaded. And if you start 200 jobs, then each region
> will have 200 mappers, still evenly loaded, no?
>
> Or I might be missing something from your use case.
>
> JM
>
> 2013/9/24 jeremy p <[email protected]>
>
> > Varun: I'm familiar with that method of salting. However, in this case,
> > I need to do filtered range scans. When I do a lookup for a given WORD
> > at a given POSITION, I'll actually be doing a regex on a range of WORDs
> > at that POSITION. If I salt the keys with a hash, the WORDs will no
> > longer be sorted, and so I would need to do a full table scan for every
> > lookup.
> >
> > Jean-Marc: What problems do you see with my solution? Do you have a
> > better suggestion?
> >
> > --Jeremy
> >
> > On Tue, Sep 24, 2013 at 3:16 PM, Varun Sharma <[email protected]> wrote:
> >
> > > It's better to do some "salting" in your keys for the reduce phase.
> > > Basically, make your key be something like "KeyHash + Key" and then
> > > decode it in your reducer and write to HBase. This way you avoid the
> > > hotspotting problem on HBase due to MapReduce sorting.
> > >
> > > On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari
> > > <[email protected]> wrote:
> > >
> > > > Hi Jeremy,
> > > >
> > > > I don't see any issue for HBase to handle 4000 tables. However, I
> > > > don't think it's the best solution for your use case.
> > > >
> > > > JM
> > > >
> > > > 2013/9/24 jeremy p <[email protected]>
> > > >
> > > > > Short description: I'd like to have 4000 tables in my HBase
> > > > > cluster. Will this be a problem? In general, what problems do you
> > > > > run into when you try to host thousands of tables in a cluster?
> > > > >
> > > > > Long description: I'd like the performance advantage of pre-split
> > > > > tables, and I'd also like to do filtered range scans. Imagine a
> > > > > keyspace where the key consists of [POSITION]_[WORD], where
> > > > > POSITION is a number from 1 to 4000, and WORD is a string
> > > > > consisting of 96 characters. The value in the cell would be a
> > > > > single integer. My app will examine a 'document', where each
> > > > > 'line' consists of 4000 WORDs. For each WORD, it'll do a filtered
> > > > > regex lookup. Only problem? Say I have 200 mappers and they all
> > > > > start at POSITION 1; my region servers would get hotspotted like
> > > > > crazy. So my idea is to break it into 4000 tables (one for each
> > > > > POSITION), and then pre-split the tables such that each region
> > > > > gets an equal amount of the traffic. In this scenario, the key
> > > > > would just be WORD. Dunno if this is a bad idea; would be open to
> > > > > suggestions.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --J
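For the record, here is roughly what the single pre-split table Jean-Marc describes in the quoted thread above could look like, with fixed-width POSITION_WORD keys. This is only a sketch: the "positions" table name, the "f" family, the 3999 split points, and the 0.96-style admin API are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitPositions {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("positions"));
    desc.addFamily(new HColumnDescriptor("f"));

    // One region per POSITION prefix: region N holds every "NNNN_*" key,
    // so inserts for different positions land on different regions and a
    // lookup for one position never leaves its region.
    int positions = 4000;
    byte[][] splits = new byte[positions - 1][];
    for (int i = 1; i < positions; i++) {
      splits[i - 1] = Bytes.toBytes(String.format("%04d_", i));
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}

A filtered lookup for one position then stays an ordinary bounded scan, e.g. start row "0017_" and stop row "0018_", plus a RowFilter with a RegexStringComparator for the WORD part, and it only touches the region that owns that prefix.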
