This is not super clear; some comments inline.

J-D
On Tue, Jun 22, 2010 at 12:49 AM, Raghava Mutharaju <[email protected]> wrote:
> Hello all,
>
> In the data, I have to check for multiple conditions and then work
> with the data that satisfies all the conditions. I am doing this as an MR
> job with no reduce and the conditions are translated to a set of filters.
> Among the multiple conditions (2 or 3 max), data that satisfies one of them
> would come as input to the Map (initial filter is set in the scan to the
> mappers). Now, from among the dataset that comes through to each map, I
> would check for other conditions (1 or 2 remaining conditions). Since map()
> is called for each row of data, it would mean 1 or 2 read calls (with
> filter) to HBase tables. This setup, even for small data (data would fit in

Here you talk about checking 1-2 conditions... are they checked on the
row that was mapped? If not, that means you are doing 1-2 Gets per row?
If so, this is definitely going to be slow!

> a region and so only 1 map is taking in all the data) is very slow.

What do you mean? That currently your test is done on 1 region but you
expect more? If not, then don't use MR, since it would give you nothing
except more code to write and more processing time.

>
> Here, note that, I shouldn't be filtering the incoming data to map but based
> on that data, next set of filtering conditions would be formed.

Can you give an example?

>
> Can this be improved? Would constructing secondary indexes help (would need
> a dramatic improvement actually)? Or is this type of problem not suitable
> for HBase?
>
> Thank you.
>
> Regards,
> Raghava.
>
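
To make the per-row Get concern above concrete, here is a minimal sketch in plain Java. It does not use the HBase client API: `FakeTable`, `demoTable`, the column names `c2`, `c3` and `ref`, and the compared values are all hypothetical stand-ins. The counter only shows how many point lookups each approach would issue, assuming each remaining condition needs its own lookup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class PerRowLookupSketch {

    /** Hypothetical in-memory stand-in for a table; counts point lookups. */
    static final class FakeTable {
        private final Map<String, Map<String, String>> rows = new HashMap<>();
        int getCalls = 0;

        void put(String rowKey, String column, String value) {
            rows.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
        }

        /** Simulates one point lookup (one round trip against a real table). */
        Map<String, String> get(String rowKey) {
            getCalls++;
            Map<String, String> row = rows.get(rowKey);
            return row != null ? row : Collections.<String, String>emptyMap();
        }

        /** Simulates the scan that feeds rows to map(). */
        Iterable<Map<String, String>> scan() {
            return rows.values();
        }
    }

    /** Builds n rows; every row's "ref" column points at row "r0" (assumed layout). */
    static FakeTable demoTable(int n) {
        FakeTable table = new FakeTable();
        for (int i = 0; i < n; i++) {
            String key = "r" + i;
            table.put(key, "c2", "x");
            table.put(key, "c3", (i % 2 == 0) ? "y" : "z");
            table.put(key, "ref", "r0");
        }
        return table;
    }

    /** Remaining conditions are checked on the row that was mapped: no extra lookups. */
    static int countMatchesInRow(FakeTable table) {
        int matches = 0;
        for (Map<String, String> row : table.scan()) {
            if ("x".equals(row.get("c2")) && "y".equals(row.get("c3"))) {
                matches++;
            }
        }
        return matches;
    }

    /** Each remaining condition triggers its own lookup of another row, per mapped row. */
    static int countMatchesWithGets(FakeTable table) {
        int matches = 0;
        for (Map<String, String> row : table.scan()) {
            String ref = row.get("ref");
            if (ref == null) {
                continue;
            }
            boolean first = "x".equals(table.get(ref).get("c2"));   // lookup 1
            boolean second = "y".equals(table.get(ref).get("c3"));  // lookup 2
            if (first && second) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        FakeTable inRow = demoTable(1000);
        countMatchesInRow(inRow);
        System.out.println("lookups when checking the mapped row: " + inRow.getCalls);

        FakeTable perRowGets = demoTable(1000);
        countMatchesWithGets(perRowGets);
        System.out.println("lookups when issuing Gets per row: " + perRowGets.getCalls);
    }
}
```

In this sketch the second method issues one lookup per remaining condition per mapped row, so the lookup count grows with rows times conditions. Against a real table each of those would be a separate read, which is the likely source of the slowness if that is what the job is doing.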
