This is not super clear; some comments inline.

J-D

On Tue, Jun 22, 2010 at 12:49 AM, Raghava Mutharaju
<[email protected]> wrote:
> Hello all,
>
>      In the data, I have to check for multiple conditions and then work
> with the data that satisfies all the conditions. I am doing this as an MR
> job with no reduce and the conditions are translated to a set of filters.
> Among the multiple conditions (2 or 3 max), data that satisfies one of them
> would come as input to the Map (initial filter is set in the scan to the
> mappers). Now, from among the dataset that comes through to each map, I
> would check for other conditions (1 or 2 remaining conditions). Since map()
> is called for each row of data, it would mean 1 or 2 read calls (with
> filter) to HBase tables. This setup, even for small data (data would fit in

Here you talk about checking 1-2 conditions... are they checked on
the row that was mapped? If not, does that mean you are doing 1-2 Gets
per row? If so, this is definitely going to be slow!
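
To put rough numbers on that, here is a toy sketch in plain Python. It
is not the HBase client API: FakeTable and both map functions are made
up for illustration, and each get() is simply counted as one round
trip. The even/odd check and the 1000-row input are arbitrary
assumptions.

```python
# Toy model only: FakeTable is a hypothetical stand-in, not the HBase client API.
class FakeTable:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.round_trips = 0  # every get() is counted as one round trip

    def get(self, key):
        self.round_trips += 1
        return self.rows.get(key)


def map_checking_mapped_row(row_key, row_value):
    # Remaining condition is evaluated on the row already handed to map(),
    # so no extra lookups are needed. The even/odd test is arbitrary.
    return row_value % 2 == 0


def map_with_extra_gets(row_key, other_table, gets_per_row):
    # Each remaining condition needs its own lookup in another table.
    return all(other_table.get(row_key) is not None for _ in range(gets_per_row))


rows = {f"row{i}": i for i in range(1000)}  # assumed input to a single mapper
other = FakeTable(rows)

kept_local = [k for k, v in rows.items() if map_checking_mapped_row(k, v)]
kept_remote = [k for k in rows if map_with_extra_gets(k, other, gets_per_row=2)]

print(len(kept_local), other.round_trips)  # 500 2000
```

In this model the first variant makes no extra calls at all, while the
second makes 2 lookups for each of the 1000 mapped rows, and that count
grows linearly with the input.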

> a region and so only 1 map is taking in all the data) is very slow.

What do you mean? That your test currently runs on 1 region but you
expect more later? If not, then don't use MR, since it would give you
nothing but more code to write and more processing time.

>
> Here, note that, I shouldn't be filtering the incoming data to map but based
> on that data, next set of filtering conditions would be formed.

Can you give an example?

>
> Can this be improved? Would constructing secondary indexes help (would need
> a dramatic improvement actually)? Or is this type of problem not suitable
> for HBase?
>
> Thank you.
>
> Regards,
> Raghava.
>
