Inline. J-D
On Wed, Jun 23, 2010 at 10:16 AM, Raghava Mutharaju <[email protected]> wrote:
> Hello JD,
>
> Thank you for the response.
>
>>>> Are the Gets done on the same row that is mapped? Or on the same table?
>>>> Or another table?
> By row that is mapped, does it mean the row that is given to the map()
> method as a <K,V> pair? Then no, the data from this row is used to construct
> a filter which is applied on another table. No, it is not on the same table
> that this row has come from.

Ok thanks

>>>> Can you give a real example of what you are trying to achieve?
> It is similar to a rule engine. I have to take in the data, apply some rules
> on it and generate new data. These rules can be taken as "if..then"
> statements with multiple conditions in "if". I have to check which subset of
> data satisfies these conditions to apply the "then" part.
> Eg: Transitive property. if (A<B and B<C and C<D) then A<D
>
> For implementing this I am using multiple filters. For the initial scan
> which forms the InputSplit to the maps, I put in the first filter (say
> something like get all the values which are > A). Then in the map, I would
> take in the values (say B) and for each value, I have to put in 2 more
> filters
> Filter-1: Find all values (say C) which are greater than the B in above
> step.

So is this a Scan or a Get? Normally, if you wanted to find all the rows that
have some value > B, you'd do another MR job, right? Else this is a full
table scan?

> Filter-2: For each value obtained as output (which is designated as C) of
> Filter-1, find values greater than D.
> Now take the output of Filter-2 and write it out into a table.

What's D? Or did you mean C? And to find all those values, is it another
scan/get on a third table that gives you your D, which you insert into a
fourth table?

> Since there are multiple reads involved with each row received by the map,
> it is slow. Is there any way to improve the speed? or is this type of
> problem not suitable for HBase/Hadoop?
I specifically asked you for an example involving data, because here you are showing us a potential solution that you are trying to optimize, without giving us the full problem statement. Do you really need 3-4 tables? What does your dataset look like? Can we think of another way of doing this? As far as I understand your problem right now, it won't scale, since what you are doing is basically O(n^c).
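[Editor's note: for illustration, here is a rough sketch in plain Python of the nested-filter access pattern described in the quoted message. The data, function name, and thresholds are made up, and each in-memory list comprehension stands in for a filtered HBase Scan against another table; this is not HBase code, just a model of the read pattern and why it grows as O(n^c).]

```python
# Model of the pipeline from the thread: initial scan feeds map(),
# then "Filter-1" and "Filter-2" each rescan the data per value seen.
# All names and data here are hypothetical.

def transitive_pairs(values, threshold_a):
    """Rule from the thread: if A < B and B < C and C < D then A < D."""
    reads = 0
    results = []
    # Initial scan: all values greater than A (forms the map input).
    bs = [v for v in values if v > threshold_a]
    for b in bs:                                # one map() call per B
        cs = [v for v in values if v > b]       # "Filter-1": a scan per B
        reads += 1
        for c in cs:
            ds = [v for v in values if v > c]   # "Filter-2": a scan per C
            reads += 1
            for d in ds:
                results.append((threshold_a, d))  # the "then A < D" part
    return results, reads

pairs, reads = transitive_pairs(list(range(10)), threshold_a=0)
# Each nested condition triggers a fresh scan of the dataset, so the
# number of reads grows roughly as O(n^c) for c chained conditions.
```

With 10 values this already issues 45 inner scans; doubling the data roughly quadruples the Filter-1 scans and multiplies the Filter-2 scans further, which matches the scaling concern raised above.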
