Inline. J-D
On Wed, Jun 23, 2010 at 10:16 AM, Raghava Mutharaju <[email protected]> wrote:
> Hello JD,
>
> Thank you for the response.
>
>>>> Are the Gets done on the same row that is mapped? Or on the same table?
>>>> Or another table?
> By row that is mapped, does it mean the row that is given to the map()
> method as a <K,V> pair? Then no, the data from this row is used to construct
> a filter which is applied on another table. No, it is not on the same table
> that this row has come from.

Ok thanks

>>>> Can you give a real example of what you are trying to achieve?
> It is similar to a rule engine. I have to take in the data, apply some rules
> on it and generate new data. These rules can be taken as "if..then"
> statements with multiple conditions in "if". I have to check which subset of
> data satisfies these conditions to apply the "then" part.
> Eg: Transitive property. if (A<B and B<C and C<D) then A<D
>
> For implementing this I am using multiple filters. For the initial scan
> which forms the InputSplit to the maps, I put in the first filter (say
> something like get all the values which are > A). Then in the map, I would
> take in the values (say B) and for each value, I have to put in 2 more
> filters
> Filter-1: Find all values (say C) which are greater than the B in above
> step.

So is this a Scan or a Get? Normally, if you wanted to find all the rows that
have some value > B, you'd do another MR job, right? Else this is a full
table scan?

> Filter-2: For each value obtained as output (which is designated as C) of
> Filter-1, find values greater than D.
> Now take the output of Filter-2 and write it out into a table.

What's D? Or did you mean C? And to find all those values, is it another
scan/get on a third table that gives you your D, which you insert into a
fourth table?

> Since there are multiple reads involved with each row received by the map,
> it is slow. Is there any way to improve the speed? or is this type of
> problem not suitable for HBase/Hadoop?
I specifically asked you for an example involving data, because here you are showing us a potential solution that you are trying to optimize, without giving us the full problem statement. Do you really need 3-4 tables? What does your dataset look like? Can we think of another way of doing this? As far as I understand your problem right now, it won't scale, since what you are doing is basically O(n^c).
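[Editor's note: for illustration, here is a rough sketch in plain Python of the nested-filter access pattern described in the quoted message. The data, function name, and thresholds are made up, and each in-memory list comprehension stands in for a filtered HBase Scan against another table; this is not HBase code, just a model of the read pattern and why it grows as O(n^c).]

```python
# Model of the pipeline from the thread: initial scan feeds map(),
# then "Filter-1" and "Filter-2" each rescan the data per value seen.
# All names and data here are hypothetical.

def transitive_pairs(values, threshold_a):
    """Rule from the thread: if A < B and B < C and C < D then A < D."""
    reads = 0
    results = []
    # Initial scan: all values greater than A (forms the map input).
    bs = [v for v in values if v > threshold_a]
    for b in bs:                                # one map() call per B
        cs = [v for v in values if v > b]       # "Filter-1": a scan per B
        reads += 1
        for c in cs:
            ds = [v for v in values if v > c]   # "Filter-2": a scan per C
            reads += 1
            for d in ds:
                results.append((threshold_a, d))  # the "then A < D" part
    return results, reads

pairs, reads = transitive_pairs(list(range(10)), threshold_a=0)
# Each nested condition triggers a fresh scan of the dataset, so the
# number of reads grows roughly as O(n^c) for c chained conditions.
```

With 10 values this already issues 45 inner scans; doubling the data roughly quadruples the Filter-1 scans and multiplies the Filter-2 scans further, which matches the scaling concern raised above.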
