>>> So this is a Scan or a Get?

I am using a Scan with a filter. It checks for the rows which satisfy the
condition, say B<C, i.e. all values which are greater than B.
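To make that concrete, the per-value scan looks roughly like this. This is a
simplified sketch against the 0.20 client API; the table name "axioms" and
the "a:val" column are placeholders, not my actual schema:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class GreaterThanScan {
  // Scan the table for every row whose "val" column is greater than b.
  static void scanGreaterThan(byte[] b) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "axioms");
    Scan scan = new Scan();
    SingleColumnValueFilter gtB = new SingleColumnValueFilter(
        Bytes.toBytes("a"), Bytes.toBytes("val"), CompareOp.GREATER, b);
    gtB.setFilterIfMissing(true);  // skip rows that lack the column entirely
    scan.setFilter(gtB);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // every r here satisfies B < C, i.e. its "val" is greater than b
      }
    } finally {
      scanner.close();
    }
  }
}

So for each B that comes into the map(), one such scan runs against the
second table.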
>>> What's D? Or you meant C? And to find all those values, is it another
>>> scan/get on a third table that gives you your D that you insert in a
>>> fourth table?

I meant C :). Yes, I am doing another scan to check the other condition (C<D).

>>> Do you really need 3-4 tables?

No, it is not required, but I did that to take away another level of
filtering. Initially I kept all the data in one table and used a
PrefixFilter to first select the possible rows on which the further filters
could be applied. In order to eliminate that filter, I moved the data into
different tables.

>>> What's your dataset like?

The data is a large set of axioms, i.e. each axiom satisfies one of the
rules. So while processing a rule, I need to filter the data and obtain the
subset which is suitable for that particular rule. For example: 1000 axioms
and 10 rules, where each of the 1000 axioms satisfies one of the rules. Say
I am processing rule-3 and it is applicable to only 200 axioms; I use the
filters (mentioned in previous mails) to obtain that subset. In general
terms, I think it can be stated as: given some data, I have to obtain the
subset of the data which satisfies several constraints and do some
processing on that subset.

How frequently are filters and reads used in an MR job? I think my case is
an extreme one, since I use filters on each row that is mapped (a
stripped-down sketch of my mapper is below). In general, are filters rarely
used from within a map()?
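To show the shape of it, here is a stripped-down sketch of the mapper. Again
the table names ("table2", "table3") and the "a:val" column are placeholders,
and the rule-specific details are left out:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class RuleMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private static final byte[] FAM = Bytes.toBytes("a");
  private static final byte[] QUAL = Bytes.toBytes("val");
  private HTable table2;  // holds the candidate C values
  private HTable table3;  // holds the candidate D values

  @Override
  protected void setup(Context context) throws IOException {
    table2 = new HTable(new HBaseConfiguration(), "table2");
    table3 = new HTable(new HBaseConfiguration(), "table3");
  }

  @Override
  public void map(ImmutableBytesWritable row, Result mapped, Context context)
      throws IOException, InterruptedException {
    byte[] b = mapped.getValue(FAM, QUAL);

    // Filter-1: find all C with C > B. (Back when everything was in one
    // table, a PrefixFilter was also ANDed in here via a FilterList.)
    Scan scanC = new Scan();
    scanC.setFilter(new SingleColumnValueFilter(FAM, QUAL, CompareOp.GREATER, b));
    ResultScanner cs = table2.getScanner(scanC);
    try {
      for (Result c : cs) {
        byte[] cVal = c.getValue(FAM, QUAL);

        // Filter-2: find all D with D > C.
        Scan scanD = new Scan();
        scanD.setFilter(new SingleColumnValueFilter(FAM, QUAL, CompareOp.GREATER, cVal));
        ResultScanner ds = table3.getScanner(scanD);
        try {
          for (Result d : ds) {
            // emit the derived axiom A < D (real row-key construction elided)
            Put put = new Put(row.get());
            put.add(FAM, QUAL, d.getValue(FAM, QUAL));
            context.write(row, put);
          }
        } finally {
          ds.close();
        }
      }
    } finally {
      cs.close();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table2.close();
    table3.close();
  }
}

As you can see, every mapped row triggers a scan, and every hit of that scan
triggers another one.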
Regards,
Raghava.

On Wed, Jun 23, 2010 at 1:39 PM, Jean-Daniel Cryans <[email protected]> wrote:

> Inline.
>
> J-D
>
> On Wed, Jun 23, 2010 at 10:16 AM, Raghava Mutharaju
> <[email protected]> wrote:
> > Hello JD,
> >
> > Thank you for the response.
> >
> >>>> Are the Gets done on the same row that is mapped? Or on the same
> >>>> table? Or another table?
> > By row that is mapped, does it mean the row that is given to the map()
> > method as a <K,V> pair? Then no, the data from this row is used to
> > construct a filter which is applied on another table. No, it is not on
> > the same table that this row has come from.
>
> Ok thanks
>
> >>>> Can you give a real example of what you are trying to achieve?
> > It is similar to a rule engine. I have to take in the data, apply some
> > rules on it and generate new data. These rules can be taken as
> > "if..then" statements with multiple conditions in the "if". I have to
> > check which subset of the data satisfies these conditions to apply the
> > "then" part.
> > Eg: Transitive property. if (A<B and B<C and C<D) then A<D
> >
> > For implementing this I am using multiple filters. For the initial scan
> > which forms the InputSplit to the maps, I put in the first filter (say
> > something like get all the values which are > A). Then in the map, I
> > would take in the values (say B) and for each value, I have to put in 2
> > more filters.
> > Filter-1: Find all values (say C) which are greater than the B in the
> > above step.
>
> So this is a Scan or a Get? Normally if you'd want to find all the
> rows that have some value > B then you'd do another MR job, right? Else
> this is a full table scan?
>
> > Filter-2: For each value obtained as output (which is designated as C)
> > of Filter-1, find values greater than D.
> > Now take the output of Filter-2 and write it out into a table.
>
> What's D? Or you meant C? And to find all those values, is it another
> scan/get on a third table that gives you your D that you insert in a
> fourth table?
>
> > Since there are multiple reads involved with each row received by the
> > map, it is slow. Is there any way to improve the speed? Or is this type
> > of problem not suitable for HBase/Hadoop?
>
> I specifically asked you for an example involving data, because here
> you show us a potential solution that you are trying to optimize while
> not giving us the full problem statement. Do you really need 3-4 tables?
> What's your dataset like? Can we think of another way of doing this?
> Currently, as far as I understand your problem, it won't scale, since
> what you are doing is basically O(n^c).
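
P.S. In case it helps in judging the O(n^c) concern, this is roughly how the
job for one rule is wired up, with the first filter attached to the initial
scan. Once more this is a sketch with placeholder table/column names:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class RuleJob {
  public static void main(String[] args) throws Exception {
    byte[] a = Bytes.toBytes(args[0]);  // the "A" bound for the rule being run

    Job job = new Job(new HBaseConfiguration(), "rule-3");
    job.setJarByClass(RuleJob.class);

    // the initial scan that feeds the maps: "get all the values which are > A"
    Scan initial = new Scan();
    initial.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("a"), Bytes.toBytes("val"), CompareOp.GREATER, a));
    initial.setCaching(500);        // hand rows to the mappers in batches
    initial.setCacheBlocks(false);  // don't churn the block cache from MR

    TableMapReduceUtil.initTableMapperJob("axioms", initial, RuleMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    // map-only job: the Puts emitted by the mapper go straight to "results"
    TableMapReduceUtil.initTableReducerJob("results", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}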
