>>> So this is a Scan or a Get?

I am using a Scan with a filter. It checks for the rows which satisfy the
condition, say B<C, i.e. all values which are greater than B.
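To make that concrete, the per-value scan looks roughly like this. This is a
simplified sketch against the 0.20 client API; the table name "axioms" and
the "a:val" column are placeholders, not my actual schema:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class GreaterThanScan {
  // Scan the table for every row whose "val" column is greater than b.
  static void scanGreaterThan(byte[] b) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "axioms");
    Scan scan = new Scan();
    SingleColumnValueFilter gtB = new SingleColumnValueFilter(
        Bytes.toBytes("a"), Bytes.toBytes("val"), CompareOp.GREATER, b);
    gtB.setFilterIfMissing(true);  // skip rows that lack the column entirely
    scan.setFilter(gtB);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // every r here satisfies B < C, i.e. its "val" is greater than b
      }
    } finally {
      scanner.close();
    }
  }
}

So for each B that comes into the map(), one such scan runs against the
second table.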
>>> What's D? Or you meant C? And to find all those values, is it another
>>> scan/get on a third table that gives you your D that you insert in a
>>> fourth table?

I meant C :). Yes, I am doing another scan to check the other condition (C<D).

>>> Do you really need 3-4 tables?

No, it is not required, but I did that to take away another level of
filtering. Initially I kept all the data in one table and used a
PrefixFilter to first select the possible rows on which the further filters
could be applied. In order to eliminate that filter, I moved the data into
different tables.

>>> What's your dataset like?

The data is a large set of axioms, i.e. each axiom satisfies one of the
rules. So while processing a rule, I need to filter the data and obtain the
subset which is suitable for that particular rule. For example: 1000 axioms
and 10 rules, where each of the 1000 axioms satisfies one of the rules. Say
I am processing rule-3 and it is applicable to only 200 axioms; I use the
filters (mentioned in previous mails) to obtain that subset. In general
terms, I think it can be stated as: given some data, I have to obtain the
subset of the data which satisfies several constraints and do some
processing on that subset.

How frequently are filters and reads used in an MR job? I think my case is
an extreme one, since I use filters on each row that is mapped (a
stripped-down sketch of my mapper is below). In general, are filters rarely
used from within a map()?
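To show the shape of it, here is a stripped-down sketch of the mapper. Again
the table names ("table2", "table3") and the "a:val" column are placeholders,
and the rule-specific details are left out:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class RuleMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private static final byte[] FAM = Bytes.toBytes("a");
  private static final byte[] QUAL = Bytes.toBytes("val");
  private HTable table2;  // holds the candidate C values
  private HTable table3;  // holds the candidate D values

  @Override
  protected void setup(Context context) throws IOException {
    table2 = new HTable(new HBaseConfiguration(), "table2");
    table3 = new HTable(new HBaseConfiguration(), "table3");
  }

  @Override
  public void map(ImmutableBytesWritable row, Result mapped, Context context)
      throws IOException, InterruptedException {
    byte[] b = mapped.getValue(FAM, QUAL);

    // Filter-1: find all C with C > B. (Back when everything was in one
    // table, a PrefixFilter was also ANDed in here via a FilterList.)
    Scan scanC = new Scan();
    scanC.setFilter(new SingleColumnValueFilter(FAM, QUAL, CompareOp.GREATER, b));
    ResultScanner cs = table2.getScanner(scanC);
    try {
      for (Result c : cs) {
        byte[] cVal = c.getValue(FAM, QUAL);

        // Filter-2: find all D with D > C.
        Scan scanD = new Scan();
        scanD.setFilter(new SingleColumnValueFilter(FAM, QUAL, CompareOp.GREATER, cVal));
        ResultScanner ds = table3.getScanner(scanD);
        try {
          for (Result d : ds) {
            // emit the derived axiom A < D (real row-key construction elided)
            Put put = new Put(row.get());
            put.add(FAM, QUAL, d.getValue(FAM, QUAL));
            context.write(row, put);
          }
        } finally {
          ds.close();
        }
      }
    } finally {
      cs.close();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table2.close();
    table3.close();
  }
}

As you can see, every mapped row triggers a scan, and every hit of that scan
triggers another one.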
Regards,
Raghava.

On Wed, Jun 23, 2010 at 1:39 PM, Jean-Daniel Cryans <[email protected]> wrote:

> Inline.
>
> J-D
>
> On Wed, Jun 23, 2010 at 10:16 AM, Raghava Mutharaju
> <[email protected]> wrote:
> > Hello JD,
> >
> > Thank you for the response.
> >
> >>>> Are the Gets done on the same row that is mapped? Or on the same
> >>>> table? Or another table?
> > By row that is mapped, does it mean the row that is given to the map()
> > method as a <K,V> pair? Then no, the data from this row is used to
> > construct a filter which is applied on another table. No, it is not on
> > the same table that this row has come from.
>
> Ok thanks
>
> >>>> Can you give a real example of what you are trying to achieve?
> > It is similar to a rule engine. I have to take in the data, apply some
> > rules on it and generate new data. These rules can be taken as
> > "if..then" statements with multiple conditions in the "if". I have to
> > check which subset of the data satisfies these conditions to apply the
> > "then" part.
> > Eg: Transitive property. if (A<B and B<C and C<D) then A<D
> >
> > For implementing this I am using multiple filters. For the initial scan
> > which forms the InputSplit to the maps, I put in the first filter (say
> > something like get all the values which are > A). Then in the map, I
> > would take in the values (say B) and for each value, I have to put in 2
> > more filters.
> > Filter-1: Find all values (say C) which are greater than the B in the
> > above step.
>
> So this is a Scan or a Get? Normally if you'd want to find all the
> rows that have some value > B then you'd do another MR job, right? Else
> this is a full table scan?
>
> > Filter-2: For each value obtained as output (which is designated as C)
> > of Filter-1, find values greater than D.
> > Now take the output of Filter-2 and write it out into a table.
>
> What's D? Or you meant C? And to find all those values, is it another
> scan/get on a third table that gives you your D that you insert in a
> fourth table?
>
> > Since there are multiple reads involved with each row received by the
> > map, it is slow. Is there any way to improve the speed? Or is this type
> > of problem not suitable for HBase/Hadoop?
>
> I specifically asked you for an example involving data, because here
> you show us a potential solution that you are trying to optimize while
> not giving us the full problem statement. Do you really need 3-4 tables?
> What's your dataset like? Can we think of another way of doing this?
> Currently, as far as I understand your problem, it won't scale, since
> what you are doing is basically O(n^c).
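
P.S. In case it helps in judging the O(n^c) concern, this is roughly how the
job for one rule is wired up, with the first filter attached to the initial
scan. Once more this is a sketch with placeholder table/column names:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class RuleJob {
  public static void main(String[] args) throws Exception {
    byte[] a = Bytes.toBytes(args[0]);  // the "A" bound for the rule being run

    Job job = new Job(new HBaseConfiguration(), "rule-3");
    job.setJarByClass(RuleJob.class);

    // the initial scan that feeds the maps: "get all the values which are > A"
    Scan initial = new Scan();
    initial.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("a"), Bytes.toBytes("val"), CompareOp.GREATER, a));
    initial.setCaching(500);        // hand rows to the mappers in batches
    initial.setCacheBlocks(false);  // don't churn the block cache from MR

    TableMapReduceUtil.initTableMapperJob("axioms", initial, RuleMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    // map-only job: the Puts emitted by the mapper go straight to "results"
    TableMapReduceUtil.initTableReducerJob("results", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}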
