I'm still confused by 2 things:

 - Are the Gets done on the same row that is mapped? Or on the same
table? Or another table?
 - Can you give a real example of what you are trying to achieve?
(with fake data)

Thx

J-D

On Tue, Jun 22, 2010 at 10:51 AM, Raghava Mutharaju
<m.vijayaragh...@gmail.com> wrote:
> I will try & explain better this time.
>
> The overall objective: from the complete dataset, obtain a subset of it to
> work on. This subset is obtained by applying 2-3 conditions (filters),
> where setting up each filter depends on the output of the previous one.
> It works as follows:
>
> Filter-1: set up with the scan that is used for the map (see the sketch
> below).
> Filter-2: from each row that comes into the map, extract some fields and
> create a ColumnFilter/ValueFilter out of them. A row is a delimited set
> of values.
> Filter-3: apply Filter-2 and, from its output, extract the required
> fields and do some processing. Then write the result back to an HBase
> table.
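>
> For Filter-1, the job setup would look roughly like this (a minimal
> sketch; the table name, column names, and the value "A" are made up for
> illustration, and the org.apache.hadoop.hbase.* imports are omitted):
>
>   Job job = new Job(conf, "subset-job");
>   Scan scan = new Scan();
>   // Filter-1: keep only rows whose info:type column equals "A"
>   scan.setFilter(new SingleColumnValueFilter(
>       Bytes.toBytes("info"), Bytes.toBytes("type"),
>       CompareFilter.CompareOp.EQUAL, Bytes.toBytes("A")));
>   TableMapReduceUtil.initTableMapperJob(
>       "mytable", scan, SubsetMapper.class,
>       ImmutableBytesWritable.class, Result.class, job);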
>
> Filters 2 and 3 are used within the map, so I am issuing 1-2 Gets per row
> that the map receives. I cannot apply all the filters beforehand because
> each subsequent filter has to be created from the previous filter's
> output.
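>
> Inside the map, the per-row reads would look something like this (again a
> sketch; the family/qualifier names are made up, and rowKeyFrom() and
> process() are hypothetical helpers standing in for my field extraction):
>
>   // inside a TableMapper subclass; otherTable and outputTable
>   // are HTable instances opened in setup()
>   public void map(ImmutableBytesWritable key, Result row, Context context)
>       throws IOException, InterruptedException {
>     // Filter-2: build a filter from fields of the incoming row
>     byte[] field = row.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"));
>     Get get = new Get(rowKeyFrom(field));            // hypothetical helper
>     get.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
>         new BinaryComparator(field)));
>     Result r2 = otherTable.get(get);
>
>     // Filter-3: extract fields from Filter-2's output, process them,
>     // and write the result back to an HBase table
>     if (r2 != null && !r2.isEmpty()) {
>       Put put = new Put(r2.getRow());
>       put.add(Bytes.toBytes("out"), Bytes.toBytes("result"),
>           process(r2));                              // hypothetical helper
>       outputTable.put(put);
>     }
>   }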
>
> Yes, there would be more data eventually. But currently I am testing on
> data that occupies only a single region, so only 1 map runs on the
> cluster and it takes in all the data.
>
> This approach is slow and it shows in the results. Is there any way this
> can be achieved with much better performance?
>
> Thank you.
>
> Regards,
> Raghava.
>
> On Tue, Jun 22, 2010 at 12:57 PM, Jean-Daniel Cryans 
> <jdcry...@apache.org> wrote:
>
>> This is not super clear, some comments inline.
>>
>> J-D
>>
>> On Tue, Jun 22, 2010 at 12:49 AM, Raghava Mutharaju
>> <m.vijayaragh...@gmail.com> wrote:
>> > Hello all,
>> >
>> >      In the data, I have to check for multiple conditions and then work
>> > with the data that satisfies all the conditions. I am doing this as an MR
>> > job with no reduce and the conditions are translated to a set of filters.
>> > Among the multiple conditions (2 or 3 max), data that satisfies one of
>> them
>> > would come as input to the Map (initial filter is set in the scan to the
>> > mappers). Now, from among the dataset that comes through to each map, I
>> > would check for other conditions (1 or 2 remaining conditions). Since
>> map()
>> > is called for each row of data, it would mean 1 or 2 read calls (with
>> > filter) to HBase tables. This setup, even for small data (data would fit
>> in
>>
>> Here you talk about checking 1-2 conditions... are they checked on
>> the row that was mapped? Or does that mean you are doing 1-2 Gets
>> per row? If so, this is definitely going to be slow!
>>
>> > a region and so only 1 map is taking in all the data) is very slow.
>>
>> What do you mean? That currently your test is done on 1 region but you
>> expect more? If not, then don't use MR since that would give you
>> nothing more than more code to write and more processing time.
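>>
>> For a single region you could do the same work with a plain client-side
>> scan, no MR at all. A minimal sketch (the table name and filter are
>> placeholders for whatever you are actually using):
>>
>>   HTable table = new HTable(conf, "mytable");
>>   Scan scan = new Scan();
>>   scan.setFilter(firstFilter);  // same condition as your Filter-1
>>   ResultScanner scanner = table.getScanner(scan);
>>   try {
>>     for (Result row : scanner) {
>>       // check the remaining conditions here, in the client
>>     }
>>   } finally {
>>     scanner.close();
>>   }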
>>
>> >
>> > Here, note that I shouldn't be filtering the incoming data to the map;
>> > based on that data, the next set of filtering conditions would be
>> > formed.
>>
>> Can you give an example?
>>
>> >
>> > Can this be improved? Would constructing secondary indexes help (I
>> > would need a dramatic improvement actually)? Or is this type of problem
>> > not suitable for HBase?
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Raghava.
>> >
>>
>
