I'm still confused by 2 things:

- Are the Gets done on the same row that is mapped? Or on the same table? Or
  another table?
- Can you give a real example of what you are trying to achieve? (with fake
  data)
Thx,

J-D

On Tue, Jun 22, 2010 at 10:51 AM, Raghava Mutharaju
<m.vijayaragh...@gmail.com> wrote:
> I will try and explain better this time.
>
> The overall objective: from the complete dataset, obtain a subset of it to
> work on. This subset is obtained by applying 2-3 conditions (filters),
> where setting up each filter depends on the output of the previous one. It
> works as follows:
>
> Filter-1: Set up with the scan that feeds the map.
> Filter-2: From each row coming into the map, extract some fields and
> create a ColumnFilter/ValueFilter out of them. A row is a delimited set of
> values.
> Filter-3: Apply filter-2 and, from its output, extract the required fields
> and do some processing. Then write the result back to an HBase table.
>
> Filters 2 and 3 are used within the map, so I am issuing 1-2 Gets per row
> that the map receives. I cannot apply all the filters beforehand because
> each subsequent filter has to be created from the previous filter's
> output.
>
> Yes, there will be more data. But currently I am testing on data that
> occupies only a single region, so only 1 map runs on the cluster and it
> takes in all the data.
>
> This approach is slow, and it shows in the results. Is there any way this
> can be achieved with much better performance?
>
> Thank you.
>
> Regards,
> Raghava.
>
> On Tue, Jun 22, 2010 at 12:57 PM, Jean-Daniel Cryans
> <jdcry...@apache.org> wrote:
>
>> This is not super clear, some comments inline.
>>
>> J-D
>>
>> On Tue, Jun 22, 2010 at 12:49 AM, Raghava Mutharaju
>> <m.vijayaragh...@gmail.com> wrote:
>>> Hello all,
>>>
>>> In the data, I have to check for multiple conditions and then work
>>> with the data that satisfies all of them. I am doing this as an MR job
>>> with no reduce, and the conditions are translated to a set of filters.
>>> Among the multiple conditions (2 or 3 max), data that satisfies one of
>>> them comes as input to the map (the initial filter is set in the scan
>>> for the mappers). Then, from the dataset that comes through to each
>>> map, I check the other conditions (the 1 or 2 remaining ones). Since
>>> map() is called for each row of data, this means 1 or 2 read calls
>>> (with a filter) to HBase tables. This setup, even for small data (data
>>> that fits
>>
>> Here you talk about checking 1-2 conditions... are they checked on
>> the row that was mapped? Or does that mean you are doing 1-2 Gets
>> per row? If so, this is definitely going to be slow!
>>
>>> in a region, so only 1 map takes in all the data) is very slow.
>>
>> What do you mean? That currently your test is done on 1 region but you
>> expect more? If not, then don't use MR, since that would give you
>> nothing more than more code to write and more processing time.
>>
>>> Here, note that I shouldn't be filtering the incoming data to the map;
>>> rather, based on that data, the next set of filtering conditions is
>>> formed.
>>
>> Can you give an example?
>>
>>> Can this be improved? Would constructing secondary indexes help (I
>>> would need a dramatic improvement)? Or is this type of problem not
>>> suitable for HBase?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Raghava.
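[Editor's note: to make the dependent-filter pattern from the thread concrete, here is a minimal, self-contained sketch in plain Java. An in-memory map stands in for the HBase tables, and every row key, column value, and the "lookup:" key scheme are made-up illustrative data, not anything from the original job. It shows the shape Raghava describes: Filter-1 selects rows at scan time, then each mapped row drives a second lookup (the per-row Get that J-D flags as slow), and Filter-3 acts on that lookup's result.]

```java
import java.util.*;

public class DependentFilterSketch {
    // Plain-Java stand-in for the pattern in the thread. "table" simulates
    // an HBase table; all keys and values below are fake example data.
    static List<String> run() {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("row1", "typeA|cust42|100");
        table.put("row2", "typeB|cust42|200");
        table.put("row3", "typeA|cust99|300");
        table.put("lookup:cust42", "GOLD");
        table.put("lookup:cust99", "SILVER");

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> row : table.entrySet()) {
            String val = row.getValue();
            // Filter-1: the scan-level filter -- only "typeA" rows reach map().
            if (!val.startsWith("typeA|")) continue;
            // Filter-2: built from the mapped row's own fields. In the real
            // job this is a Get with a ValueFilter -- one RPC per mapped row,
            // which is why the job is slow.
            String cust = val.split("\\|")[1];
            String tier = table.get("lookup:" + cust); // stands in for the Get
            // Filter-3: act on the Get's result, then write back to HBase.
            if ("GOLD".equals(tier)) out.add(row.getKey() + "->" + tier);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [row1->GOLD]
    }
}
```

With 3 mapped rows this issues 2 extra lookups; on a real table with N mapped rows it is N round-trips, which is the cost J-D is pointing at.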