John, an important point to note here is that even though a row will get split
over multiple calls to scanner.next(), all batches of a single row will always
reach the same mapper. Another important point is that these batches will
appear in consecutive calls to mapper.map().

What this means is that you don't need to send your data to the reducer (and
you become more efficient: no writing to disk, no shuffle/sort phase, and so
on). You can just keep the state in memory for the particular row being
processed (effectively a running count of the number of columns) and make the
final decision when the row ends (i.e. you encounter a different row key, or
all rows are exhausted and you reach the cleanup function).

The way I would do it is a map-only MR job which keeps the state in memory as
described above and uses the KeyOnlyFilter to reduce the amount of data flowing
to the mapper.
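
A minimal sketch of that running-count logic (plain Java, simulating
consecutive map() calls with row-key/column-count pairs instead of real HBase
Result objects; the class and method names are illustrative, not HBase API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: simulates the per-mapper state described above. A real
// implementation would extend TableMapper and take the column count from
// the Result passed to map(); here each "batch" is just an int.
public class WideRowDetector {
    private final int threshold;          // e.g. 2500 columns
    private String currentRow = null;     // row currently being counted
    private long runningCount = 0;        // columns seen so far for currentRow
    private final List<String> wideRows = new ArrayList<>();

    public WideRowDetector(int threshold) {
        this.threshold = threshold;
    }

    // Called once per batch, mirroring consecutive mapper.map() calls.
    public void map(String rowKey, int columnsInBatch) {
        if (currentRow != null && !currentRow.equals(rowKey)) {
            finishRow();                  // row boundary: decide on previous row
        }
        currentRow = rowKey;
        runningCount += columnsInBatch;
    }

    // Mirrors the mapper's cleanup(): flush the last row.
    public void cleanup() {
        if (currentRow != null) {
            finishRow();
        }
    }

    private void finishRow() {
        if (runningCount > threshold) {
            wideRows.add(currentRow);     // would be context.write(...) in MR
        }
        currentRow = null;
        runningCount = 0;
    }

    public List<String> getWideRows() {
        return wideRows;
    }
}
```

With batch = 1000 and a 3500-column row, map() sees 1000, 1000, 1000 and then
500, and the decision is only made at the row boundary or in cleanup(), so the
final partial batch is counted rather than dropped.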
 
Regards,
Dhaval


________________________________
 From: John <[email protected]>
To: [email protected]; lars hofhansl <[email protected]> 
Sent: Friday, 25 October 2013 8:02 AM
Subject: Re: RE: Add Columnsize Filter for Scan Operation
 

One thing I could do is to drop every batch-row where the column count is
smaller than the batch size, something like: if (rowsize < batchsize) drop
row. The problem with this version is that the last batch of a big row is
also dropped. Here is a little example:
There is one row:
row1: 3500 columns

If I set the batch to 1000, the map function gets for this one row:

1. Iteration: map function gets 1000 columns -> write to disk for the reducer
2. Iteration: map function gets 1000 columns -> write to disk for the reducer
3. Iteration: map function gets 1000 columns -> write to disk for the reducer
4. Iteration: map function gets 500 columns -> dropped, because it's smaller
than the batch size
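
That split can be checked with a quick sketch (plain Java; the batch size and
column total are the numbers from the example above, everything else is
illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: how a 3500-column row is delivered when the scan batch is 1000.
public class BatchSplit {
    public static List<Integer> split(int totalColumns, int batchSize) {
        List<Integer> batches = new ArrayList<>();
        for (int remaining = totalColumns; remaining > 0; remaining -= batchSize) {
            batches.add(Math.min(remaining, batchSize));
        }
        return batches;
    }

    public static void main(String[] args) {
        // row1: 3500 columns, batch 1000 -> [1000, 1000, 1000, 500].
        // Dropping every batch smaller than 1000 loses the final 500 columns.
        System.out.println(split(3500, 1000));
    }
}
```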

Is there a way to count the columns over different map-functions?

regards



2013/10/25 John <[email protected]>

> I tried to build a MR job, but in my case that doesn't work, because if I
> set for example the batch to 1000 and there are 5000 columns in a row, I
> want to find the rows where the column count is bigger than 2500. BUT since
> the map function is executed for every batch-row, I can't tell whether the
> row has a size bigger than 2500.
>
> any ideas?
>
>
> 2013/10/25 lars hofhansl <[email protected]>
>
>> We need to finish up HBASE-8369
>>
>>
>>
>> ________________________________
>>  From: Dhaval Shah <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Thursday, October 24, 2013 4:38 PM
>> Subject: Re: RE: Add Columnsize Filter for Scan Operation
>>
>>
>> Well that depends on your use case ;)
>>
>> There are many nuances/code complexities to keep in mind:
>> - merging results of various HFiles (each region can have more than one)
>> - merging results of WAL
>> - applying delete markers
>> - what about data which is only in the memory of the region servers and
>> nowhere else
>> - applying bloom filters for efficiency
>> - what about hbase filters?
>>
>> At some point you would basically start rewriting an HBase region server
>> in your map reduce job, which is not ideal for maintainability.
>>
>> Do we ever read MySQL data files directly or issue a SQL query? Kind of
>> goes back to the same argument ;)
>>
>> Sent from Yahoo Mail on Android
>>
>
>
