One thing I could do is drop every batch-row whose column count is
smaller than the batch size, something like: if (rowsize < batchsize) drop
the row. The problem with this approach is that the last batch of a big row
is also dropped. Here is a little example:
There is one row:
row1: 3500 columns

If I set the batch size to 1000, the map function receives for this row:

1. iteration: map function gets 1000 columns -> write to disk for the reducer
2. iteration: map function gets 1000 columns -> write to disk for the reducer
3. iteration: map function gets 1000 columns -> write to disk for the reducer
4. iteration: map function gets 500 columns -> dropped, because it is smaller
than the batch size
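The flaw in that heuristic can be simulated outside Hadoop. This is just an
illustrative sketch (plain Python, hypothetical function names, not the
HBase/Hadoop API) showing that the naive filter silently loses the final
partial batch of a wide row:

```python
def batches(total_columns, batch_size):
    """Yield the column count each map invocation would see for one row."""
    remaining = total_columns
    while remaining > 0:
        yield min(batch_size, remaining)
        remaining -= batch_size

def naive_filter(total_columns, batch_size):
    """Drop every batch smaller than the batch size (the flawed heuristic)
    and return the total number of columns that survive."""
    kept = [b for b in batches(total_columns, batch_size) if b >= batch_size]
    return sum(kept)

# row1 has 3500 columns; with a batch size of 1000 the 500-column tail is lost.
print(naive_filter(3500, 1000))  # 3000, not 3500
```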

Is there a way to count the columns across different map-function invocations?
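For reference, the usual MapReduce answer to counting across invocations is to
make the map function emit (rowKey, batchColumnCount) pairs and let the reducer
sum them per row key. A minimal simulation of that pattern (plain Python,
illustrative names only, not the actual Hadoop API):

```python
from collections import defaultdict

def map_phase(batched_rows):
    """Each (row_key, batch) pair stands for one map invocation.
    Emit (row_key, column_count) instead of dropping anything."""
    for row_key, batch_columns in batched_rows:
        yield row_key, len(batch_columns)

def reduce_phase(pairs, min_columns):
    """Sum the per-batch counts per row key, then keep only rows
    whose total column count exceeds min_columns."""
    totals = defaultdict(int)
    for row_key, count in pairs:
        totals[row_key] += count
    return {k: v for k, v in totals.items() if v > min_columns}

# row1: 3500 columns delivered in batches of 1000; row2: one small batch.
batched = [("row1", ["c"] * n) for n in (1000, 1000, 1000, 500)]
batched += [("row2", ["c"] * 800)]
print(reduce_phase(map_phase(batched), 2500))  # {'row1': 3500}
```

This way the 500-column tail batch contributes to the total instead of being
dropped, and the size threshold is applied only once the full count is known.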

regards


2013/10/25 John <[email protected]>

> I tried to build an MR job, but in my case that doesn't work, because if I
> set, for example, the batch to 1000 and there are 5000 columns in the row, I
> want to keep only rows where the column count is bigger than 2500. BUT since
> the map function is executed for every batch-row, I can't tell whether the
> row has more than 2500 columns.
>
> any ideas?
>
>
> 2013/10/25 lars hofhansl <[email protected]>
>
>> We need to finish up HBASE-8369
>>
>>
>>
>> ________________________________
>>  From: Dhaval Shah <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Thursday, October 24, 2013 4:38 PM
>> Subject: Re: RE: Add Columnsize Filter for Scan Operation
>>
>>
>> Well that depends on your use case ;)
>>
>> There are many nuances/code complexities to keep in mind:
>> - merging results of various HFiles (each region can have more than one)
>> - merging results of WAL
>> - applying delete markers
>> - handling data that exists only in the memory of the region servers and
>> nowhere else
>> - applying bloom filters for efficiency
>> - what about hbase filters?
>>
>> At some point you would basically be rewriting an HBase region server
>> in your MapReduce job, which is not ideal for maintainability.
>>
>> Do we ever read MySQL data files directly or issue a SQL query? Kind of
>> goes back to the same argument ;)
>>
>> Sent from Yahoo Mail on Android
>>
>
>
