One thing I could do is drop every batch-row whose column count is smaller than the batch size. Something like: if (rowsize < batchsize) drop row. The problem with this version is that the last batch of a big row is also dropped.

Here is a little example. There is one row:

row1: 3500 columns
If I set the batch to 1000, the map function gets the following for the first row:

1. Iteration: map function gets 1000 columns -> write to disk for the reducer
2. Iteration: map function gets 1000 columns -> write to disk for the reducer
3. Iteration: map function gets 1000 columns -> write to disk for the reducer
4. Iteration: map function gets 500 columns -> dropped, because it's smaller than the batch size

Is there a way to count the columns across different map function calls?

regards

2013/10/25 John <[email protected]>

> I tried to build an MR job, but in my case that doesn't work. Say I set
> the batch to 1000 and there are 5000 columns in a row. I want to
> generate something for rows where the column count is bigger than 2500.
> BUT since the map function is executed for every batch-row, I can't tell
> whether the row has more than 2500 columns.
>
> Any ideas?
>
>
> 2013/10/25 lars hofhansl <[email protected]>
>
>> We need to finish up HBASE-8369
>>
>>
>>
>> ________________________________
>> From: Dhaval Shah <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Thursday, October 24, 2013 4:38 PM
>> Subject: Re: RE: Add Columnsize Filter for Scan Operation
>>
>>
>> Well that depends on your use case ;)
>>
>> There are many nuances/code complexities to keep in mind:
>> - merging results of various HFiles (each region can have more than one)
>> - merging results of the WAL
>> - applying delete markers
>> - what about data which is only in the memory of region servers and
>>   nowhere else
>> - applying bloom filters for efficiency
>> - what about HBase filters?
>>
>> At some point you would basically start rewriting an HBase region server
>> in your MapReduce job, which is not ideal for maintainability.
>>
>> Do we ever read MySQL data files directly, or do we issue a SQL query?
>> Kind of goes back to the same argument ;)
>>
>> Sent from Yahoo Mail on Android
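On the question of counting columns across map calls: one possible approach is to keep a running count in the mapper instance and only decide once the row key changes. Below is a minimal sketch in plain Java, without HBase on the classpath, so it only simulates the calls. The class BatchedRowCounter and its methods are hypothetical names. In a real job, onBatch() would correspond to map() (row key from the ImmutableBytesWritable key, column count from Result.size()), and finish() to Mapper.cleanup(). It assumes that all batches of one row reach the same mapper one after another, which should hold when splits follow region boundaries, but please verify that for your setup.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: counts columns per row across consecutive batch calls.
 * Hypothetical helper, not part of HBase or Hadoop.
 */
class BatchedRowCounter {
    private final int threshold;                 // e.g. 2500 columns
    private final List<String> emitted = new ArrayList<>();
    private String currentRow = null;            // row key seen in the previous call
    private int currentCount = 0;                // columns seen so far for currentRow

    BatchedRowCounter(int threshold) {
        this.threshold = threshold;
    }

    /** Called once per batch, like map() when Scan.setBatch(n) is set. */
    void onBatch(String rowKey, int columnsInBatch) {
        // A new row key means the previous row is complete: decide now.
        if (currentRow != null && !currentRow.equals(rowKey)) {
            flush();
        }
        currentRow = rowKey;
        currentCount += columnsInBatch;
    }

    /** Called once at the end of the input, like Mapper.cleanup(). */
    void finish() {
        if (currentRow != null) {
            flush();
        }
    }

    /** Emits the finished row if it is big enough, then resets the state. */
    private void flush() {
        if (currentCount > threshold) {
            emitted.add(currentRow + ":" + currentCount);
        }
        currentRow = null;
        currentCount = 0;
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        BatchedRowCounter counter = new BatchedRowCounter(2500);
        // row1 arrives as 1000 + 1000 + 1000 + 500 columns, as in the example above.
        for (int n : new int[] {1000, 1000, 1000, 500}) {
            counter.onBatch("row1", n);
        }
        counter.onBatch("row2", 800);            // small row, stays below the threshold
        counter.finish();
        System.out.println(counter.emitted());   // prints [row1:3500]
    }
}
```

Note that the total is only known after the last batch of a row, so the earlier batches would have to be buffered in the mapper, or written out together with the row key so that the reducer can drop rows that turn out to be too small.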
