John, an important point to note here is that even though rows will get split over multiple calls to scanner.next(), all batches of a single row will always reach a single mapper. Another important point to note is that these batches will appear in consecutive calls to mapper.map().
What this means is that you don't need to send your data to the reducer (and you are more efficient by not writing to disk, having no shuffle/sort phases, and so on). You can just keep the state in memory for the particular row being processed (effectively a running count of the number of columns) and make the final decision when the row ends (that is, when you encounter a different row, or when all rows are exhausted and you reach the cleanup function).

The way I would do it is a map-only MR job which keeps the state in memory as described above and uses the KeyOnlyFilter to reduce the amount of data flowing to the mapper (see the sketch after this thread).

Regards,
Dhaval

________________________________
From: John <[email protected]>
To: [email protected]; lars hofhansl <[email protected]>
Sent: Friday, 25 October 2013 8:02 AM
Subject: Re: RE: Add Columnsize Filter for Scan Operation

One thing I could do is to drop every batch-row where the column count is smaller than the batch size, something like: if (rowsize < batchsize - 1) drop row. The problem with this version is that the last batch of a big row is also dropped. Here is a little example:

There is one row, row1, with 3500 columns. If I set the batch to 1000, the map function is called four times for this row:

1. Iteration: map function gets 1000 columns -> write to disk for the reducer
2. Iteration: map function gets 1000 columns -> write to disk for the reducer
3. Iteration: map function gets 1000 columns -> write to disk for the reducer
4. Iteration: map function gets 500 columns -> dropped, because it's smaller than the batch size

Is there a way to count the columns across different map calls?

regards

2013/10/25 John <[email protected]>

> I tried to build an MR job, but in my case that doesn't work. Because if I
> set, for example, the batch to 1000 and there are 5000 columns in a row, I
> want to find rows where the column count is bigger than 2500. BUT since
> the map function is executed for every batch of a row, I can't tell
> whether the whole row has more than 2500 columns.
>
> any ideas?
>
>
> 2013/10/25 lars hofhansl <[email protected]>
>
>> We need to finish up HBASE-8369.
>>
>>
>>
>> ________________________________
>> From: Dhaval Shah <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Thursday, October 24, 2013 4:38 PM
>> Subject: Re: RE: Add Columnsize Filter for Scan Operation
>>
>>
>> Well, that depends on your use case ;)
>>
>> There are many nuances/code complexities to keep in mind:
>> - merging results of various HFiles (each region can have more than one)
>> - merging results of the WAL
>> - applying delete markers
>> - how about data which is only in memory of region servers and nowhere
>> else
>> - applying bloom filters for efficiency
>> - what about hbase filters?
>>
>> At some point you would basically start rewriting an HBase region server
>> in your map reduce job, which is not ideal for maintainability.
>>
>> Do we ever read MySQL data files directly, or issue a SQL query? Kind of
>> goes back to the same argument ;)
>>
>> Sent from Yahoo Mail on Android
>>
>
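A minimal sketch of the map-only approach described above, assuming the HBase 0.94-era TableMapper API (current at the time of this thread) and relying on Dhaval's point that all batches of one row reach the same mapper in consecutive map() calls. The class name WideRowFinder, the table name "mytable", the batch/caching sizes, and the THRESHOLD cutoff of 2500 are illustrative placeholders, not from the thread:

// Sketch: map-only job that counts columns per row across scanner batches
// and emits rows wider than a threshold. No reducer, so no shuffle/sort.
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WideRowFinder extends TableMapper<Text, LongWritable> {

  private static final long THRESHOLD = 2500; // illustrative cutoff

  private byte[] currentRow = null; // row currently being counted
  private long columnCount = 0;     // running count across batches

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    byte[] row = key.copyBytes();
    // Batches of one row arrive in consecutive map() calls, so a change
    // in row key means the previous row is complete.
    if (currentRow != null && !Arrays.equals(currentRow, row)) {
      flush(context);
    }
    currentRow = row;
    columnCount += value.size(); // number of cells (columns) in this batch
  }

  private void flush(Context context) throws IOException, InterruptedException {
    if (columnCount > THRESHOLD) {
      context.write(new Text(Bytes.toStringBinary(currentRow)),
          new LongWritable(columnCount));
    }
    columnCount = 0;
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // The last row ends when the scan is exhausted, not with a key change.
    if (currentRow != null) {
      flush(context);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "wide-row-finder");
    job.setJarByClass(WideRowFinder.class);

    Scan scan = new Scan();
    scan.setBatch(1000);                 // split wide rows into batches
    scan.setCaching(100);
    scan.setFilter(new KeyOnlyFilter()); // keys only; values stay server-side

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        WideRowFinder.class, Text.class, LongWritable.class, job);
    job.setNumReduceTasks(0);            // map-only: no shuffle/sort
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In John's example (one row of 3500 columns with a batch of 1000), the four batches simply accumulate into columnCount; the final 500-column batch is not dropped, because the decision is deferred until the row key changes or the scan is exhausted in cleanup().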
