@Dhaval: Thanks! I didn't know that. I've now created a field in the Mapper class which stores information about the previous map() call. That works fine for me.
regards, john

2013/10/25 Dhaval Shah &lt;[email protected]&gt;

> John, an important point to note here is that even though rows will get
> split over multiple calls to scanner.next(), all batches of one row will
> always reach one mapper. Another important point to note is that these
> batches will appear in consecutive calls to mapper.map().
>
> What this means is that you don't need to send your data to the reducer
> (and you will be more efficient by not writing to disk, with no
> shuffle/sort phases and so on). You can just keep the state in memory for
> the particular row being processed (effectively a running count of the
> number of columns) and make the final decision when the row ends
> (effectively, you encounter a different row, or all rows are exhausted
> and you reach the cleanup function).
>
> The way I would do it is a map-only MR job which keeps the state in
> memory as described above and uses the KeyOnlyFilter to reduce the amount
> of data flowing to the mapper.
>
> Regards,
> Dhaval
>
> ________________________________
> From: John &lt;[email protected]&gt;
> To: [email protected]; lars hofhansl &lt;[email protected]&gt;
> Sent: Friday, 25 October 2013 8:02 AM
> Subject: Re: RE: Add Columnsize Filter for Scan Operation
>
> One thing I could do is to drop every batch-row where the column count is
> smaller than the batch size, something like: if (rowsize < batchsize - 1)
> drop row. The problem with this version is that the last batch of a big
> row is also dropped. Here is a little example:
>
> There is one row:
> row1: 3500 columns
>
> If I set the batch to 1000, the map function gets for this row:
>
> 1. iteration: map function gets 1000 columns -> write to disk for the
> reducer
> 2. iteration: map function gets 1000 columns -> write to disk for the
> reducer
> 3. iteration: map function gets 1000 columns -> write to disk for the
> reducer
> 4. iteration: map function gets 500 columns -> drop, because it's smaller
> than the batch size
>
> Is there a way to count the columns across different map() calls?
>
> regards
>
>
> 2013/10/25 John &lt;[email protected]&gt;
>
> > I tried to build an MR job, but in my case that doesn't work, because
> > if I set, for example, the batch to 1000 and there are 5000 columns in
> > a row, I want to find rows where the column size is bigger than 2500.
> > BUT since the map function is executed for every batch-row, I can't
> > tell whether the row has a size bigger than 2500.
> >
> > Any ideas?
> >
> > 2013/10/25 lars hofhansl &lt;[email protected]&gt;
> >
> >> We need to finish up HBASE-8369.
> >>
> >> ________________________________
> >> From: Dhaval Shah &lt;[email protected]&gt;
> >> To: "[email protected]" &lt;[email protected]&gt;
> >> Sent: Thursday, October 24, 2013 4:38 PM
> >> Subject: Re: RE: Add Columnsize Filter for Scan Operation
> >>
> >> Well, that depends on your use case ;)
> >>
> >> There are many nuances/code complexities to keep in mind:
> >> - merging results of various HFiles (each region can have more than
> >>   one)
> >> - merging results of the WAL
> >> - applying delete markers
> >> - what about data which is only in memory of the region servers and
> >>   nowhere else
> >> - applying bloom filters for efficiency
> >> - what about hbase filters?
> >>
> >> At some point you would basically start rewriting an HBase region
> >> server in your map-reduce job, which is not ideal for maintainability.
> >>
> >> Do we ever read MySQL data files directly, or do we issue a SQL query?
> >> Kind of goes back to the same argument ;)
> >>
> >> Sent from Yahoo Mail on Android
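[Editor's note] John's example in the thread above can be reproduced with a small stand-alone sketch (plain Java, no HBase dependencies; the class and method names are made up for illustration). With Scan.setBatch(1000), a 3500-column row arrives as batches of 1000, 1000, 1000 and 500, so a per-batch "drop if smaller than the batch size" filter silently discards the legitimate final batch:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDropDemo {

    // Split a row's total column count into batch sizes,
    // the way Scan.setBatch(batchSize) slices a wide row.
    static List<Integer> batches(int totalColumns, int batchSize) {
        List<Integer> out = new ArrayList<>();
        for (int left = totalColumns; left > 0; left -= batchSize) {
            out.add(Math.min(left, batchSize));
        }
        return out;
    }

    // The naive per-batch filter from the thread: keep only full batches.
    // This undercounts every row whose size is not a multiple of batchSize.
    static int countKeptColumns(int totalColumns, int batchSize) {
        int kept = 0;
        for (int b : batches(totalColumns, batchSize)) {
            if (b >= batchSize) {   // drop the batch if it is smaller
                kept += b;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(batches(3500, 1000));          // [1000, 1000, 1000, 500]
        System.out.println(countKeptColumns(3500, 1000)); // 3000 -- the last 500 are lost
    }
}
```

This is exactly why the per-batch check cannot work on its own: the final batch of a big row is indistinguishable from a genuinely small row.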

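[Editor's note] The fix Dhaval suggests, and that John reports adopting at the top of the thread, is to keep the running count in fields of the mapper: since all batches of one row reach the same mapper in consecutive map() calls, the decision can be made at the row boundary or in cleanup(). A minimal stand-alone sketch of that logic (plain Java rather than a real HBase TableMapper; the class name, the 2500 threshold, and the method signatures are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowColumnCounter {
    static final int MIN_COLUMNS = 2500;   // assumed threshold from the thread

    private String currentRow = null;      // state carried across map() calls
    private int runningCount = 0;
    private final Map<String, Integer> bigRows = new LinkedHashMap<>();

    // Called once per batch, like Mapper.map() is called once per batched row.
    void map(String rowKey, int columnsInBatch) {
        if (currentRow != null && !currentRow.equals(rowKey)) {
            emitIfBig();                   // previous row is now complete
            runningCount = 0;
        }
        currentRow = rowKey;
        runningCount += columnsInBatch;
    }

    // Called once at the end of the input, like Mapper.cleanup().
    void cleanup() {
        emitIfBig();                       // flush the last row
    }

    private void emitIfBig() {
        if (currentRow != null && runningCount > MIN_COLUMNS) {
            bigRows.put(currentRow, runningCount);
        }
    }

    Map<String, Integer> result() { return bigRows; }

    public static void main(String[] args) {
        RowColumnCounter m = new RowColumnCounter();
        // row1 has 3500 columns scanned with setBatch(1000): 1000+1000+1000+500
        for (int batch : new int[]{1000, 1000, 1000, 500}) m.map("row1", batch);
        m.map("row2", 200);                // small row, should be skipped
        m.cleanup();
        System.out.println(m.result());    // prints {row1=3500}
    }
}
```

In the real map-only job, map() would be driven by the scan (configured with Scan.setBatch(...) and, as Dhaval suggests, a KeyOnlyFilter, wired up via TableMapReduceUtil.initTableMapperJob(...)), and emitIfBig() would write the qualifying row key to the job output instead of a map.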