@Dhaval: Thanks! I didn't know that. I've now created a field in the Mapper class which stores information about the previous map() call. That works fine for me.
regards, john

2013/10/25 Dhaval Shah &lt;[email protected]&gt;

> John, an important point to note here is that even though rows will get
> split over multiple calls to scanner.next(), all batches of one row will
> always reach one mapper. Another important point to note is that these
> batches will appear in consecutive calls to mapper.map().
>
> What this means is that you don't need to send your data to the reducer
> (and you will be more efficient by not writing to disk, with no
> shuffle/sort phases and so on). You can just keep the state in memory for
> the particular row being processed (effectively a running count of the
> number of columns) and make the final decision when the row ends
> (effectively, you encounter a different row, or all rows are exhausted
> and you reach the cleanup function).
>
> The way I would do it is a map-only MR job which keeps the state in
> memory as described above and uses the KeyOnlyFilter to reduce the amount
> of data flowing to the mapper.
>
> Regards,
> Dhaval
>
> ________________________________
> From: John &lt;[email protected]&gt;
> To: [email protected]; lars hofhansl &lt;[email protected]&gt;
> Sent: Friday, 25 October 2013 8:02 AM
> Subject: Re: RE: Add Columnsize Filter for Scan Operation
>
> One thing I could do is to drop every batch-row where the column count is
> smaller than the batch size, something like: if (rowsize < batchsize - 1)
> drop row. The problem with this version is that the last batch of a big
> row is also dropped. Here is a little example:
>
> There is one row:
> row1: 3500 columns
>
> If I set the batch to 1000, the map function gets for this row:
>
> 1. iteration: map function gets 1000 columns -> write to disk for the
> reducer
> 2. iteration: map function gets 1000 columns -> write to disk for the
> reducer
> 3. iteration: map function gets 1000 columns -> write to disk for the
> reducer
> 4. iteration: map function gets 500 columns -> drop, because it's smaller
> than the batch size
>
> Is there a way to count the columns across different map() calls?
>
> regards
>
>
> 2013/10/25 John &lt;[email protected]&gt;
>
> > I tried to build an MR job, but in my case that doesn't work, because
> > if I set, for example, the batch to 1000 and there are 5000 columns in
> > a row, I want to find rows where the column size is bigger than 2500.
> > BUT since the map function is executed for every batch-row, I can't
> > tell whether the row has a size bigger than 2500.
> >
> > Any ideas?
> >
> > 2013/10/25 lars hofhansl &lt;[email protected]&gt;
> >
> >> We need to finish up HBASE-8369.
> >>
> >> ________________________________
> >> From: Dhaval Shah &lt;[email protected]&gt;
> >> To: "[email protected]" &lt;[email protected]&gt;
> >> Sent: Thursday, October 24, 2013 4:38 PM
> >> Subject: Re: RE: Add Columnsize Filter for Scan Operation
> >>
> >> Well, that depends on your use case ;)
> >>
> >> There are many nuances/code complexities to keep in mind:
> >> - merging results of various HFiles (each region can have more than
> >>   one)
> >> - merging results of the WAL
> >> - applying delete markers
> >> - what about data which is only in memory of the region servers and
> >>   nowhere else
> >> - applying bloom filters for efficiency
> >> - what about hbase filters?
> >>
> >> At some point you would basically start rewriting an HBase region
> >> server in your map-reduce job, which is not ideal for maintainability.
> >>
> >> Do we ever read MySQL data files directly, or do we issue a SQL query?
> >> Kind of goes back to the same argument ;)
> >>
> >> Sent from Yahoo Mail on Android
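[Editor's note] John's example in the thread above can be reproduced with a small stand-alone sketch (plain Java, no HBase dependencies; the class and method names are made up for illustration). With Scan.setBatch(1000), a 3500-column row arrives as batches of 1000, 1000, 1000 and 500, so a per-batch "drop if smaller than the batch size" filter silently discards the legitimate final batch:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDropDemo {

    // Split a row's total column count into batch sizes,
    // the way Scan.setBatch(batchSize) slices a wide row.
    static List<Integer> batches(int totalColumns, int batchSize) {
        List<Integer> out = new ArrayList<>();
        for (int left = totalColumns; left > 0; left -= batchSize) {
            out.add(Math.min(left, batchSize));
        }
        return out;
    }

    // The naive per-batch filter from the thread: keep only full batches.
    // This undercounts every row whose size is not a multiple of batchSize.
    static int countKeptColumns(int totalColumns, int batchSize) {
        int kept = 0;
        for (int b : batches(totalColumns, batchSize)) {
            if (b >= batchSize) {   // drop the batch if it is smaller
                kept += b;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(batches(3500, 1000));          // [1000, 1000, 1000, 500]
        System.out.println(countKeptColumns(3500, 1000)); // 3000 -- the last 500 are lost
    }
}
```

This is exactly why the per-batch check cannot work on its own: the final batch of a big row is indistinguishable from a genuinely small row.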

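[Editor's note] The fix Dhaval suggests, and that John reports adopting at the top of the thread, is to keep the running count in fields of the mapper: since all batches of one row reach the same mapper in consecutive map() calls, the decision can be made at the row boundary or in cleanup(). A minimal stand-alone sketch of that logic (plain Java rather than a real HBase TableMapper; the class name, the 2500 threshold, and the method signatures are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowColumnCounter {
    static final int MIN_COLUMNS = 2500;   // assumed threshold from the thread

    private String currentRow = null;      // state carried across map() calls
    private int runningCount = 0;
    private final Map<String, Integer> bigRows = new LinkedHashMap<>();

    // Called once per batch, like Mapper.map() is called once per batched row.
    void map(String rowKey, int columnsInBatch) {
        if (currentRow != null && !currentRow.equals(rowKey)) {
            emitIfBig();                   // previous row is now complete
            runningCount = 0;
        }
        currentRow = rowKey;
        runningCount += columnsInBatch;
    }

    // Called once at the end of the input, like Mapper.cleanup().
    void cleanup() {
        emitIfBig();                       // flush the last row
    }

    private void emitIfBig() {
        if (currentRow != null && runningCount > MIN_COLUMNS) {
            bigRows.put(currentRow, runningCount);
        }
    }

    Map<String, Integer> result() { return bigRows; }

    public static void main(String[] args) {
        RowColumnCounter m = new RowColumnCounter();
        // row1 has 3500 columns scanned with setBatch(1000): 1000+1000+1000+500
        for (int batch : new int[]{1000, 1000, 1000, 500}) m.map("row1", batch);
        m.map("row2", 200);                // small row, should be skipped
        m.cleanup();
        System.out.println(m.result());    // prints {row1=3500}
    }
}
```

In the real map-only job, map() would be driven by the scan (configured with Scan.setBatch(...) and, as Dhaval suggests, a KeyOnlyFilter, wired up via TableMapReduceUtil.initTableMapperJob(...)), and emitIfBig() would write the qualifying row key to the job output instead of a map.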