2011/1/9 Jack Levin <[email protected]>

> Future-wise we plan to have millions of rows, probably across multiple
> regions. Even if IO is not a problem, doing millions of filter operations
> does not make much sense.
>
It depends on the selectivity of your photo column. If photos are rare
(say 1% of rows have photos), it is wiser to scan only the photo family
and then fetch the other families. If selectivity is high, you will have
only a small number of mismatches. But I agree that HBase doesn't have a
feature like "first check this family, and if it has a value, proceed to
the others", and in some cases that could be very useful (for in-place
indexing).

> -Jack
>
> On Sat, Jan 8, 2011 at 2:54 PM, Andrey Stepachev <[email protected]> wrote:
>
> > Ok. Understood.
> >
> > But did you check whether it is really an issue? I think there is only
> > 1 IO here (especially if compression is used). Do you have big rows?
> >
> > 2011/1/9 Jack Levin <[email protected]>
> >
> > > Sorting is not the issue; the location of the data can be at the
> > > beginning, the middle, the end, or any combination thereof. I only
> > > gave the worst-case scenario as an example. I understand that
> > > filtering will produce the results we want, but at the cost of
> > > examining every row and offloading AND/join logic to the application.
> > >
> > > -Jack
> > >
> > > On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev <[email protected]> wrote:
> > >
> > > > You can read more details on binary sorting here:
> > > >
> > > > http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
> > > >
> > > > 2011/1/8 Jack Levin <[email protected]>
> > > >
> > > > > Basic problem described:
> > > > >
> > > > > A user uploads 1 image and creates some text 10 days ago, then
> > > > > creates 1000 text messages between 9 days ago and today:
> > > > >
> > > > > row key    | fm:type --> value
> > > > >
> > > > > 00days:uid | type:text --> text_id
> > > > > .
> > > > > .
> > > > > 09days:uid | type:text --> text_id
> > > > >
> > > > > 10days:uid | type:photo --> URL
> > > > >            | type:text  --> text_id
> > > > >
> > > > > Skip all the way to the 10days:uid row, without reading the
> > > > > 00days:uid - 09days:uid rows. Ideally we do not want to read all
> > > > > 1000 entries that have _only_ text. We want to get to the last
> > > > > entry in the most efficient way possible.
> > > > >
> > > > > -Jack
> > > > >
> > > > > On Sat, Jan 8, 2011 at 11:43 AM, Stack <[email protected]> wrote:
> > > > > > Strike that. This is a Scan, so we can't do blooms + filter.
> > > > > > Sorry. Sounds like a coprocessor then. You'd have your query
> > > > > > 'lean' on the column that you know has the fewer items, and
> > > > > > then per item you'd do a get inside the coprocessor against the
> > > > > > column of many entries. The get would go via blooms.
> > > > > >
> > > > > > St.Ack
> > > > > >
> > > > > > On Sat, Jan 8, 2011 at 11:39 AM, Stack <[email protected]> wrote:
> > > > > >> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <[email protected]> wrote:
> > > > > >>> Yes, we thought about using filters. The issue is, if one
> > > > > >>> family column has 1M values, and a second family column has
> > > > > >>> 10 values at the bottom, we would end up scanning and
> > > > > >>> filtering 999,990 records and throwing them away, which
> > > > > >>> seems inefficient.
> > > > > >>
> > > > > >> Blooms + filters?
> > > > > >> St.Ack
> > > > > >>
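[Editor's note] To make the tradeoff discussed in this thread concrete, here is a small self-contained Java sketch. It has no HBase dependency: the two `TreeMap`s, the class name, and the method names are made up for illustration, standing in for two column families stored separately, as HBase stores them. It models Jack's example (11 daily rows for one user, a photo only on day 10) and counts rows examined by a full scan with a filter versus a scan restricted to the sparse photo family (roughly what `Scan.addFamily` would give you), which is Andrey's selectivity point.

```java
import java.util.TreeMap;

public class ScanStrategies {
    // Simulated "text" family, keyed like Jack's "NNdays:uid" scheme.
    // TreeMap keeps keys in lexicographic order, as HBase stores rows.
    static final TreeMap<String, TreeMap<String, String>> TEXT_FAMILY = new TreeMap<>();
    // Sparse "photo" family: families are stored separately in HBase,
    // so this store holds only the rows that actually have a photo.
    static final TreeMap<String, String> PHOTO_FAMILY = new TreeMap<>();

    static {
        for (int day = 0; day <= 10; day++) {
            String key = String.format("%02ddays:uid", day);
            TreeMap<String, String> row = new TreeMap<>();
            row.put("type:text", "text_" + day);
            TEXT_FAMILY.put(key, row);
        }
        PHOTO_FAMILY.put("10days:uid", "http://example.invalid/photo.jpg");
    }

    /** Full scan + filter: every row is examined, matching or not. */
    static int rowsExaminedByFilterScan() {
        int examined = 0;
        for (String key : TEXT_FAMILY.keySet()) {
            examined++;
            PHOTO_FAMILY.containsKey(key); // the filter's predicate
        }
        return examined;
    }

    /** Scan only the sparse photo family, then fetch the rest of each
     *  matching row (a per-match Get, in HBase terms). */
    static int rowsExaminedByFamilyScan() {
        int examined = 0;
        for (String key : PHOTO_FAMILY.keySet()) {
            examined++;
            TEXT_FAMILY.get(key); // per-match lookup of the other family
        }
        return examined;
    }

    public static void main(String[] args) {
        System.out.println("filter scan examined " + rowsExaminedByFilterScan() + " rows");
        System.out.println("family scan examined " + rowsExaminedByFamilyScan() + " row(s)");
    }
}
```

At 1% selectivity over a million rows, the same model examines roughly 10,000 rows instead of 1,000,000. The caveat, as Andrey notes, is one extra Get per match, so the advantage shrinks as selectivity rises.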
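[Editor's note] The binary-sorting post Andrey links is about mapping numbers and dates onto bytes so that HBase's lexicographic key ordering matches chronological ordering. A hedged sketch of the idea in plain Java (no HBase types; `encode` and `compare` are stand-ins for the `Bytes` utility methods, and the reversed-timestamp trick is a common key-design pattern, not something the thread itself prescribes): big-endian encoding of a non-negative long sorts correctly as raw unsigned bytes, and subtracting the timestamp from `Long.MAX_VALUE` reverses the order so the newest row sorts first, letting a scan reach the latest entry without walking the older ones.

```java
public class SortableKeys {
    /** Big-endian encoding of a non-negative long: unsigned lexicographic
     *  byte order then agrees with numeric order. */
    static byte[] encode(long v) {
        byte[] b = new byte[8];
        for (int i = 7; i >= 0; i--) { b[i] = (byte) v; v >>>= 8; }
        return b;
    }

    /** Unsigned byte-wise comparison, the order HBase sorts row keys in. */
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return 0;
    }

    public static void main(String[] args) {
        long older = 1294000000000L; // two example timestamps, in millis
        long newer = 1294500000000L;
        // Plain encoding: chronological order, oldest first.
        System.out.println(compare(encode(older), encode(newer)) < 0); // true
        // Reversed timestamp (Long.MAX_VALUE - ts): newest first, so a
        // scan from the start of a user's keyspace hits the latest row.
        System.out.println(compare(encode(Long.MAX_VALUE - newer),
                                   encode(Long.MAX_VALUE - older)) < 0); // true
    }
}
```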
