Suppose we used different families, how would it help? -Jack
On Jan 8, 2011, at 6:47 PM, Todd Lipcon <[email protected]> wrote:
> Hi Jack,
>
> Why not put photos and texts in separate column families?
>
> -Todd
>
> On Sat, Jan 8, 2011 at 2:57 PM, Jack Levin <[email protected]> wrote:
>
>> Future-wise we plan to have millions of rows, probably across multiple
>> regions. Even if IO is not a problem, doing millions of filter operations
>> does not make much sense.
>>
>> -Jack
>>
>> On Sat, Jan 8, 2011 at 2:54 PM, Andrey Stepachev <[email protected]> wrote:
>>
>>> Ok. Understood.
>>>
>>> But did you check whether it is really an issue? I think it is only 1 IO
>>> here (especially if compression is used). Do you have big rows?
>>>
>>> 2011/1/9 Jack Levin <[email protected]>
>>>
>>>> Sorting is not the issue; the data can be located at the beginning,
>>>> middle, or end, or any combination thereof. I only gave the worst-case
>>>> scenario as an example. I understand that filtering will produce the
>>>> results we want, but at the cost of examining every row and offloading
>>>> AND/join logic to the application.
>>>>
>>>> -Jack
>>>>
>>>> On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev <[email protected]> wrote:
>>>>
>>>>> You can read more details on binary sorting at
>>>>> http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
>>>>>
>>>>> 2011/1/8 Jack Levin <[email protected]>
>>>>>
>>>>>> Basic problem described:
>>>>>>
>>>>>> A user uploads 1 image and creates some text 10 days ago, then creates
>>>>>> 1000 text messages between 9 days ago and today:
>>>>>>
>>>>>> row key     | fm:type --> value
>>>>>>
>>>>>> 00days:uid  | type:text --> text_id
>>>>>> .
>>>>>> .
>>>>>> 09days:uid  | type:text --> text_id
>>>>>>
>>>>>> 10days:uid  | type:photo --> URL
>>>>>>             | type:text --> text_id
>>>>>>
>>>>>> We want to skip all the way to the 10days:uid row without reading the
>>>>>> 00days:uid - 09days:uid rows.
>>>>>> Ideally we do not want to read all 1000 entries that have _only_ text.
>>>>>> We want to get to the last entry in the most efficient way possible.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Sat, Jan 8, 2011 at 11:43 AM, Stack <[email protected]> wrote:
>>>>>>
>>>>>>> Strike that. This is a Scan, so we can't do blooms + filter. Sorry.
>>>>>>> Sounds like a coprocessor then. You'd have your query 'lean' on the
>>>>>>> column that you know has the fewer items, and then per item you'd do
>>>>>>> a get inside the coprocessor against the column with many entries.
>>>>>>> The get would go via blooms.
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>> On Sat, Jan 8, 2011 at 11:39 AM, Stack <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, we thought about using filters. The issue is, if one column
>>>>>>>>> family has 1 million values and a second column family has 10 values
>>>>>>>>> at the bottom, we would end up scanning and filtering 999,990 records
>>>>>>>>> and throwing them away, which seems inefficient.
>>>>>>>>
>>>>>>>> Blooms + filters?
>>>>>>>> St.Ack
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
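[Editor's note: the key-design idea Jack describes, and the binary-sorting article Andrey links, boil down to HBase storing rows sorted lexicographically by byte key, so a Scan given a start row can seek straight to the row of interest instead of filtering earlier rows. The sketch below is a hypothetical illustration added for clarity, not code from the thread; it emulates one region's sorted rows with a `TreeMap` and the scan's start row with `tailMap`, using the `NNdays:uid` keys from Jack's example.]

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class KeySkipSketch {
    // Zero-padded day bucket keeps keys lexicographically ordered by age,
    // mirroring the "00days:uid" .. "10days:uid" layout in the thread.
    static String rowKey(int daysAgo, String uid) {
        return String.format("%02ddays:%s", daysAgo, uid);
    }

    public static void main(String[] args) {
        // A sorted map stands in for one region's rows.
        NavigableMap<String, String> region = new TreeMap<>();
        for (int d = 0; d <= 9; d++) {
            region.put(rowKey(d, "u1"), "type:text=text_id");
        }
        region.put(rowKey(10, "u1"), "type:photo=URL");

        // Emulate a Scan whose start row is the "10days:" prefix: the seek
        // jumps to the first row at or after that key, so the ten earlier
        // text-only rows are never touched, let alone filtered.
        NavigableMap<String, String> scan = region.tailMap("10days:", true);

        System.out.println(scan.firstKey());               // 10days:u1
        System.out.println(scan.firstEntry().getValue());  // type:photo=URL
        System.out.println(scan.size());                   // 1 row touched
    }
}
```

The same property is why two-digit padding matters: with unpadded keys, "10days" would sort before "2days" and the seek would land in the wrong place. The real client-side equivalent would be setting the scan's start row rather than relying on a filter, which is the distinction Jack is drawing.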
