Suppose we used different families, how would it help? -Jack
On Jan 8, 2011, at 6:47 PM, Todd Lipcon <[email protected]> wrote:
> Hi Jack,
>
> Why not put photos and texts in separate column families?
>
> -Todd
>
> On Sat, Jan 8, 2011 at 2:57 PM, Jack Levin <[email protected]> wrote:
>
>> Future-wise we plan to have millions of rows, probably across multiple
>> regions. Even if IO is not a problem, doing millions of filter operations
>> does not make much sense.
>>
>> -Jack
>>
>> On Sat, Jan 8, 2011 at 2:54 PM, Andrey Stepachev <[email protected]> wrote:
>>
>>> Ok. Understood.
>>>
>>> But did you check whether it is really an issue? I think it is only 1 IO
>>> here (especially if compression is used). Do you have big rows?
>>>
>>> 2011/1/9 Jack Levin <[email protected]>
>>>
>>>> Sorting is not the issue; the data can be located at the beginning,
>>>> middle, or end, or any combination thereof. I only gave the worst-case
>>>> scenario as an example. I understand that filtering will produce the
>>>> results we want, but at the cost of examining every row and offloading
>>>> AND/join logic to the application.
>>>>
>>>> -Jack
>>>>
>>>> On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev <[email protected]> wrote:
>>>>
>>>>> You can read more details on binary sorting at
>>>>> http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
>>>>>
>>>>> 2011/1/8 Jack Levin <[email protected]>
>>>>>
>>>>>> Basic problem described:
>>>>>>
>>>>>> A user uploads 1 image and creates some text 10 days ago, then creates
>>>>>> 1000 text messages between 9 days ago and today:
>>>>>>
>>>>>> row key     | fm:type --> value
>>>>>>
>>>>>> 00days:uid  | type:text --> text_id
>>>>>> .
>>>>>> .
>>>>>> 09days:uid  | type:text --> text_id
>>>>>>
>>>>>> 10days:uid  | type:photo --> URL
>>>>>>             | type:text --> text_id
>>>>>>
>>>>>> We want to skip all the way to the 10days:uid row without reading the
>>>>>> 00days:uid - 09days:uid rows.
>>>>>> Ideally we do not want to read all 1000 entries that have _only_ text.
>>>>>> We want to get to the last entry in the most efficient way possible.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Sat, Jan 8, 2011 at 11:43 AM, Stack <[email protected]> wrote:
>>>>>>
>>>>>>> Strike that. This is a Scan, so we can't do blooms + filter. Sorry.
>>>>>>> Sounds like a coprocessor then. You'd have your query 'lean' on the
>>>>>>> column that you know has the fewer items, and then per item you'd do
>>>>>>> a get inside the coprocessor against the column with many entries.
>>>>>>> The get would go via blooms.
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>> On Sat, Jan 8, 2011 at 11:39 AM, Stack <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, we thought about using filters. The issue is, if one column
>>>>>>>>> family has 1 million values and a second column family has 10 values
>>>>>>>>> at the bottom, we would end up scanning and filtering 999,990 records
>>>>>>>>> and throwing them away, which seems inefficient.
>>>>>>>>
>>>>>>>> Blooms + filters?
>>>>>>>> St.Ack
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
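[Editor's note: the key-design idea Jack describes, and the binary-sorting article Andrey links, boil down to HBase storing rows sorted lexicographically by byte key, so a Scan given a start row can seek straight to the row of interest instead of filtering earlier rows. The sketch below is a hypothetical illustration added for clarity, not code from the thread; it emulates one region's sorted rows with a `TreeMap` and the scan's start row with `tailMap`, using the `NNdays:uid` keys from Jack's example.]

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class KeySkipSketch {
    // Zero-padded day bucket keeps keys lexicographically ordered by age,
    // mirroring the "00days:uid" .. "10days:uid" layout in the thread.
    static String rowKey(int daysAgo, String uid) {
        return String.format("%02ddays:%s", daysAgo, uid);
    }

    public static void main(String[] args) {
        // A sorted map stands in for one region's rows.
        NavigableMap<String, String> region = new TreeMap<>();
        for (int d = 0; d <= 9; d++) {
            region.put(rowKey(d, "u1"), "type:text=text_id");
        }
        region.put(rowKey(10, "u1"), "type:photo=URL");

        // Emulate a Scan whose start row is the "10days:" prefix: the seek
        // jumps to the first row at or after that key, so the ten earlier
        // text-only rows are never touched, let alone filtered.
        NavigableMap<String, String> scan = region.tailMap("10days:", true);

        System.out.println(scan.firstKey());               // 10days:u1
        System.out.println(scan.firstEntry().getValue());  // type:photo=URL
        System.out.println(scan.size());                   // 1 row touched
    }
}
```

The same property is why two-digit padding matters: with unpadded keys, "10days" would sort before "2days" and the seek would land in the wrong place. The real client-side equivalent would be setting the scan's start row rather than relying on a filter, which is the distinction Jack is drawing.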
