Re: Modelling threaded messages

Mark Jarecki Sat, 02 Apr 2011 03:22:34 -0700

Thanks Ted, for the suggestions.

On 02/04/2011, at 3:58 PM, Ted Dunning wrote:


> Depending on the speed requirements associated with retrieving bunches of
> messages, hbase may have a real edge here.  This is a special problem in
> that there are common query patterns that allow contiguous reads of lots of
> data.  That gives a huge advantage to systems like hbase that store data
> organized by key.
> 
> You might view it as the karmic opposite of the common hot-spotting problem
> due to storing elements by time-stamp.
> 
> On Fri, Apr 1, 2011 at 8:10 PM, Kevin Apte <[email protected]
>> wrote:
> 
>> Have you considered using Cassandra?
>> 
>> Kevin
>> 
>> 
>> On Fri, Apr 1, 2011 at 11:01 PM, M. C. Srivas <[email protected]> wrote:
>> 
>>> Ted, this is a pretty clever idea.
>>> 
>>> On Thu, Mar 31, 2011 at 9:27 PM, Ted Dunning <[email protected]>
>>> wrote:
>>> 
>>>> Solr/Elastic search is a fine solution, but probably won't be quite as
>>> fast
>>>> as a well-tuned hbase solution.
>>>> 
>>>> One key assumption you seem to be making is that you will store
>> messages
>>>> only once.  If you are willing to make multiple updates to tables, then
>>> you
>>>> can arrange the natural ordering of the table to get what you want.
>> For
>>>> instance, you could keep the most recent messages (say the last 10 from
>>>> each
>>>> of the 1000 most recently updated threads) in an in memory table.  Then
>>> you
>>>> could store messages in a thread table indexed by thread:timestamp.
>>>> Finally
>>>> you could store messages in a table indexed by user:thread or
>>>> user:timestamp.  This would allow you to display the most recent
>> messages
>>>> or
>>>> thread in near zero time, to display all or the most recent messages
>> from
>>> a
>>>> particular thread with only one retrieval and all of the messages from
>> a
>>>> particular user in time order in one retrieval.
>>>> 
>>>> On Thu, Mar 31, 2011 at 5:56 PM, Mark Jarecki <[email protected]
>>>>> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I'm modelling a schema for storing and retrieving threaded messages,
>>>> where,
>>>>> for planning purposes:
>>>>> 
>>>>>       - there are many millions of users.
>>>>>       - a user might have up to 1000 threads.
>>>>>       - each thread might have up to 50000 messages (with some
>> threads
>>>>> being sparse with only a few messages).
>>>>>       - the Stargate REST interface is used.
>>>>> 
>>>>> I want to be able to execute the following queries:
>>>>> 
>>>>>       - retrieve x latest active threads, with the latest message.
>>>>>       - retrieve x latest active threads, with the latest message,
>>>> offset
>>>>> by y.
>>>>>       - retrieve x latest messages from a thread.
>>>>>       - retrieve x latest messages from a thread, offset by y.
>>>>> 
>>>>> I've come up with a few possible methods for modelling this. But any
>>>>> insights would be greatly appreciated.
>>>>> 
>>>>> Thanks in advance,
>>>>> 
>>>>> Mark
>>>>> 
>>>>> 
>>>>> Possible solution 1:
>>>>> 
>>>>> TABLE:          threads
>>>>> KEY:            userID : threadID
>>>>> COLUMN:         latest_message
>>>>> 
>>>>> TABLE:          messages
>>>>> KEY:            userID : threadID : timestamp
>>>>> COLUMN:         message
>>>>> 
>>>>> Messages are first written to the messages table, and then the
>> threads
>>>>> table's thread is updated with the latest message.
>>>>> 
>>>>> To fetch the latest x active threads, with the latest message:
>>>>> 
>>>>>       - I retrieve all threads and then sort and reduce the results
>> on
>>>> the
>>>>> client.
>>>>> 
>>>>> A concern with this is the fetching of all threads to sort on each
>>>> request.
>>>>> This could be unwieldy!
>>>>> 
>>>>> 
>>>>> Possible solution 2:
>>>>> 
>>>>> TABLE:          threads
>>>>> KEY:            userID : timestamp : threadID
>>>>> COLUMN:         latest_message
>>>>> 
>>>>> TABLE:          messages
>>>>> KEY:            userID : threadID : timestamp
>>>>> COLUMN:         message
>>>>> 
>>>>> Messages are first written to the messages table, and then the
>> threads
>>>>> table's is updated with the latest message. The previous latest
>> message
>>>> is
>>>>> then deleted from the threads table.
>>>>> 
>>>>> To fetch the latest x active threads, with the latest message:
>>>>> 
>>>>>       - I scan the threads table until I get x unique threads.
>>>>> 
>>>>> A concern with this could be the issue of keeping the threads table
>> in
>>>> sync
>>>>> with the messages table - especially with the deletion of old latest
>>>>> messages.
>>>>> 
>>>>> 
>>>>> Possible solution 3:
>>>>> 
>>>>> TABLE:          messages
>>>>> KEY:            userID : timestamp : threadID
>>>>> COLUMN:         message
>>>>> 
>>>>> To fetch the latest x active threads, with the latest message:
>>>>> 
>>>>>       - I scan the messages table until I get x unique threads.
>>>>> 
>>>>> One of my concerns with this method is that some threads will be
>> busier
>>>>> than others, forcing a scan through nearly all of a user's messages.
>>> And
>>>>> there will be an ever increasing number of messages. A periodic
>>> archiving
>>>>> process - moving older messages to another table - might alleviate
>>> things
>>>>> here.
>>>>> 
>>>>> 
>>>>> Possible solution 4:
>>>>> 
>>>>> Use SOLR/Elastic search or equivalent.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>

Re: Modelling threaded messages

Reply via email to