Ted, this is a pretty clever idea. On Thu, Mar 31, 2011 at 9:27 PM, Ted Dunning <[email protected]> wrote:
> Solr/Elastic search is a fine solution, but probably won't be quite as fast > as a well-tuned hbase solution. > > One key assumption you seem to be making is that you will store messages > only once. If you are willing to make multiple updates to tables, then you > can arrange the natural ordering of the table to get what you want. For > instance, you could keep the most recent messages (say the last 10 from > each > of the 1000 most recently updated threads) in an in memory table. Then you > could store messages in a thread table indexed by thread:timestamp. > Finally > you could store messages in a table indexed by user:thread or > user:timestamp. This would allow you to display the most recent messages > or > thread in near zero time, to display all or the most recent messages from a > particular thread with only one retrieval and all of the messages from a > particular user in time order in one retrieval. > > On Thu, Mar 31, 2011 at 5:56 PM, Mark Jarecki <[email protected] > >wrote: > > > Hi all, > > > > I'm modelling a schema for storing and retrieving threaded messages, > where, > > for planning purposes: > > > > - there are many millions of users. > > - a user might have up to 1000 threads. > > - each thread might have up to 50000 messages (with some threads > > being sparse with only a few messages). > > - the Stargate REST interface is used. > > > > I want to be able to execute the following queries: > > > > - retrieve x latest active threads, with the latest message. > > - retrieve x latest active threads, with the latest message, > offset > > by y. > > - retrieve x latest messages from a thread. > > - retrieve x latest messages from a thread, offset by y. > > > > I've come up with a few possible methods for modelling this. But any > > insights would be greatly appreciated. > > > > Thanks in advance, > > > > Mark > > > > > > Possible solution 1: > > > > TABLE: threads > > KEY: userID : threadID > > COLUMN: latest_message > > > > TABLE: messages > > KEY: userID : threadID : timestamp > > COLUMN: message > > > > Messages are first written to the messages table, and then the threads > > table's thread is updated with the latest message. > > > > To fetch the latest x active threads, with the latest message: > > > > - I retrieve all threads and then sort and reduce the results on > the > > client. > > > > A concern with this is the fetching of all threads to sort on each > request. > > This could be unwieldy! > > > > > > Possible solution 2: > > > > TABLE: threads > > KEY: userID : timestamp : threadID > > COLUMN: latest_message > > > > TABLE: messages > > KEY: userID : threadID : timestamp > > COLUMN: message > > > > Messages are first written to the messages table, and then the threads > > table's is updated with the latest message. The previous latest message > is > > then deleted from the threads table. > > > > To fetch the latest x active threads, with the latest message: > > > > - I scan the threads table until I get x unique threads. > > > > A concern with this could be the issue of keeping the threads table in > sync > > with the messages table - especially with the deletion of old latest > > messages. > > > > > > Possible solution 3: > > > > TABLE: messages > > KEY: userID : timestamp : threadID > > COLUMN: message > > > > To fetch the latest x active threads, with the latest message: > > > > - I scan the messages table until I get x unique threads. > > > > One of my concerns with this method is that some threads will be busier > > than others, forcing a scan through nearly all of a user's messages. And > > there will be an ever increasing number of messages. A periodic archiving > > process - moving older messages to another table - might alleviate things > > here. > > > > > > Possible solution 4: > > > > Use SOLR/Elastic search or equivalent. > > > > > > > > > > >
