Thanks Ted, for the suggestions. On 02/04/2011, at 3:58 PM, Ted Dunning wrote:
> Depending on the speed requirements associated with retrieving bunches of > messages, hbase may have a real edge here. This is a special problem in > that there are common query patterns that allow contiguous reads of lots of > data. That gives a huge advantage to systems like hbase that store data > organized by key. > > You might view it as the karmic opposite of the common hot-spotting problem > due to storing elements by time-stamp. > > On Fri, Apr 1, 2011 at 8:10 PM, Kevin Apte <[email protected] >> wrote: > >> Have you considered using Cassandra? >> >> Kevin >> >> >> On Fri, Apr 1, 2011 at 11:01 PM, M. C. Srivas <[email protected]> wrote: >> >>> Ted, this is a pretty clever idea. >>> >>> On Thu, Mar 31, 2011 at 9:27 PM, Ted Dunning <[email protected]> >>> wrote: >>> >>>> Solr/Elastic search is a fine solution, but probably won't be quite as >>> fast >>>> as a well-tuned hbase solution. >>>> >>>> One key assumption you seem to be making is that you will store >> messages >>>> only once. If you are willing to make multiple updates to tables, then >>> you >>>> can arrange the natural ordering of the table to get what you want. >> For >>>> instance, you could keep the most recent messages (say the last 10 from >>>> each >>>> of the 1000 most recently updated threads) in an in memory table. Then >>> you >>>> could store messages in a thread table indexed by thread:timestamp. >>>> Finally >>>> you could store messages in a table indexed by user:thread or >>>> user:timestamp. This would allow you to display the most recent >> messages >>>> or >>>> thread in near zero time, to display all or the most recent messages >> from >>> a >>>> particular thread with only one retrieval and all of the messages from >> a >>>> particular user in time order in one retrieval. >>>> >>>> On Thu, Mar 31, 2011 at 5:56 PM, Mark Jarecki <[email protected] >>>>> wrote: >>>> >>>>> Hi all, >>>>> >>>>> I'm modelling a schema for storing and retrieving threaded messages, >>>> where, >>>>> for planning purposes: >>>>> >>>>> - there are many millions of users. >>>>> - a user might have up to 1000 threads. >>>>> - each thread might have up to 50000 messages (with some >> threads >>>>> being sparse with only a few messages). >>>>> - the Stargate REST interface is used. >>>>> >>>>> I want to be able to execute the following queries: >>>>> >>>>> - retrieve x latest active threads, with the latest message. >>>>> - retrieve x latest active threads, with the latest message, >>>> offset >>>>> by y. >>>>> - retrieve x latest messages from a thread. >>>>> - retrieve x latest messages from a thread, offset by y. >>>>> >>>>> I've come up with a few possible methods for modelling this. But any >>>>> insights would be greatly appreciated. >>>>> >>>>> Thanks in advance, >>>>> >>>>> Mark >>>>> >>>>> >>>>> Possible solution 1: >>>>> >>>>> TABLE: threads >>>>> KEY: userID : threadID >>>>> COLUMN: latest_message >>>>> >>>>> TABLE: messages >>>>> KEY: userID : threadID : timestamp >>>>> COLUMN: message >>>>> >>>>> Messages are first written to the messages table, and then the >> threads >>>>> table's thread is updated with the latest message. >>>>> >>>>> To fetch the latest x active threads, with the latest message: >>>>> >>>>> - I retrieve all threads and then sort and reduce the results >> on >>>> the >>>>> client. >>>>> >>>>> A concern with this is the fetching of all threads to sort on each >>>> request. >>>>> This could be unwieldy! >>>>> >>>>> >>>>> Possible solution 2: >>>>> >>>>> TABLE: threads >>>>> KEY: userID : timestamp : threadID >>>>> COLUMN: latest_message >>>>> >>>>> TABLE: messages >>>>> KEY: userID : threadID : timestamp >>>>> COLUMN: message >>>>> >>>>> Messages are first written to the messages table, and then the >> threads >>>>> table's is updated with the latest message. The previous latest >> message >>>> is >>>>> then deleted from the threads table. >>>>> >>>>> To fetch the latest x active threads, with the latest message: >>>>> >>>>> - I scan the threads table until I get x unique threads. >>>>> >>>>> A concern with this could be the issue of keeping the threads table >> in >>>> sync >>>>> with the messages table - especially with the deletion of old latest >>>>> messages. >>>>> >>>>> >>>>> Possible solution 3: >>>>> >>>>> TABLE: messages >>>>> KEY: userID : timestamp : threadID >>>>> COLUMN: message >>>>> >>>>> To fetch the latest x active threads, with the latest message: >>>>> >>>>> - I scan the messages table until I get x unique threads. >>>>> >>>>> One of my concerns with this method is that some threads will be >> busier >>>>> than others, forcing a scan through nearly all of a user's messages. >>> And >>>>> there will be an ever increasing number of messages. A periodic >>> archiving >>>>> process - moving older messages to another table - might alleviate >>> things >>>>> here. >>>>> >>>>> >>>>> Possible solution 4: >>>>> >>>>> Use SOLR/Elastic search or equivalent. >>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>
