Re: Design questions/Schema help

Mark Mon, 26 Jul 2010 19:02:38 -0700

On 7/26/10 6:06 PM, Dave Viner wrote:

I'd love to hear others' opinions here... but here are my 2 cents.
With Cassandra, you need to think of the queries - which you've pretty much done.
For the most popular queries, you could do something like:

<ColumnFamily Name="QueriesCounted"
                ComparesWith="UTF8Type"
                />
And then access it as:
key-space.QueriesCounted['query-foo-bar'] = $count;
This makes it easy to get the count for any particular query. I'm not sure the best way to store the "top counts" idea. Perhaps a secondary process which iterates over all the queries to see which sorts the query values by count, and then stores them into another ColumnFamily.
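A minimal sketch of what such a secondary process could look like, using plain Python dicts as stand-ins for the two column families. The names `queries_counted`, `top_counts`, and `top_queries` are hypothetical and this does not use any Cassandra client API:

```python
import heapq

# Stand-in for the QueriesCounted column family: query -> count.
queries_counted = {"query-foo-bar": 12, "query-baz": 40, "query-qux": 7}


def top_queries(counts, n):
    """Return the n (query, count) pairs with the highest counts."""
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])


# Stand-in for the second column family that holds the precomputed ranking.
top_counts = {"top": top_queries(queries_counted, 2)}
```

In this sketch the ranking is only as fresh as the last run of the process, which is the trade-off of precomputing it.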
You could use the same idea for the last query (session ids by query)

<ColumnFamily Name="QueriesRecorded"
                ComparesWith="UTF8Type"
                ColumnType="super"
                CompareSubcolumnsWith="TimeUUIDType"
                />
And then access it as:
key-space.QueriesRecorded['query-foo-bar'][timeuuid] = session-id;
Actually, if you used that idea (queries-recorded), you could generate the counts and aggregates from that directly in a Hadoop post-processing...
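As a rough illustration of deriving counts from the recorded data, here is a sketch in plain Python rather than Hadoop. It assumes a nested dict shaped like QueriesRecorded, with numeric timestamps standing in for the TimeUUID subcolumn names; `queries_recorded`, `counts_since`, and `sessions_for` are hypothetical names:

```python
from collections import Counter

# Stand-in for QueriesRecorded: query -> {timestamp: session_id}.
# Plain numeric timestamps replace the TimeUUID subcolumn names here.
queries_recorded = {
    "query-foo-bar": {100.0: "s1", 160.0: "s2", 400.0: "s1"},
    "query-baz": {390.0: "s3"},
}


def counts_since(recorded, since):
    """Count how often each query was recorded at or after `since`."""
    counts = Counter()
    for query, columns in recorded.items():
        counts[query] = sum(1 for ts in columns if ts >= since)
    return counts


def sessions_for(recorded, query):
    """Return the distinct session ids that searched for `query`."""
    return set(recorded.get(query, {}).values())
```

For example, `counts_since(queries_recorded, 150.0).most_common(1)` would give the most popular query since timestamp 150.0, and `sessions_for` covers the "session ids by query" question.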
But perhaps others will have better ideas. If you haven't read http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model, go read it now. It won't answer your question directly, but will describe the process of modeling a blog in Cassandra so you can get a sense of the process.
Dave Viner
On Mon, Jul 26, 2010 at 4:46 PM, Mark <[email protected]> wrote:
    We are thinking about using Cassandra to store our search logs.
    Can someone point me in the right direction/lend some guidance on
    design? I am new to Cassandra and I am having trouble wrapping my
    head around some of these new concepts. My brain keeps wanting to
    go back to a RDBMS design.

    We will be storing the user query, # of hits returned and their
    session id. We would like to be able to answer the following
    questions.

    - What are the n most popular queries and their counts within the
    last x (mins/hours/days/etc). Basically the most popular searches
    within a given time range.
    - What is the most popular query within the last x where hits = 0.
    Same as above but with an extra "where" clause
    - For session id x give me all their other queries
    - What are all the session ids that searched for 'foos'

    We accomplish the above functionality w/ MySQL using 2 tables. One
    for the raw search log information and the other to keep the
    aggregate/running counts of queries.

    Would this sort of ad-hoc querying be better implemented using
    Hadoop + Hive? If so, should I be storing all this information in
    Cassandra then using Hadoop to retrieve it?

    Thanks for your suggestions

"Perhaps a secondary process which iterates over all the queries to see which sorts the query values by count, and then stores them into another ColumnFamily."

- I was trying to avoid this. Is there some sort of atomic increment feature available? I guess I could do the same thing we are currently doing which is...


a) store full query details into table A
b) query table B for aggregate count of query 'foo' then store count + 1
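To make steps (a) and (b) concrete, here is a minimal sketch with in-memory structures standing in for the two tables. The names `table_a`, `table_b`, and `record_search` are hypothetical:

```python
# In-memory stand-ins for the two tables described above.
table_a = []  # raw search log entries
table_b = {}  # query -> running count


def record_search(query, hits, session_id):
    """Store the raw entry, then read and rewrite the running count."""
    # (a) store full query details into table A
    table_a.append({"query": query, "hits": hits, "session_id": session_id})
    # (b) read the current count for the query, then store count + 1
    current = table_b.get(query, 0)
    table_b[query] = current + 1
```

Note that the read and the write in step (b) are separate operations, so two writers that read the same count at the same time would both store the same value and one increment would be lost. That is why an atomic increment would be preferable.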
