Re: Design questions/Schema help

Dave Viner Mon, 26 Jul 2010 19:06:35 -0700

AFAIK, atomic increments are not available.  There has been quite a bit of
discussion about them recently, so you might search the archives.
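
To illustrate why the "read the count, then store count + 1" approach is not an atomic increment, here is a minimal sketch. It uses a plain in-memory dict as a stand-in for the counter column family; the `read` and `write` helpers are hypothetical and no Cassandra client is involved. The interleaving of the two clients is written out by hand so the outcome is deterministic.

```python
# In-memory stand-in for a counter column family.
# Hypothetical helper names; this is not a Cassandra client API.
counts = {"query-foo-bar": 0}


def read(key):
    """Return the stored count for key, or 0 if it has never been written."""
    return counts.get(key, 0)


def write(key, value):
    """Overwrite the stored count for key."""
    counts[key] = value


# Two clients log the same query at about the same time.
# Both read before either one writes.
seen_by_a = read("query-foo-bar")  # client A sees 0
seen_by_b = read("query-foo-bar")  # client B sees 0
write("query-foo-bar", seen_by_a + 1)
write("query-foo-bar", seen_by_b + 1)

# Two searches were logged, but the second write overwrote the first.
lost_update_total = read("query-foo-bar")
```

One of the two increments is lost, which is why a read-then-write counter tends to undercount under concurrent writers.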



Dave Viner

On Mon, Jul 26, 2010 at 7:02 PM, Mark <[email protected]> wrote:

>  On 7/26/10 6:06 PM, Dave Viner wrote:
>
> I'd love to hear others' opinions here... but here are my 2 cents.
>
>  With Cassandra, you need to think of the queries - which you've pretty
> much done.
>
>  For the most popular queries, you could do something like:
>
>              <ColumnFamily Name="QueriesCounted"
>                 ComparesWith="UTF8Type"
>                 />
> And then access it as:
> key-space.QueriesCounted['query-foo-bar'] = $count;
>
>  This makes it easy to get the count for any particular query.  I'm not
> sure of the best way to store the "top counts" idea.  Perhaps a secondary
> process which iterates over all the queries, sorts them by count, and then
> stores the result in another ColumnFamily.
>
>  You could use the same idea for the last query (session ids by query)
>
>              <ColumnFamily Name="QueriesRecorded"
>                 ComparesWith="UTF8Type"
>                 ColumnType="super"
>                 CompareSubcolumnsWith="TimeUUIDType"
>                 />
> And then access it as:
> key-space.QueriesRecorded['query-foo-bar'][timeuuid] = session-id;
>
>  Actually, if you used that idea (queries-recorded), you could generate
> the counts and aggregates from that directly in a Hadoop post-processing
> step...
>
>  But perhaps others will have better ideas.  If you haven't read
> http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model, go read it
> now.  It won't answer your question directly, but it walks through
> modeling a blog in Cassandra so you can get a sense of the process.
>
>  Dave Viner
>
>
>
>
> On Mon, Jul 26, 2010 at 4:46 PM, Mark <[email protected]> wrote:
>
>>  We are thinking about using Cassandra to store our search logs. Can
>> someone point me in the right direction/lend some guidance on design? I am
>> new to Cassandra and I am having trouble wrapping my head around some of
>> these new concepts. My brain keeps wanting to go back to a RDBMS design.
>>
>> We will be storing the user query, # of hits returned and their session
>> id. We would like to be able to answer the following questions.
>>
>> - What are the n most popular queries and their counts within the last x
>> (mins/hours/days/etc.)? Basically the most popular searches within a given
>> time range.
>> - What is the most popular query within the last x where hits = 0. Same as
>> above but with an extra "where" clause
>> - For session id x give me all their other queries
>> - What are all the session ids that searched for 'foos'
>>
>> We accomplish the above functionality w/ MySQL using 2 tables. One for the
>> raw search log information and the other to keep the aggregate/running
>> counts of queries.
>>
>> Would this sort of ad-hoc querying be better implemented using Hadoop +
>> Hive? If so, should I be storing all this information in Cassandra then
>> using Hadoop to retrieve it?
>>
>> Thanks for your suggestions
>>
>
>  "Perhaps a secondary process which iterates over all the queries, sorts
> them by count, and then stores the result in another ColumnFamily."
>
> - I was trying to avoid this. Is there some sort of atomic increment
> feature available? I guess I could do the same thing we are currently doing
> which is...
>
> a) store full query details into table A
> b) query table B for aggregate count of query 'foo' then store count + 1
>
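
As a rough illustration of the QueriesRecorded idea quoted above, the sketch below derives the "top n queries in the last x" and "session ids for a query" answers from the recorded rows in a batch pass, which is the kind of work the secondary or Hadoop post-processing step would do. It is only an in-memory model: nested dicts stand in for the super column family, plain `datetime` values stand in for TimeUUID column names, and the function names and sample data are made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Stand-in for QueriesRecorded[query][timestamp] = session_id.
# Assumption: datetimes replace the TimeUUID column names for brevity.
queries_recorded = {
    "foos": {
        datetime(2010, 7, 26, 18, 0): "s1",
        datetime(2010, 7, 26, 18, 30): "s2",
    },
    "bars": {
        datetime(2010, 7, 26, 12, 0): "s1",
    },
}


def top_queries(recorded, since, n):
    """Return up to n (query, count) pairs for searches at or after `since`."""
    counts = Counter()
    for query, columns in recorded.items():
        hits = sum(1 for ts in columns if ts >= since)
        if hits:
            counts[query] = hits
    return counts.most_common(n)


def sessions_for(recorded, query):
    """Return the distinct session ids that searched for `query`, sorted."""
    return sorted(set(recorded.get(query, {}).values()))


now = datetime(2010, 7, 26, 19, 0)
top_last_two_hours = top_queries(queries_recorded, now - timedelta(hours=2), 5)
foos_sessions = sessions_for(queries_recorded, "foos")
```

Because the counts are recomputed from the raw rows, this avoids the read-then-increment step entirely, at the cost of the top-n list only being as fresh as the last batch run.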
