Re: Help on Designing Cassandra table for my usecase

Naresh Yadav Fri, 10 Jan 2014 04:20:04 -0800

@Thunder
I just came to know about
(CASSANDRA-4511<https://issues.apache.org/jira/browse/CASSANDRA-4511>)
which allows Index on Collections and that will be part of release 2.1.
I hope in that case my problem will be solved by changing your designed
table with tag column as set<text> and defining secondary index on it. Is
there any risk of performance problem of this design keeping in mind huge
data ???



Naresh

On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav <[email protected]> wrote:

> @Thunder thanks for suggesting design but my main problem is
> indexing/quering dynamic Tag on each row that is main context of each row
> and most of queries will include that..
>
> As an alternative to cassandra, i tried Apache Blur, in blur table i am
> able to store exact same data and all queries also worked..so blur  allows
> dynamic indexing  of tag column BUT moving away from cassandra, i am
> loosing its strength because of that i am not confident on this decision as
> data will be huge in my case.
>
> Please guide me on this with better suggestions.
>
> Thanks
> Naresh
>
> On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges <
> [email protected]> wrote:
>
>> Well I think you have essentially time-series data, which C* should
>> handle well, however I think your "Tag" column is going to cause troubles.
>> C* does have collection columns, but they are not indexable nor usable in
>> WHERE clause. Your example has both the uniqueness of the data (primary
>> key) and query filtering on potentially multiple "Tag" columns. That is not
>> supported in C* AFAIK.If it were a single Tag, that could be a column that
>> is Indexed possibly.
>>
>> Ignoring that issue with the many different Tags, You could model the
>> table as:
>>
>> CREATE TABLE metric_data (
>>   metric text,
>>   time text,
>>   period text,
>>   tag text,
>>   value int,
>>   PRIMARY KEY( (metric,time), period, tag)
>> )
>>
>> That would make a composite partitioning key on metric and time meaning
>> you'd always have to pass those (or else randomly page via TOKEN through
>> all rows). After specifying metric and time, you could optionally also
>> specify period and/or tag, and results would be ordered (clustered) by
>> period. This would satisfy your queries a,b, and d but not c (as you did
>> not specify time). If Time was a granularity column, does it even make
>> sense to return records across differing time values? What does it mean to
>> return the 4 month rows and 1 year row in your example? Could you issue N
>> queries in this case (where N is a small number of each of your time
>> granularities) ?
>>
>> I'm not sure how close that gets you, or if you can re-work your concept
>> of Tag at all.
>> Good luck.
>> Thunder
>>
>>
>>
>> On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger <[email protected]> wrote:
>>
>>> To my eye that looks something what the traditional analytics systems
>>> do. You can check out e.g. Acunu Analytics which uses Cassandra as a
>>> backend.
>>>
>>> Cheers,
>>> Hannu
>>>
>>>
>>> 2014/1/9 Naresh Yadav <[email protected]>
>>>
>>>> Hi all,
>>>>
>>>> I have a use case with huge data which i am not able to design in
>>>> cassandra.
>>>>
>>>> Table name : MetricResult
>>>>
>>>> Sample Data :
>>>>
>>>> Metric=Sales, Time=Month,  Period=Jan-10, Tag=U.S.A, Tag=Pen,
>>>> Value=10
>>>> Metric=Sales, Time=Month, Period=Jan-10, Tag=U.S.A, Tag=Pencil,
>>>> Value=20
>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pen,
>>>> Value=30
>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pencil,
>>>> Value=10
>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=India,
>>>>    Value=90
>>>> Metric=Sales, Time=Year, Period=2010,       Tag=U.S.A,
>>>>    Value=70
>>>> Metric=Cost,  Time=Year, Period=2010,    Tag=CPU,
>>>> Value=8000
>>>> Metric=Cost,  Time=Year,  Period=2010,    Tag=RAM,
>>>> Value=4000
>>>> Metric=Cost,  Time=Year  Period=2011,     Tag=CPU,
>>>> Value=9000
>>>> Metric=Resource, Time=Week Period=Week1-2013,
>>>> Value=100
>>>>
>>>> So in above case i have case of
>>>>          TimeSeries data  i.e Time,Period column
>>>>          Dynamic columns i.e Tag column
>>>>          Indexing on dynamic columns i.e Tag column
>>>>          Aggregations SUM, AVERAGE
>>>>          Same value comes again for a Metric, Time, Period, Tag then
>>>> overwrite it
>>>>
>>>> Queries i need to support :
>>>> --------------------------------------
>>>> a)Give data for Metric=Sales AND Time=Month
>>>>        O/P : 5 rows
>>>> b)Give data for Metric=Sales AND Time=Month AND Period=Jan-10
>>>>        O/P : 2 rows
>>>> c)Give data for Metric=Sales AND Tag=U.S.A
>>>>        O/P : 5 rows
>>>> d)Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
>>>>        O/P :1 row
>>>>
>>>>
>>>> This table can have TB's of data and for a Metric,Period can have
>>>> millions of rows.
>>>>
>>>> Please give suggestion to design/model this table in Cassandra. If some
>>>> limitation in Cassandra then suggest best technology to handle this.
>>>>
>>>>
>>>> Thanks
>>>> Naresh
>>>>
>>>
>>>
>>
>
>
>

Re: Help on Designing Cassandra table for my usecase

Reply via email to