Re: Help on Designing Cassandra table for my usecase

Peter Lin Fri, 10 Jan 2014 05:13:01 -0800

Indexes on high-cardinality columns are a general database issue, so this
is not unique to Cassandra or NoSQL.



On Fri, Jan 10, 2014 at 7:35 AM, Vivek Mishra <[email protected]> wrote:

> @Naresh
> Too many indices, or indices on high-cardinality columns, should be
> avoided; they are always a performance problem. A set will not contain
> duplicate values.
>
> -Vivek
>
>
> On Fri, Jan 10, 2014 at 5:48 PM, Naresh Yadav <[email protected]> wrote:
>
>> @Thunder
>> I just learned about CASSANDRA-4511
>> <https://issues.apache.org/jira/browse/CASSANDRA-4511>, which allows
>> indexes on collections and will be part of release 2.1. I hope that solves
>> my problem: I could change the tag column in your table design to set<text>
>> and define a secondary index on it. Is there any risk of performance
>> problems with this design, keeping in mind that the data is huge?
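>>
>> A rough, untested sketch of that idea, assuming the CONTAINS syntax
>> proposed in that ticket. The table name, index name and tag_key column
>> are made up for illustration; tag_key (all tags of a row joined in sorted
>> order) is only there because, as far as I know, a collection column such
>> as set<text> cannot be used in the primary key:
>>
>> ```cql
>> CREATE TABLE metric_data_tagged (
>>   metric text,
>>   time text,
>>   period text,
>>   tag_key text,      -- hypothetical, e.g. 'Pen|U.S.A'; keeps rows unique
>>   tags set<text>,    -- the same tags, as an indexable collection
>>   value int,
>>   PRIMARY KEY ((metric, time), period, tag_key)
>> );
>>
>> -- secondary index on the collection (needs 2.1 per CASSANDRA-4511)
>> CREATE INDEX metric_data_tags_idx ON metric_data_tagged (tags);
>>
>> -- rows in one partition that carry a given tag
>> SELECT * FROM metric_data_tagged
>>  WHERE metric = 'Sales' AND time = 'Month' AND tags CONTAINS 'U.S.A';
>> ```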
>>
>>
>> Naresh
>>
>> On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav <[email protected]> wrote:
>>
>>> @Thunder, thanks for suggesting a design, but my main problem is
>>> indexing/querying the dynamic Tag on each row. It is the main context of
>>> each row, and most queries will include it.
>>>
>>> As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am
>>> able to store exactly the same data, and all the queries worked, so Blur
>>> allows dynamic indexing of the tag column. BUT by moving away from
>>> Cassandra I am losing its strengths, so I am not confident in this
>>> decision, as the data will be huge in my case.
>>>
>>> Please guide me on this with better suggestions.
>>>
>>> Thanks
>>> Naresh
>>>
>>> On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges <
>>> [email protected]> wrote:
>>>
>>>> Well, I think you essentially have time-series data, which C* should
>>>> handle well; however, I think your "Tag" column is going to cause trouble.
>>>> C* does have collection columns, but they are neither indexable nor usable
>>>> in a WHERE clause. Your example needs both uniqueness of the data (primary
>>>> key) and query filtering on potentially multiple "Tag" columns. That is
>>>> not supported in C* AFAIK. If it were a single Tag, it could possibly be
>>>> an indexed column.
>>>>
>>>> Ignoring that issue with the many different Tags, you could model the
>>>> table as:
>>>>
>>>> CREATE TABLE metric_data (
>>>>   metric text,
>>>>   time text,
>>>>   period text,
>>>>   tag text,
>>>>   value int,
>>>>   PRIMARY KEY( (metric,time), period, tag)
>>>> );
>>>>
>>>> That makes a composite partition key on metric and time, meaning you'd
>>>> always have to pass both (or else page through all rows in effectively
>>>> random order via TOKEN). After specifying metric and time, you could
>>>> optionally also specify period, or period and tag (clustering columns
>>>> have to be restricted in order, so tag cannot be given without period),
>>>> and results would be ordered (clustered) by period, then tag. This would
>>>> satisfy your queries a, b, and d but not c (as you did not specify time).
>>>> If Time is a granularity column, does it even make sense to return records
>>>> across differing time values? What does it mean to return the 4 month rows
>>>> and the 1 year row in your example? Could you issue N queries in this case
>>>> (where N is the small number of time granularities you have)?
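>>>>
>>>> As a rough, untested sketch (literal values taken from your sample data;
>>>> it assumes the metric_data table above with its single tag column, and
>>>> that time is supplied for query d as well), the queries this layout can
>>>> serve would look something like:
>>>>
>>>> ```cql
>>>> -- (a) whole partition: both partition key columns are given
>>>> SELECT * FROM metric_data
>>>>  WHERE metric = 'Sales' AND time = 'Month';
>>>>
>>>> -- (b) narrow by the first clustering column
>>>> SELECT * FROM metric_data
>>>>  WHERE metric = 'Sales' AND time = 'Month' AND period = 'Jan-10';
>>>>
>>>> -- (d) narrow further by the second clustering column (one tag only)
>>>> SELECT * FROM metric_data
>>>>  WHERE metric = 'Sales' AND time = 'Month'
>>>>    AND period = 'Jan-10' AND tag = 'Pen';
>>>> ```
>>>>
>>>> Query c has no equivalent here, because it does not restrict time, which
>>>> is part of the partition key.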
>>>>
>>>> I'm not sure how close that gets you, or if you can re-work your
>>>> concept of Tag at all.
>>>> Good luck.
>>>> Thunder
>>>>
>>>>
>>>>
>>>> On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger <[email protected]> wrote:
>>>>
>>>>> To my eye that looks like something that traditional analytics systems
>>>>> do. You could check out e.g. Acunu Analytics, which uses Cassandra as a
>>>>> backend.
>>>>>
>>>>> Cheers,
>>>>> Hannu
>>>>>
>>>>>
>>>>> 2014/1/9 Naresh Yadav <[email protected]>
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a use case with a huge amount of data that I am not able to model
>>>>>> in Cassandra.
>>>>>>
>>>>>> Table name : MetricResult
>>>>>>
>>>>>> Sample Data :
>>>>>>
>>>>>> Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pen,    Value=10
>>>>>> Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pencil, Value=20
>>>>>> Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pen,    Value=30
>>>>>> Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pencil, Value=10
>>>>>> Metric=Sales,    Time=Month, Period=Feb-10,     Tag=India,             Value=90
>>>>>> Metric=Sales,    Time=Year,  Period=2010,       Tag=U.S.A,             Value=70
>>>>>> Metric=Cost,     Time=Year,  Period=2010,       Tag=CPU,               Value=8000
>>>>>> Metric=Cost,     Time=Year,  Period=2010,       Tag=RAM,               Value=4000
>>>>>> Metric=Cost,     Time=Year,  Period=2011,       Tag=CPU,               Value=9000
>>>>>> Metric=Resource, Time=Week,  Period=Week1-2013,                        Value=100
>>>>>>
>>>>>> So in the above case I have:
>>>>>>          Time-series data, i.e. the Time and Period columns
>>>>>>          Dynamic columns, i.e. the Tag column
>>>>>>          Indexing on dynamic columns, i.e. the Tag column
>>>>>>          Aggregations: SUM, AVERAGE
>>>>>>          Overwrites: if a value arrives again for the same Metric,
>>>>>>          Time, Period and Tag, it replaces the existing one
>>>>>>
>>>>>> Queries i need to support :
>>>>>> --------------------------------------
>>>>>> a) Give data for Metric=Sales AND Time=Month
>>>>>>        O/P : 5 rows
>>>>>> b) Give data for Metric=Sales AND Time=Month AND Period=Jan-10
>>>>>>        O/P : 2 rows
>>>>>> c) Give data for Metric=Sales AND Tag=U.S.A
>>>>>>        O/P : 5 rows
>>>>>> d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
>>>>>>        O/P : 1 row
>>>>>>
>>>>>>
>>>>>> This table can hold terabytes of data, and a single Metric, Period
>>>>>> combination can have millions of rows.
>>>>>>
>>>>>> Please suggest how to design/model this table in Cassandra. If this is a
>>>>>> limitation in Cassandra, please suggest the best technology to handle it.
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Naresh
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
