Re: Help on Designing Cassandra table for my usecase

Vivek Mishra Fri, 10 Jan 2014 04:36:03 -0800

@Naresh
Too many indexes, or indexes on high-cardinality columns, should be avoided;
they always cause performance problems. A set will not contain duplicate values.
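
For reference, a rough sketch of the set-based variant Naresh describes
below, assuming Cassandra 2.1 or later so that collection indexes
(CASSANDRA-4511) are available. The names metric_data_by_tags, tag_key and
metric_tags_idx are made up for illustration, and tag_key (the sorted tags
joined into one string) is my own assumption for keeping rows unique, since
a plain set cannot be part of the primary key. Untested.

```cql
-- Hypothetical table: same columns as Thunder's metric_data, but the tags
-- of a row live in one set<text> column.
CREATE TABLE metric_data_by_tags (
  metric text,
  time text,
  period text,
  tag_key text,     -- assumption: sorted tags joined, e.g. 'Pen|U.S.A'
  tags set<text>,   -- each tag is stored once, duplicates collapse
  value int,
  PRIMARY KEY ((metric, time), period, tag_key)
);

-- Secondary index on the collection (needs 2.1+).
CREATE INDEX metric_tags_idx ON metric_data_by_tags (tags);

INSERT INTO metric_data_by_tags (metric, time, period, tag_key, tags, value)
VALUES ('Sales', 'Month', 'Jan-10', 'Pen|U.S.A', {'Pen', 'U.S.A'}, 10);

-- Rows in one partition whose tag set includes a given tag.
SELECT * FROM metric_data_by_tags
WHERE metric = 'Sales' AND time = 'Month' AND tags CONTAINS 'U.S.A';
```

The index on tags is the part to watch: with many distinct tag values it is
exactly the kind of high-cardinality index mentioned above.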


-Vivek


On Fri, Jan 10, 2014 at 5:48 PM, Naresh Yadav <[email protected]> wrote:

> @Thunder
> I just learned about CASSANDRA-4511
> <https://issues.apache.org/jira/browse/CASSANDRA-4511>,
> which allows indexes on collections and will be part of release 2.1.
> I hope that in that case my problem will be solved by changing the table you
> designed so that the tag column is a set<text> with a secondary index on it.
> Is there any risk of performance problems with this design, keeping in mind
> the huge data?
>
>
> Naresh
>
> On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav <[email protected]> wrote:
>
>> @Thunder thanks for suggesting a design, but my main problem is
>> indexing/querying the dynamic Tag on each row. It is the main context of
>> each row, and most queries will include it.
>>
>> As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am
>> able to store exactly the same data, and all the queries worked, so Blur
>> allows dynamic indexing of the tag column. BUT by moving away from
>> Cassandra I am losing its strengths, so I am not confident in this
>> decision, as the data will be huge in my case.
>>
>> Please guide me on this with better suggestions.
>>
>> Thanks
>> Naresh
>>
>> On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges <
>> [email protected]> wrote:
>>
>>> Well I think you have essentially time-series data, which C* should
>>> handle well; however, I think your "Tag" column is going to cause trouble.
>>> C* does have collection columns, but they are not indexable nor usable in
>>> a WHERE clause. Your example has both the uniqueness of the data (primary
>>> key) and query filtering on potentially multiple "Tag" columns. That is not
>>> supported in C* AFAIK. If it were a single Tag, it could possibly be an
>>> indexed column.
>>>
>>> Ignoring that issue with the many different Tags, you could model the
>>> table as:
>>>
>>> CREATE TABLE metric_data (
>>>   metric text,
>>>   time text,    -- granularity, e.g. Month, Year, Week
>>>   period text,  -- e.g. Jan-10, 2010
>>>   tag text,
>>>   value int,
>>>   PRIMARY KEY ((metric, time), period, tag)
>>> );
>>>
>>> That would make a composite partition key on metric and time, meaning
>>> you'd always have to pass those (or else randomly page via TOKEN through
>>> all rows). After specifying metric and time, you could optionally also
>>> specify period and/or tag, and results would be ordered (clustered) by
>>> period. This would satisfy your queries a, b, and d but not c (as you did
>>> not specify time). If Time is a granularity column, does it even make
>>> sense to return records across differing time values? What does it mean to
>>> return the 4 month rows and 1 year row in your example? Could you issue N
>>> queries in this case (where N is the small number of your time
>>> granularities)?
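>>>
>>> A sketch of what those reads and writes could look like in CQL against
>>> the metric_data table above, using values from your sample data. This is
>>> untested; the tag lookup is shown with period also specified, which is
>>> the form I'm most sure of.
>>>
>>> ```cql
>>> -- Query a: everything for one metric and time granularity
>>> SELECT * FROM metric_data
>>> WHERE metric = 'Sales' AND time = 'Month';
>>>
>>> -- Query b: narrow to one period (first clustering column)
>>> SELECT * FROM metric_data
>>> WHERE metric = 'Sales' AND time = 'Month' AND period = 'Jan-10';
>>>
>>> -- Single tag within a period (second clustering column)
>>> SELECT * FROM metric_data
>>> WHERE metric = 'Sales' AND time = 'Month'
>>>   AND period = 'Jan-10' AND tag = 'Pen';
>>>
>>> -- Writes are upserts: inserting the same primary key again
>>> -- (metric, time, period, tag) overwrites the stored value.
>>> INSERT INTO metric_data (metric, time, period, tag, value)
>>> VALUES ('Sales', 'Month', 'Jan-10', 'Pen', 10);
>>> ```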
>>>
>>> I'm not sure how close that gets you, or if you can re-work your concept
>>> of Tag at all.
>>> Good luck.
>>> Thunder
>>>
>>>
>>>
>>> On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger <[email protected]> wrote:
>>>
>>>> To my eye that looks like something traditional analytics systems
>>>> do. You can check out e.g. Acunu Analytics, which uses Cassandra as a
>>>> backend.
>>>>
>>>> Cheers,
>>>> Hannu
>>>>
>>>>
>>>> 2014/1/9 Naresh Yadav <[email protected]>
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a use case with huge data that I am not able to model in
>>>>> Cassandra.
>>>>>
>>>>> Table name : MetricResult
>>>>>
>>>>> Sample Data :
>>>>>
>>>>> Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pen,    Value=10
>>>>> Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pencil, Value=20
>>>>> Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pen,    Value=30
>>>>> Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pencil, Value=10
>>>>> Metric=Sales,    Time=Month, Period=Feb-10,     Tag=India,             Value=90
>>>>> Metric=Sales,    Time=Year,  Period=2010,       Tag=U.S.A,             Value=70
>>>>> Metric=Cost,     Time=Year,  Period=2010,       Tag=CPU,               Value=8000
>>>>> Metric=Cost,     Time=Year,  Period=2010,       Tag=RAM,               Value=4000
>>>>> Metric=Cost,     Time=Year,  Period=2011,       Tag=CPU,               Value=9000
>>>>> Metric=Resource, Time=Week,  Period=Week1-2013,                        Value=100
>>>>>
>>>>> So in the above case I have:
>>>>>          Time-series data, i.e. the Time and Period columns
>>>>>          Dynamic columns, i.e. the Tag column
>>>>>          Indexing on dynamic columns, i.e. the Tag column
>>>>>          Aggregations: SUM, AVERAGE
>>>>>          Overwrites: if a value arrives again for the same Metric, Time,
>>>>>          Period, Tag, it replaces the old one
>>>>>
>>>>> Queries I need to support:
>>>>> --------------------------------------
>>>>> a) Give data for Metric=Sales AND Time=Month
>>>>>        O/P: 5 rows
>>>>> b) Give data for Metric=Sales AND Time=Month AND Period=Jan-10
>>>>>        O/P: 2 rows
>>>>> c) Give data for Metric=Sales AND Tag=U.S.A
>>>>>        O/P: 5 rows
>>>>> d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
>>>>>        O/P: 1 row
>>>>>
>>>>>
>>>>> This table can hold terabytes of data, and a single Metric, Period
>>>>> combination can have millions of rows.
>>>>>
>>>>> Please suggest how to design/model this table in Cassandra. If
>>>>> Cassandra has some limitation here, please suggest the best technology
>>>>> to handle this.
>>>>>
>>>>>
>>>>> Thanks
>>>>> Naresh
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>
>
