Re: Help on Designing Cassandra table for my usecase

Thunder Stumpges Fri, 10 Jan 2014 07:05:57 -0800

It does sound like that could work for you. From the sample data it doesn't
look like tag will be high cardinality (relative to the number of rows), so it
should be fine as long as you won't have rows with too many tags (collections
are best kept small; they claim a collection can hold hundreds of items, but
must not exceed 64k). That said, I don't have any experience with secondary
indexes under load, and definitely not with indexes on collections.
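
For reference, a minimal sketch of the design under discussion. It assumes
Cassandra 2.1 or later (the release CASSANDRA-4511 is slated for). The tags
column name, the tag_key column and the index name are made up for
illustration and do not come from the thread. A regular (non-frozen)
collection can't be part of the primary key, so the sketch adds a hypothetical
tag_key clustering column to keep rows with different tag sets distinct:

```cql
CREATE TABLE metric_data (
  metric  text,
  time    text,
  period  text,
  tag_key text,       -- hypothetical: canonical string of the row's tags, e.g. 'Pen|U.S.A'
  tags    set<text>,  -- the dynamic tags for this row
  value   int,
  PRIMARY KEY ((metric, time), period, tag_key)
);

-- Secondary index on the collection (index name is an assumption)
CREATE INDEX metric_data_tags_idx ON metric_data (tags);

-- Rows in one (metric, time) partition carrying a given tag;
-- depending on the version this may additionally need ALLOW FILTERING
SELECT * FROM metric_data
 WHERE metric = 'Sales' AND time = 'Month' AND tags CONTAINS 'U.S.A';
```

Keeping the partition key in the WHERE clause limits the index lookup to a
single partition, which is usually the safer way to use a secondary index on
large data.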


Looks promising though!
Good luck,
Thunder



> On Jan 10, 2014, at 5:02 AM, Naresh Yadav <[email protected]> wrote:
> 
> @vivek thanks for pointing that out. Other than the primary key, I am
> defining only one secondary index, on tags. In my case the same tags will
> certainly repeat across periods for a given metric (e.g. metric=Sales), and
> to some extent, though not always, the same set of tags can also appear
> across metrics such as Sales and Cost.
> 
> 
> Thanks
> Naresh
> 
> 
>> On Fri, Jan 10, 2014 at 6:05 PM, Vivek Mishra <[email protected]> wrote:
>> @Naresh
>> Too many indices, or indices on high-cardinality columns, should be
>> avoided; they always cause performance issues. A set will not contain
>> duplicate values.
>> 
>> -Vivek
>> 
>> 
>>> On Fri, Jan 10, 2014 at 5:48 PM, Naresh Yadav <[email protected]> wrote:
>>> @Thunder
>>> I just came to know about CASSANDRA-4511, which allows indexes on
>>> collections and will be part of release 2.1.
>>> I hope that in that case my problem will be solved by changing the table
>>> you designed so that the tag column is a set<text>, and defining a
>>> secondary index on it. Is there any risk of performance problems with this
>>> design, keeping in mind the huge amount of data?
>>> 
>>> 
>>> Naresh
>>> 
>>>> On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav <[email protected]> 
>>>> wrote:
>>>> @Thunder thanks for suggesting a design, but my main problem is
>>>> indexing/querying the dynamic Tag on each row. Tag is the main context
>>>> of each row, and most queries will include it.
>>>> 
>>>> As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am
>>>> able to store exactly the same data, and all the queries worked, so Blur
>>>> allows dynamic indexing of the tag column. BUT by moving away from
>>>> Cassandra I lose its strengths, so I am not confident in this decision,
>>>> as the data will be huge in my case.
>>>> 
>>>> Please guide me on this with better suggestions.
>>>> 
>>>> Thanks
>>>> Naresh
>>>> 
>>>>> On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges 
>>>>> <[email protected]> wrote:
>>>>> Well, I think you have essentially time-series data, which C* should
>>>>> handle well; however, I think your "Tag" column is going to cause
>>>>> trouble. C* does have collection columns, but they are neither indexable
>>>>> nor usable in a WHERE clause. Your example needs Tag both for the
>>>>> uniqueness of the data (primary key) and for query filtering on
>>>>> potentially multiple "Tag" columns. That is not supported in C* AFAIK.
>>>>> If it were a single Tag, that could possibly be an indexed column.
>>>>> 
>>>>> Ignoring that issue with the many different Tags, you could model the
>>>>> table as:
>>>>> 
>>>>> CREATE TABLE metric_data (
>>>>>   metric text,
>>>>>   time text,
>>>>>   period text,
>>>>>   tag text,
>>>>>   value int,
>>>>>   PRIMARY KEY( (metric,time), period, tag)
>>>>> );
>>>>> 
>>>>> That would make a composite partition key on metric and time, meaning
>>>>> you'd always have to pass both of those (or else randomly page via TOKEN
>>>>> through all rows). After specifying metric and time, you could optionally
>>>>> also specify period and, once period is given, tag, and results would be
>>>>> ordered (clustered) by period. This would satisfy your queries a, b, and
>>>>> d, but not c (as you did not specify time). If Time is a granularity
>>>>> column, does it even make sense to return records across differing time
>>>>> values? What does it mean to return the 4 month rows and 1 year row in
>>>>> your example? Could you issue N queries in this case (where N is the
>>>>> small number of time granularities you have)?
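>>>>> 
>>>>> As a rough sketch, assuming the metric_data table above, the sample
>>>>> values from the original post, and one tag per row, the supported
>>>>> queries would look something like this (the d-style query also assumes
>>>>> the caller supplies time):
>>>>> 
>>>>> ```cql
>>>>> -- a) all monthly Sales rows: full partition key supplied
>>>>> SELECT * FROM metric_data
>>>>>  WHERE metric = 'Sales' AND time = 'Month';
>>>>> 
>>>>> -- b) narrow to one period via the first clustering column
>>>>> SELECT * FROM metric_data
>>>>>  WHERE metric = 'Sales' AND time = 'Month' AND period = 'Jan-10';
>>>>> 
>>>>> -- d) narrow further to a single tag (period must be given before tag)
>>>>> SELECT * FROM metric_data
>>>>>  WHERE metric = 'Sales' AND time = 'Month'
>>>>>    AND period = 'Jan-10' AND tag = 'Pen';
>>>>> ```
>>>>> 
>>>>> A query like c, which omits time, would have to be repeated once per
>>>>> time granularity.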
>>>>> 
>>>>> I'm not sure how close that gets you, or if you can re-work your concept 
>>>>> of Tag at all.
>>>>> Good luck.
>>>>> Thunder
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger <[email protected]> wrote:
>>>>>> To my eye that looks like something the traditional analytics systems
>>>>>> do. You can check out e.g. Acunu Analytics, which uses Cassandra as a
>>>>>> backend.
>>>>>> 
>>>>>> Cheers,
>>>>>> Hannu
>>>>>> 
>>>>>> 
>>>>>> 2014/1/9 Naresh Yadav <[email protected]>
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I have a use case with a huge amount of data which I am not able to
>>>>>>> design in Cassandra.
>>>>>>> 
>>>>>>> Table name : MetricResult      
>>>>>>> 
>>>>>>> Sample Data :
>>>>>>> 
>>>>>>> Metric=Sales, Time=Month, Period=Jan-10, Tag=U.S.A, Tag=Pen, Value=10
>>>>>>> Metric=Sales, Time=Month, Period=Jan-10, Tag=U.S.A, Tag=Pencil, Value=20
>>>>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pen, Value=30
>>>>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pencil, Value=10
>>>>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=India, Value=90
>>>>>>> Metric=Sales, Time=Year, Period=2010, Tag=U.S.A, Value=70
>>>>>>> Metric=Cost, Time=Year, Period=2010, Tag=CPU, Value=8000
>>>>>>> Metric=Cost, Time=Year, Period=2010, Tag=RAM, Value=4000
>>>>>>> Metric=Cost, Time=Year, Period=2011, Tag=CPU, Value=9000
>>>>>>> Metric=Resource, Time=Week, Period=Week1-2013, Value=100
>>>>>>> 
>>>>>>> So in the above case I have:
>>>>>>>          Time-series data, i.e. the Time and Period columns
>>>>>>>          Dynamic columns, i.e. the Tag column
>>>>>>>          Indexing on dynamic columns, i.e. the Tag column
>>>>>>>          Aggregations: SUM, AVERAGE
>>>>>>>          Overwrites: if a value arrives again for the same Metric, Time,
>>>>>>>          Period, Tag, it overwrites the old one
>>>>>>> 
>>>>>>> Queries I need to support:
>>>>>>> --------------------------------------
>>>>>>> a) Give data for Metric=Sales AND Time=Month
>>>>>>>        O/P: 5 rows
>>>>>>> b) Give data for Metric=Sales AND Time=Month AND Period=Jan-10
>>>>>>>        O/P: 2 rows
>>>>>>> c) Give data for Metric=Sales AND Tag=U.S.A
>>>>>>>        O/P: 5 rows
>>>>>>> d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
>>>>>>>        O/P: 1 row
>>>>>>> 
>>>>>>> 
>>>>>>> This table can hold terabytes of data, and a single Metric, Period
>>>>>>> combination can have millions of rows.
>>>>>>> 
>>>>>>> Please suggest how to design/model this table in Cassandra. If this
>>>>>>> hits a limitation in Cassandra, please suggest the best technology to
>>>>>>> handle it.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Naresh
> 
> 
>
