Indexes on columns with high cardinality are a general database issue, not something unique to Cassandra or NoSQL.
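
For illustration only, here is a rough sketch against the metric_data table Thunder proposed in the quoted mail below. The index name metric_data_tag_idx is made up, and I am assuming a Cassandra version that accepts a secondary index on a clustering column. Note that this table holds a single tag per row, so it does not cover the multi-tag case.

```cql
-- Assumes the metric_data table from Thunder's mail quoted below.
-- metric_data_tag_idx is a hypothetical name.
CREATE INDEX metric_data_tag_idx ON metric_data (tag);

-- No partition key is given here, so the lookup cannot be routed to a
-- single partition and has to consult index data across the cluster.
SELECT metric, time, period, tag, value
FROM metric_data
WHERE tag = 'U.S.A';

-- Queries a and b from the thread need no index, because they supply
-- the full partition key (metric, time):
SELECT * FROM metric_data
WHERE metric = 'Sales' AND time = 'Month';

SELECT * FROM metric_data
WHERE metric = 'Sales' AND time = 'Month' AND period = 'Jan-10';
```

The index makes the tag lookup legal, but the more distinct tag values there are, the fewer rows each index entry points at, so the cost of the index is paid for very little selectivity gain per lookup.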

On Fri, Jan 10, 2014 at 7:35 AM, Vivek Mishra <[email protected]> wrote:

> @Naresh
> Too many indices or indices with high cardinality should be discouraged
> and are always performance issues. A set will not contain duplicate values.
>
> -Vivek
>
> On Fri, Jan 10, 2014 at 5:48 PM, Naresh Yadav <[email protected]> wrote:
>
>> @Thunder
>> I just came to know about
>> (CASSANDRA-4511<https://issues.apache.org/jira/browse/CASSANDRA-4511>)
>> which allows Index on Collections and that will be part of release 2.1.
>> I hope in that case my problem will be solved by changing your designed
>> table with tag column as set<text> and defining secondary index on it. Is
>> there any risk of performance problem of this design keeping in mind huge
>> data ???
>>
>> Naresh
>>
>> On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav <[email protected]> wrote:
>>
>>> @Thunder thanks for suggesting design but my main problem is
>>> indexing/quering dynamic Tag on each row that is main context of each row
>>> and most of queries will include that..
>>>
>>> As an alternative to cassandra, i tried Apache Blur, in blur table i am
>>> able to store exact same data and all queries also worked..so blur allows
>>> dynamic indexing of tag column BUT moving away from cassandra, i am
>>> loosing its strength because of that i am not confident on this decision as
>>> data will be huge in my case.
>>>
>>> Please guide me on this with better suggestions.
>>>
>>> Thanks
>>> Naresh
>>>
>>> On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges <[email protected]> wrote:
>>>
>>>> Well I think you have essentially time-series data, which C* should
>>>> handle well, however I think your "Tag" column is going to cause troubles.
>>>> C* does have collection columns, but they are not indexable nor usable in
>>>> WHERE clause. Your example has both the uniqueness of the data (primary
>>>> key) and query filtering on potentially multiple "Tag" columns. That is not
>>>> supported in C* AFAIK. If it were a single Tag, that could be a column that
>>>> is Indexed possibly.
>>>>
>>>> Ignoring that issue with the many different Tags, You could model the
>>>> table as:
>>>>
>>>> CREATE TABLE metric_data (
>>>>   metric text,
>>>>   time text,
>>>>   period text,
>>>>   tag text,
>>>>   value int,
>>>>   PRIMARY KEY( (metric,time), period, tag)
>>>> )
>>>>
>>>> That would make a composite partitioning key on metric and time meaning
>>>> you'd always have to pass those (or else randomly page via TOKEN through
>>>> all rows). After specifying metric and time, you could optionally also
>>>> specify period and/or tag, and results would be ordered (clustered) by
>>>> period. This would satisfy your queries a,b, and d but not c (as you did
>>>> not specify time). If Time was a granularity column, does it even make
>>>> sense to return records across differing time values? What does it mean to
>>>> return the 4 month rows and 1 year row in your example? Could you issue N
>>>> queries in this case (where N is a small number of each of your time
>>>> granularities) ?
>>>>
>>>> I'm not sure how close that gets you, or if you can re-work your
>>>> concept of Tag at all.
>>>> Good luck.
>>>> Thunder
>>>>
>>>> On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger <[email protected]> wrote:
>>>>
>>>>> To my eye that looks something what the traditional analytics systems
>>>>> do. You can check out e.g. Acunu Analytics which uses Cassandra as a
>>>>> backend.
>>>>>
>>>>> Cheers,
>>>>> Hannu
>>>>>
>>>>> 2014/1/9 Naresh Yadav <[email protected]>
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a use case with huge data which i am not able to design in
>>>>>> cassandra.
>>>>>>
>>>>>> Table name : MetricResult
>>>>>>
>>>>>> Sample Data :
>>>>>>
>>>>>> Metric=Sales, Time=Month, Period=Jan-10, Tag=U.S.A, Tag=Pen, Value=10
>>>>>> Metric=Sales, Time=Month, Period=Jan-10, Tag=U.S.A, Tag=Pencil, Value=20
>>>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pen, Value=30
>>>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pencil, Value=10
>>>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=India, Value=90
>>>>>> Metric=Sales, Time=Year, Period=2010, Tag=U.S.A, Value=70
>>>>>> Metric=Cost, Time=Year, Period=2010, Tag=CPU, Value=8000
>>>>>> Metric=Cost, Time=Year, Period=2010, Tag=RAM, Value=4000
>>>>>> Metric=Cost, Time=Year Period=2011, Tag=CPU, Value=9000
>>>>>> Metric=Resource, Time=Week Period=Week1-2013, Value=100
>>>>>>
>>>>>> So in above case i have case of
>>>>>> TimeSeries data i.e Time,Period column
>>>>>> Dynamic columns i.e Tag column
>>>>>> Indexing on dynamic columns i.e Tag column
>>>>>> Aggregations SUM, AVERAGE
>>>>>> Same value comes again for a Metric, Time, Period, Tag then
>>>>>> overwrite it
>>>>>>
>>>>>> Queries i need to support :
>>>>>> --------------------------------------
>>>>>> a)Give data for Metric=Sales AND Time=Month
>>>>>> O/P : 5 rows
>>>>>> b)Give data for Metric=Sales AND Time=Month AND Period=Jan-10
>>>>>> O/P : 2 rows
>>>>>> c)Give data for Metric=Sales AND Tag=U.S.A
>>>>>> O/P : 5 rows
>>>>>> d)Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
>>>>>> O/P :1 row
>>>>>>
>>>>>> This table can have TB's of data and for a Metric,Period can have
>>>>>> millions of rows.
>>>>>>
>>>>>> Please give suggestion to design/model this table in Cassandra. If
>>>>>> some limitation in Cassandra then suggest best technology to handle this.
>>>>>>
>>>>>> Thanks
>>>>>> Naresh
