@Naresh Too many indexes, or indexes on high-cardinality columns, should be avoided, as they tend to cause performance problems. Also note that a set will not contain duplicate values.
-Vivek

On Fri, Jan 10, 2014 at 5:48 PM, Naresh Yadav <[email protected]> wrote:

> @Thunder
> I just came to know about
> (CASSANDRA-4511 <https://issues.apache.org/jira/browse/CASSANDRA-4511>)
> which allows Index on Collections and that will be part of release 2.1.
> I hope in that case my problem will be solved by changing your designed
> table with tag column as set<text> and defining secondary index on it. Is
> there any risk of performance problem of this design keeping in mind huge
> data ???
>
> Naresh
>
> On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav <[email protected]> wrote:
>
>> @Thunder thanks for suggesting design but my main problem is
>> indexing/querying dynamic Tag on each row that is main context of each row
>> and most of queries will include that.
>>
>> As an alternative to cassandra, i tried Apache Blur, in blur table i am
>> able to store exact same data and all queries also worked. So blur allows
>> dynamic indexing of tag column BUT moving away from cassandra, i am
>> losing its strength; because of that i am not confident on this decision as
>> data will be huge in my case.
>>
>> Please guide me on this with better suggestions.
>>
>> Thanks
>> Naresh
>>
>> On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges <[email protected]> wrote:
>>
>>> Well I think you have essentially time-series data, which C* should
>>> handle well, however I think your "Tag" column is going to cause troubles.
>>> C* does have collection columns, but they are not indexable nor usable in
>>> WHERE clause. Your example has both the uniqueness of the data (primary
>>> key) and query filtering on potentially multiple "Tag" columns. That is not
>>> supported in C* AFAIK. If it were a single Tag, that could be a column that
>>> is Indexed possibly.
>>>
>>> Ignoring that issue with the many different Tags, you could model the
>>> table as:
>>>
>>> CREATE TABLE metric_data (
>>>   metric text,
>>>   time text,
>>>   period text,
>>>   tag text,
>>>   value int,
>>>   PRIMARY KEY ((metric, time), period, tag)
>>> );
>>>
>>> That would make a composite partitioning key on metric and time meaning
>>> you'd always have to pass those (or else randomly page via TOKEN through
>>> all rows). After specifying metric and time, you could optionally also
>>> specify period and/or tag, and results would be ordered (clustered) by
>>> period. This would satisfy your queries a, b, and d but not c (as you did
>>> not specify time). If Time was a granularity column, does it even make
>>> sense to return records across differing time values? What does it mean to
>>> return the 4 month rows and 1 year row in your example? Could you issue N
>>> queries in this case (where N is a small number of each of your time
>>> granularities)?
>>>
>>> I'm not sure how close that gets you, or if you can re-work your concept
>>> of Tag at all.
>>> Good luck.
>>> Thunder
>>>
>>> On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger <[email protected]> wrote:
>>>
>>>> To my eye that looks something what the traditional analytics systems
>>>> do. You can check out e.g. Acunu Analytics which uses Cassandra as a
>>>> backend.
>>>>
>>>> Cheers,
>>>> Hannu
>>>>
>>>> 2014/1/9 Naresh Yadav <[email protected]>
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a use case with huge data which i am not able to design in
>>>>> cassandra.
>>>>>
>>>>> Table name : MetricResult
>>>>>
>>>>> Sample Data :
>>>>>
>>>>> Metric=Sales, Time=Month, Period=Jan-10, Tag=U.S.A, Tag=Pen, Value=10
>>>>> Metric=Sales, Time=Month, Period=Jan-10, Tag=U.S.A, Tag=Pencil, Value=20
>>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pen, Value=30
>>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pencil, Value=10
>>>>> Metric=Sales, Time=Month, Period=Feb-10, Tag=India, Value=90
>>>>> Metric=Sales, Time=Year, Period=2010, Tag=U.S.A, Value=70
>>>>> Metric=Cost, Time=Year, Period=2010, Tag=CPU, Value=8000
>>>>> Metric=Cost, Time=Year, Period=2010, Tag=RAM, Value=4000
>>>>> Metric=Cost, Time=Year, Period=2011, Tag=CPU, Value=9000
>>>>> Metric=Resource, Time=Week, Period=Week1-2013, Value=100
>>>>>
>>>>> So in above case i have case of
>>>>>   TimeSeries data i.e Time,Period column
>>>>>   Dynamic columns i.e Tag column
>>>>>   Indexing on dynamic columns i.e Tag column
>>>>>   Aggregations SUM, AVERAGE
>>>>>   Same value comes again for a Metric, Time, Period, Tag then
>>>>>   overwrite it
>>>>>
>>>>> Queries i need to support :
>>>>> --------------------------------------
>>>>> a) Give data for Metric=Sales AND Time=Month
>>>>>    O/P : 5 rows
>>>>> b) Give data for Metric=Sales AND Time=Month AND Period=Jan-10
>>>>>    O/P : 2 rows
>>>>> c) Give data for Metric=Sales AND Tag=U.S.A
>>>>>    O/P : 5 rows
>>>>> d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND
>>>>>    Tag=Pen
>>>>>    O/P : 1 row
>>>>>
>>>>> This table can have TB's of data and for a Metric,Period can have
>>>>> millions of rows.
>>>>>
>>>>> Please give suggestion to design/model this table in Cassandra. If
>>>>> some limitation in Cassandra then suggest best technology to handle this.
>>>>>
>>>>> Thanks
>>>>> Naresh
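
For readers following the thread, here is a rough sketch of how the queries from the original post might look against the metric_data table Thunder proposes. It assumes that table exists exactly as written in his reply and that each row carries a single tag, so the 'Pen' tag used below is an illustrative choice rather than the full two-tag sample row. The literal values come from the sample data; nothing here has been run against a cluster.

```cql
-- Assumes the metric_data table from Thunder's reply, with
-- PRIMARY KEY ((metric, time), period, tag) and one tag per row.

-- Writing the same (metric, time, period, tag) again overwrites the value,
-- because an INSERT in CQL is an upsert on the primary key.
INSERT INTO metric_data (metric, time, period, tag, value)
VALUES ('Sales', 'Month', 'Jan-10', 'Pen', 10);

-- Query (a): the full partition key only.
SELECT * FROM metric_data
WHERE metric = 'Sales' AND time = 'Month';

-- Query (b): partition key plus the first clustering column.
SELECT * FROM metric_data
WHERE metric = 'Sales' AND time = 'Month' AND period = 'Jan-10';

-- Query (d), approximated: time has to be supplied, and only one tag
-- can be matched because the row stores a single tag.
SELECT * FROM metric_data
WHERE metric = 'Sales' AND time = 'Month'
  AND period = 'Jan-10' AND tag = 'Pen';
```

Query (c) omits time, so it cannot be answered from one partition; Thunder's suggestion of issuing one query per time granularity applies there. Also, as far as I know, restricting tag without also restricting period is not accepted as a plain query, since clustering columns are restricted in declaration order.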
