Re: RE Ordering counters in Cassandra

2012-05-23 Thread aaron morton
 Just out of curiosity, is there any underlying architectural reason why it's
 not possible to order a row based on its counter values? Or is it something
 that might be on the roadmap for the future?
It wouldn't work well with consistency levels.
Also, sorting a list of values at the same time that multiple clients are
modifying them would not work very well.
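A toy illustration of the point (plain Python, nothing Cassandra-specific, all names invented): any ordering computed from counter values is only a snapshot, and concurrent increments invalidate it immediately, so the server could never keep a row "sorted by counter".

```python
import threading

# In-memory stand-in for a row of counters.
counters = {"event1": 1050, "event2": 1200, "event3": 830}

def bump(name, times):
    # A concurrent client incrementing one counter, as writers do in a counter CF.
    for _ in range(times):
        counters[name] += 1

# An ordering computed from a snapshot of the values...
snapshot = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)

writer = threading.Thread(target=bump, args=("event3", 1000))
writer.start()
writer.join()

# ...is already stale: the leader has changed under our feet.
fresh = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
```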

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/05/2012, at 12:25 AM, samal wrote:

 Secondary indexes are not supported on counters; besides, you must know the
 column name to use a secondary index on a regular column.
 
 On 22-May-2012 5:34 PM, Filippo Diotalevi fili...@ntoklo.com wrote:
 Thanks for all the answers, they definitely helped.
 
 Just out of curiosity, is there any underlying architectural reason why it's
 not possible to order a row based on its counter values? Or is it something
 that might be on the roadmap for the future?
 
 -- 
 Filippo Diotalevi
 
 On Tuesday, 22 May 2012 at 08:48, Romain HARDOUIN wrote:
 
 
 I mean iterate over each column -- more precisely, over *bunches of columns* using
 slices -- and write new columns into the inverted index.
 Tamar's data model is made for real-time analysis. It's maybe overdesigned
 for a daily ranking.
 I agree with Samal: you should split your data across the token space.
 Only feeding CF Ranking would be affected, not the top-N queries.
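The "bunches of columns" iteration can be sketched like this (plain Python; the in-memory `get_slice` is only a stand-in for a real column-slice call, and every name here is invented): page through the wide row by restarting each slice after the last column seen, then invert.

```python
# Wide row stand-in: column name -> counter value.
row = {f"event{i:04d}": i for i in range(1, 251)}

def get_slice(start, count):
    # Columns come back sorted by name, like a Cassandra slice;
    # `start` is exclusive so consecutive pages do not overlap.
    names = sorted(n for n in row if n > start) if start else sorted(row)
    return [(n, row[n]) for n in names[:count]]

def iter_columns(page_size=100):
    start = ""
    while True:
        page = get_slice(start, page_size)
        if not page:
            break
        yield from page
        start = page[-1][0]  # resume after the last column of this bunch

# Build the inverted entries bunch by bunch instead of loading the whole row.
inverted = sorted(((count, name) for name, count in iter_columns()), reverse=True)
```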
 
 Filippo Diotalevi fili...@ntoklo.com wrote on 21/05/2012 19:05:28:
 
  Hi Romain, 
  thanks for your suggestion. 
  
  When you say "build every day a ranking in a dedicated CF by
  iterating over events", do you mean:
  - load all the columns for the specified row key
  - iterate over each column, and write a new column in the inverted index?
  
  That's my current approach, but since I have many of these wide rows
  (1 per day), the process is extremely slow: it involves moving an
  entire row from Cassandra to the client, inverting every column, and
  sending the data back to create the inverted index.
 



Re: RE Ordering counters in Cassandra

2012-05-22 Thread samal
In some cases Cassandra is really good, and in some cases it is not.

As I understand your approach, you are recording all of your events under a
single key, is that right? Not recommended. The row can grow really big, and
if you have a cluster of servers it will hit only one server all the time and
overwhelm it, while the rest sit idle.

What I would do is figure out which similar events are occurring, and then
bucket by those events.

E.g., if an event occurred on iOS or Android, I would bucket by an iOS key
and an Android key, so each counter gives me all events that occurred on iOS
or Android.

Key concatenation can also be used to filter more deeply: iOS#safari,
android#chrome.

A smaller number of columns will make building the reverse index more
efficient.
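The bucketing idea can be sketched in plain Python (an in-memory stand-in with invented names; in Cassandra each bucket key would be its own row of counters):

```python
from collections import defaultdict

# bucket key -> event name -> count
buckets = defaultdict(lambda: defaultdict(int))

def record(platform, browser, event):
    # Concatenated bucket key, e.g. "ios#safari", to filter more deeply.
    key = f"{platform}#{browser}"
    buckets[key][event] += 1

record("ios", "safari", "page_view")
record("ios", "safari", "click")
record("android", "chrome", "page_view")

# Each bucket stays small, so ranking it is cheap.
top = sorted(buckets["ios#safari"].items(), key=lambda kv: kv[1], reverse=True)
```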

/Samal

On Mon, May 21, 2012 at 11:53 PM, Tamar Fraenkel ta...@tok-media.com wrote:

 Indeed, I took the no-delete approach. If the time bucket rows are not that
 big, this is a good temporary solution.
 I just finished implementation and testing now on a small staging
 environment. So far so good.
 Tamar

 Sent from my iPod

 On May 21, 2012, at 9:11 PM, Filippo Diotalevi fili...@ntoklo.com wrote:

  Hi Tamar,
 the solution you propose is indeed a temporary solution, but it might be
 the best one.

 Which approach did you follow?
 I'm a bit concerned about the deletion approach, since in case of
 concurrent writes on the same counter you might lose the pointer to the
 column to delete.

 --
 Filippo Diotalevi


 On Monday, 21 May 2012 at 18:51, Tamar Fraenkel wrote:

 I also had a similar problem. I have a temporary solution, which is not
 the best, but may be of help.
 I have the counter CF to count events, but apart from that I hold a leaders
 CF:

 leaders = {
   // key is time bucket
   // values are composites(rank, event) ordered by
   // descending order of the rank
   // set relevant TTL on columns
   time_bucket1 : {
  composite(1000,event1) : ""
  composite(999, event2) : ""
   },
   ...
 }

 Whenever I increment the counter for a specific event, I add a column in the
 time bucket row of the leaders CF, with the new value of the counter and
 the event name.
 There are two ways to go here: either delete the old column(s) for that
 event (with lower counters) from the leaders CF, or let them be.
 If you choose to delete, there is the complication of not having getAndSet
 for counters, so you may end up not deleting all the old columns.
 If you choose not to delete old columns, and live with duplicate columns
 for events (each with a different count), the query to retrieve the
 leaders will run longer.
 Anyway, when you need to retrieve the leaders, you can do a slice query
 on the leaders CF and ignore duplicate events in the client (I use Java).
 This will happen less if you do delete old columns.
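A plain-Python sketch of this leaders-CF bookkeeping (an in-memory stand-in with invented names; real code would write composite(count, event) columns with a TTL and let the comparator do the ordering):

```python
from collections import defaultdict

counters = defaultdict(int)   # the counter CF: event -> count
leaders = defaultdict(set)    # time bucket -> {(count, event), ...} composite columns

def bump(bucket, event):
    counters[event] += 1
    # "No delete" variant: older (count, event) entries are left in place.
    leaders[bucket].add((counters[event], event))

def top_n(bucket, n):
    out, seen = [], set()
    # Descending sort plays the role of the reversed comparator.
    for count, event in sorted(leaders[bucket], reverse=True):
        if event not in seen:          # client-side dedup of stale columns
            seen.add(event)
            out.append((event, count))
        if len(out) == n:
            break
    return out

for _ in range(3):
    bump("2012-05-21", "event1")
bump("2012-05-21", "event2")
```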

 Another option is not to use Cassandra for that purpose; http://redis.io/ is
 a nice tool.

 Will be happy to hear your comments.
 Thanks,

 *Tamar Fraenkel *
 Senior Software Engineer, TOK Media



 ta...@tok-media.com
 Tel:   +972 2 6409736
 Mob:  +972 54 8356490
 Fax:   +972 2 5612956





 On Mon, May 21, 2012 at 8:05 PM, Filippo Diotalevi fili...@ntoklo.com wrote:

 Hi Romain,
 thanks for your suggestion.

 When you say "build every day a ranking in a dedicated CF by iterating
 over events", do you mean
 - load all the columns for the specified row key
 - iterate over each column, and write a new column in the inverted index?

 That's my current approach, but since I have many of these wide rows (1
 per day), the process is extremely slow: it involves moving an entire row
 from Cassandra to the client, inverting every column, and sending the data
 back to create the inverted index.

 --
 Filippo Diotalevi


 On Monday, 21 May 2012 at 17:19, Romain HARDOUIN wrote:


 If I understand correctly, you've got a data model which looks like this:

 CF Events:
 row1: { event1: 1050, event2: 1200, event3: 830, ... }

 You can't query on column values but you can build every day a ranking in
 a dedicated CF by iterating over events:

 create column family Ranking
 with comparator = 'LongType(reversed=true)'
 ...

 CF Ranking:
 rank: { 1200: event2, 1050: event1, 830: event3, ... }

 Then you can make a top ten or whatever you want because counter values
 will be sorted.
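As a stand-in sketch (plain Python, invented names), the daily job copies each (event, count) pair into a row keyed by the count; in the real CF the reversed LongType comparator keeps those columns sorted for you. Note that two events with equal counts would collide on the same column name here; a composite(count, event) name avoids that.

```python
# CF Events stand-in: one day's row of counters.
events_row = {"event1": 1050, "event2": 1200, "event3": 830}

# CF Ranking stand-in: the column name is the counter value itself.
ranking_row = {}
for event, count in events_row.items():
    ranking_row[count] = event

# With comparator = 'LongType(reversed=true)' the columns already come
# back largest-first; a descending sort simulates that here.
top_ten = [ranking_row[c] for c in sorted(ranking_row, reverse=True)][:10]
```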


 Filippo Diotalevi fili...@ntoklo.com wrote on 21/05/2012 16:59:43:

  Hi,
  I'm trying to understand what's the best design for a simple
  ranking use case.
  I have, in a row, a good number (10k to a few 100K) of counters; each
  one is counting the occurrences of an event. At the end of the day, I
  want to create a ranking of the most frequent events.
 
  What's the best approach to perform this task?
  The brute-force approach of retrieving the row and ordering it
  doesn't work well (the call usually times out, especially if
  Cassandra is also under load); I also don't know in advance the full
  set of event names (column names), so it's difficult to slice the get
  call.
 
  Is there any trick to solve this problem? Maybe a way to retrieve
  the row ordering for counter values?

Re: RE Ordering counters in Cassandra

2012-05-22 Thread Filippo Diotalevi
Thanks for all the answers, they definitely helped.  

Just out of curiosity, is there any underlying architectural reason why it's
not possible to order a row based on its counter values? Or is it something
that might be on the roadmap for the future?

--  
Filippo Diotalevi


On Tuesday, 22 May 2012 at 08:48, Romain HARDOUIN wrote:

  
 I mean iterate over each column -- more precisely, over *bunches of columns* using
 slices -- and write new columns into the inverted index.
 Tamar's data model is made for real-time analysis. It's maybe overdesigned
 for a daily ranking.
 I agree with Samal: you should split your data across the token space.
 Only feeding CF Ranking would be affected, not the top-N queries.
  



Re: RE Ordering counters in Cassandra

2012-05-22 Thread samal
Secondary indexes are not supported on counters; besides, you must know the
column name to use a secondary index on a regular column.
On 22-May-2012 5:34 PM, Filippo Diotalevi fili...@ntoklo.com wrote:

  Thanks for all the answers, they definitely helped.

 Just out of curiosity, is there any underlying architectural reason why
 it's not possible to order a row based on its counter values? Or is it
 something that might be on the roadmap for the future?

 --
 Filippo Diotalevi






Re: RE Ordering counters in Cassandra

2012-05-21 Thread Filippo Diotalevi
Hi Romain,  
thanks for your suggestion.

When you say "build every day a ranking in a dedicated CF by iterating over
events", do you mean
- load all the columns for the specified row key
- iterate over each column, and write a new column in the inverted index?

That's my current approach, but since I have many of these wide rows (1 per
day), the process is extremely slow: it involves moving an entire row from
Cassandra to the client, inverting every column, and sending the data back
to create the inverted index.

--  
Filippo Diotalevi



On Monday, 21 May 2012 at 17:19, Romain HARDOUIN wrote:

  
 If I understand correctly, you've got a data model which looks like this:
  
 CF Events:  
 row1: { event1: 1050, event2: 1200, event3: 830, ... }  
  
 You can't query on column values but you can build every day a ranking in a 
 dedicated CF by iterating over events:  
  
 create column family Ranking  
 with comparator = 'LongType(reversed=true)'
 ...  
  
 CF Ranking:  
 rank: { 1200: event2, 1050: event1, 830: event3, ... }  
  
 Then you can make a top ten or whatever you want because counter values 
 will be sorted.  
  
  
 Filippo Diotalevi fili...@ntoklo.com (mailto:fili...@ntoklo.com) wrote
 on 21/05/2012 16:59:43:
  
  Hi,  
  I'm trying to understand what's the best design for a simple
  ranking use case.
  I have, in a row, a good number (10k to a few 100K) of counters; each
  one is counting the occurrences of an event. At the end of the day, I
  want to create a ranking of the most frequent events.
   
  What's the best approach to perform this task?
  The brute-force approach of retrieving the row and ordering it
  doesn't work well (the call usually times out, especially if
  Cassandra is also under load); I also don't know in advance the full
  set of event names (column names), so it's difficult to slice the get call.
   
   
  Is there any trick to solve this problem? Maybe a way to retrieve  
  the row ordering for counter values?  
   
  Thanks,  
  --  
  Filippo Diotalevi  



Re: RE Ordering counters in Cassandra

2012-05-21 Thread Tamar Fraenkel
I also had a similar problem. I have a temporary solution, which is not
the best, but may be of help.
I have the counter CF to count events, but apart from that I hold a leaders
CF:

leaders = {
  // key is time bucket
  // values are composites(rank, event) ordered by
  // descending order of the rank
  // set relevant TTL on columns
  time_bucket1 : {
  composite(1000,event1) : ""
  composite(999, event2) : ""
  },
  ...
}

Whenever I increment the counter for a specific event, I add a column in the
time bucket row of the leaders CF, with the new value of the counter and
the event name.
There are two ways to go here: either delete the old column(s) for that
event (with lower counters) from the leaders CF, or let them be.
If you choose to delete, there is the complication of not having getAndSet
for counters, so you may end up not deleting all the old columns.
If you choose not to delete old columns, and live with duplicate columns
for events (each with a different count), the query to retrieve the
leaders will run longer.
Anyway, when you need to retrieve the leaders, you can do a slice query
on the leaders CF and ignore duplicate events in the client (I use Java).
This will happen less if you do delete old columns.

Another option is not to use Cassandra for that purpose; http://redis.io/ is
a nice tool.

Will be happy to hear your comments.
Thanks,

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


ta...@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Mon, May 21, 2012 at 8:05 PM, Filippo Diotalevi fili...@ntoklo.com wrote:

 Hi Romain,
 thanks for your suggestion.

 When you say "build every day a ranking in a dedicated CF by iterating
 over events", do you mean
 - load all the columns for the specified row key
 - iterate over each column, and write a new column in the inverted index?

 That's my current approach, but since I have many of these wide rows (1
 per day), the process is extremely slow: it involves moving an entire row
 from Cassandra to the client, inverting every column, and sending the data
 back to create the inverted index.

 --
 Filippo Diotalevi





Re: RE Ordering counters in Cassandra

2012-05-21 Thread Filippo Diotalevi


Hi Tamar,
the solution you propose is indeed a "temporary solution", but it might be the best one.

Which approach did you follow?
I'm a bit concerned about the deletion approach, since in case of concurrent writes on the same counter you might "lose" the pointer to the column to delete.

--
Filippo Diotalevi
 

Re: RE Ordering counters in Cassandra

2012-05-21 Thread Tamar Fraenkel
Indeed, I took the no-delete approach. If the time bucket rows are not that big,
this is a good temporary solution.
I just finished implementation and testing now on a small staging environment. 
So far so good.
Tamar

Sent from my iPod

On May 21, 2012, at 9:11 PM, Filippo Diotalevi fili...@ntoklo.com wrote:

 Hi Tamar,
 the solution you propose is indeed a temporary solution, but it might be 
 the best one.
 
 Which approach did you follow?
 I'm a bit concerned about the deletion approach, since in case of concurrent 
 writes on the same counter you might lose the pointer to the column to 
 delete. 
 
 -- 
 Filippo Diotalevi
 
 