Re: Node Dead/Up

2012-10-24 Thread Jason Wee
On Wed, Oct 24, 2012 at 2:32 PM, aaron morton aa...@thelastpickle.com wrote:

  I don't see errors in the logs, but I do see
 a lot of dropped mutations and reads. Any correlation?

 Yes.
 The dropped messages mean the server is overloaded.

 +1. Been there. An overloaded system normally produces frequent dropped
mutations and/or reads. Running nodetool tpstats will reveal many indicators.
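
For example (illustrative only; the counts here are made up), newer builds
print a dropped-message summary at the end of nodetool tpstats:

  $ nodetool -h localhost tpstats
  ...
  Message type           Dropped
  RANGE_SLICE                  0
  READ                      1243
  MUTATION                  5821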


 Look for log messages from the GCInspector in
 /var/log/cassandra/system.log and/or an overloaded IO system see
 http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html
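
 A quick hedged sketch of what to look at (device names and thresholds are
 assumptions):

   grep GCInspector /var/log/cassandra/system.log | tail -5
   # long ConcurrentMarkSweep / ParNew pauses here will line up with the DOWN/UP flaps
   iostat -x 5
   # sustained high %util and await point at a saturated disk
   vmstat 5
   # non-zero si/so means the box is swapping, which stalls the JVM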

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 24/10/2012, at 1:27 PM, Jason Hill jasonhill...@gmail.com wrote:

 thanks for the replies.

 I'll check the load on the node that is reported as DOWN/UP. At first
 glance it does not appear to be overloaded. But I will dig in deeper;
 is there a specific indicator on an ubuntu server that would be useful
 to me?

 Also, I didn't make it clear, but in my original post, there are logs
 from 2 different nodes: 10.21 and 10.25. They are each reporting that
 the other is DOWN/UP at the same time. Would that still point me to
 the suggestions you made? I don't see errors in the logs, but I do see
 a lot of dropped mutations and reads. Any correlation?

 thanks again,
 Jason

 On Tue, Oct 23, 2012 at 12:49 AM, aaron morton aa...@thelastpickle.com
 wrote:

 check 10.50.10.21 to see what the system load is.

 +1

 And take a look in the logs on 10.21.

 10.21 is being seen as down by the other nodes. It could be:

 * 10.21 failing to gossip fast enough, say by being overloaded or stuck in
 long ParNew GC pauses.
 * This node failing to process gossip fast enough, say by being overloaded or
 stuck in long ParNew GC pauses.
 * Problems with the tubes used to connect the nodes.

 (It's probably the first one.)

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 23/10/2012, at 8:19 PM, Jason Wee peich...@gmail.com wrote:

 check 10.50.10.21 to see what the system load is.

 On Tue, Oct 23, 2012 at 10:41 AM, Jason Hill jasonhill...@gmail.com
 wrote:


 Hello,

 I'm on version 1.0.11.

 I'm seeing this in my system log with occasional frequency:

 INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818)
 InetAddress /10.50.10.21 is now dead.
 INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804)
 InetAddress /10.50.10.21 is now UP


 INFO [StreamStage:1] 2012-10-23 02:24:38,763 StreamOutSession.java
 (line 228) Streaming to /10.50.10.25 --this line included for context
 INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818)
 InetAddress /10.50.10.25 is now dead.
 INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804)
 InetAddress /10.50.10.25 is now UP
 INFO [AntiEntropyStage:1] 2012-10-23 02:27:30,249
 AntiEntropyService.java (line 233) [repair
 #5a3383c0-1cb5-11e2--56b66459adef] Sending completed merkle tree
 to /10.50.10.25 for (Innovari,TICCompressedLoad) --this line included
 for context

 What is this telling me? Is my network dropping for less than a
 second? Are my nodes really dead and then up? Can someone shed some
 light on this for me?

 cheers,
 Jason








Re: What does ReadRepair exactly do?

2012-10-24 Thread Hiller, Dean
Keep in mind, returning the older version is usually fine.  Just imagine
your user clicked "write" 1 ms before; then the new version might be
returned.  If he gets the older version and refreshes the page, he gets
the newer version.  Same with an automated program as well… in general it
is okay to get the older or newer value.  If you are reading 2 rows,
however, instead of one, that may change.

Dean

On 10/23/12 7:04 PM, shankarpnsn shankarp...@gmail.com wrote:

manuzhang wrote
 why repair again? We block until the consistency constraint is met. Then
 the latest version is returned and repair is done asynchronously if any
 mismatch. We may retry read if fewer columns than required are returned.

Just to make sure I understand you correctly: consider the case when a
read repair is in flight and a subsequent write affects one or more of the
replicas that were scheduled to receive the repair mutations. In this
case, are you saying that we return the older version to the user rather
than the latest version that was affected by the write?






Re: Hinted Handoff runs every ten minutes

2012-10-24 Thread Brandon Williams
On Sun, Oct 21, 2012 at 6:44 PM, aaron morton aa...@thelastpickle.com wrote:
 I *think* this may be ghost rows which have not being compacted.

You would be correct in the case of 1.0.8:
https://issues.apache.org/jira/browse/CASSANDRA-3955

-Brandon


Re: Hinted Handoff runs every ten minutes

2012-10-24 Thread Tamar Fraenkel
Is there a workaround other than upgrading?
Thanks,
Tamar Fraenkel
Senior Software Engineer, TOK Media


ta...@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Wed, Oct 24, 2012 at 1:56 PM, Brandon Williams dri...@gmail.com wrote:

 On Sun, Oct 21, 2012 at 6:44 PM, aaron morton aa...@thelastpickle.com
 wrote:
  I *think* this may be ghost rows which have not being compacted.

 You would be correct in the case of 1.0.8:
 https://issues.apache.org/jira/browse/CASSANDRA-3955

 -Brandon


Re: What does ReadRepair exactly do?

2012-10-24 Thread shankarpnsn
Hiller, Dean wrote
 in general it is okay to get the older or newer value.  If you are reading
 2 rows however instead of one, that may change.

This is certainly interesting, as it could mean that the user could see a
value that never met the required consistency. For instance with 3 replicas
R1,R2,R3 and a quorum consistency, assume that R1 is initiating a read
(becomes the coordinator) - notices a conflict with R2 (assume R1 has a more
recent value) and initiates a read repair with its value. Meanwhile R2 and
R3 have seen two different writes with newer values than what was computed
by the read repair. If R1 were to respond back to the user with the value
that was computed at the time of read repair, wouldn't it be a value that
never met the consistency constraint? I was thinking if this should trigger
another round of repair that tries to reach the consistency constraint with
a newer value or time-out, which is the expected case when you don't meet
the required consistency. Please let me know if I'm missing something here. 





Re: Strange row expiration behavior

2012-10-24 Thread Stephen Mullins
That worked perfectly, inserting another row after the first compaction,
then flushing and compacting again triggered the empty rows to be removed.
Thanks for your help and for clarifying the gcBefore point Aaron.

Stephen

On Tue, Oct 23, 2012 at 4:47 PM, aaron morton aa...@thelastpickle.com wrote:

 In the first example, I am running compaction at step 7 through nodetool,

 Sorry missed that.


1. insert a couple rows with ttl=5 (again, just a small number)

 ExpiringColumn's are only purged if their TTL has expired AND their
 absolute (node local) expiry time occurred before the current gcBefore
 time.
 This may have explained why the columns were not purged in the first
 compaction.

 Can you try your first steps again. And then for the second set of steps
 add a new row, flush, compact. The expired rows should be removed.
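
 e.g. (a minimal sketch; the keyspace and column family names are placeholders):

   # write any new row first, so the next flush produces another SSTable
   nodetool -h localhost flush MyKeyspace MyCF
   nodetool -h localhost compact MyKeyspace MyCF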

 I don't have to manually delete empty rows after the columns expire. .

 Rows are automatically purged when all columns are purged.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 24/10/2012, at 3:05 AM, Stephen Mullins smull...@thebrighttag.com
 wrote:

 Thanks Aaron, my reply is inline below:

 On Tue, Oct 23, 2012 at 2:38 AM, aaron morton aa...@thelastpickle.com wrote:

 Performing these steps results in the rows still being present using
 cassandra-cli list.

 I assume you are saying the row key is listed without any columns. aka a
 ghost row.

 Correct.


  What gets really odd is if I add these steps it works

 That's working as designed.

 gc_grace_seconds does not specify when tombstones must be purged, rather
 it specifies the minimum duration the tombstone must be stored. It's really
 saying if you compact this column X seconds after the delete you can purge
 the tombstone.

 Minor / automatic compaction will kick in if there are (by default) 4
 SSTables of the same size, and will only purge tombstones if all fragments
 of the row exist in the SSTables being compacted.

 Major / manual compaction compacts all the sstables, and so purges the
 tombstones IF gc_grace_seconds has expired.

 In your first example compaction had not run so the tombstones stayed on
 disk. In the second the major compaction purged expired tombstones.

 In the first example, I am running compaction at step 7 through nodetool,
 after gc_grace_seconds has expired. Additionally, if I do not perform the
 manual delete of the row in the second example, the ghost rows are not
 cleaned up. I want to know that in our production environment, I don't have
 to manually delete empty rows after the columns expire. But I can't get an
 example working to that effect.


 Hope that helps.

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 23/10/2012, at 2:49 PM, Stephen Mullins smull...@thebrighttag.com
 wrote:

 Hello, I'm seeing Cassandra behavior that I can't explain, on v1.0.12.
 I'm trying to test removing rows after all columns have expired. I've read
 the following:
 http://wiki.apache.org/cassandra/DistributedDeletes
 http://wiki.apache.org/cassandra/MemtableSSTable
 https://issues.apache.org/jira/browse/CASSANDRA-2795

 And came up with a test to demonstrate the empty row removal that does
 the following:

1. create a keyspace
   2. create a column family with gc_grace_seconds=10 (arbitrary small number)
3. insert a couple rows with ttl=5 (again, just a small number)
4. use nodetool to flush the column family
5. sleep 10 seconds
   6. ensure the columns are removed with cassandra-cli list
7. use nodetool to compact the keyspace

 Performing these steps results in the rows still being present using
 cassandra-cli list. What gets really odd is if I add these steps, it works:

1. sleep 5 seconds
   2. use cassandra-cli to del mycf[arow]
3. use nodetool to flush the column family
4. use nodetool to compact the keyspace

 I don't understand why the first set of steps (1-7) doesn't work to remove
 the empty row, nor do I understand why the explicit row delete somehow
 makes this work. I have all this in a script that I could attach if that's
 appropriate. Is there something wrong with the steps I have?
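
 For reference, a minimal sketch of the kind of script I mean (the keyspace,
 CF and row names are placeholders, and the cli defaults are assumed):

   cassandra-cli -h localhost <<EOF
   create keyspace TestKS;
   use TestKS;
   create column family TestCF with comparator = UTF8Type
     and key_validation_class = UTF8Type
     and default_validation_class = UTF8Type
     and gc_grace = 10;
   set TestCF['arow']['col'] = 'val' with ttl = 5;
   EOF
   nodetool -h localhost flush TestKS TestCF
   sleep 10
   cassandra-cli -h localhost -k TestKS <<EOF
   list TestCF;
   EOF
   nodetool -h localhost compact TestKS TestCF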

 Thanks,
 Stephen







Re: What does ReadRepair exactly do?

2012-10-24 Thread Hiller, Dean
The user will meet the required consistency unless you encounter some kind
of bug in cassandra.  You will either get the older value or the newer
value. If you read quorum, and maybe a write CL=1 just happened, you may
get the older or new value depending on if the node that received the
write was involved.  If you read at quorum and you wrote at CL=QUORUM, then you
may get the newer value or the older value depending on who gets there
first, so to speak.

In your scenario, if the read repair read from R2 just before the write is
applied, you get the old value.  If it read from R2 just after the write
was applied, it gets the new value.  BOTH of these met the consistency
constraint.  A better example to clear this up may be the following...  If
you read a value at CL=QUORUM, and you have a write 20ms later, you get
the old value, right?  And it met the consistency level, right?  NOW, what
about if the write is 1ms later?  What if the write is .1ms later?
It still met the consistency level, right?  If it is .1ms before, you
get the new value as it repairs first with the new node.

It is just when programming, your read may get the newer value or older
value and generally if you write the code in a way that works, this
concept works out great in most cases (in some cases, you need to think a
bit differently and solve it other ways).

I hope that clears it up

Later,
Dean

On 10/24/12 8:02 AM, shankarpnsn shankarp...@gmail.com wrote:

Hiller, Dean wrote
 in general it is okay to get the older or newer value.  If you are reading
 2 rows however instead of one, that may change.

This is certainly interesting, as it could mean that the user could see a
value that never met the required consistency. For instance with 3 replicas
R1,R2,R3 and a quorum consistency, assume that R1 is initiating a read
(becomes the coordinator) - notices a conflict with R2 (assume R1 has a more
recent value) and initiates a read repair with its value. Meanwhile R2 and
R3 have seen two different writes with newer values than what was computed
by the read repair. If R1 were to respond back to the user with the value
that was computed at the time of read repair, wouldn't it be a value that
never met the consistency constraint? I was thinking if this should trigger
another round of repair that tries to reach the consistency constraint with
a newer value or time-out, which is the expected case when you don't meet
the required consistency. Please let me know if I'm missing something here.






Re: What does ReadRepair exactly do?

2012-10-24 Thread Manu Zhang
And we don't send the read request to all three replicas (R1, R2, R3) if
CL=QUORUM; just 2 of them, depending on proximity

On Wed, Oct 24, 2012 at 10:20 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

 The user will meet the required consistency unless you encounter some kind
 of bug in cassandra.  You will either get the older value or the newer
 value. If you read quorum, and maybe a write CL=1 just happened, you may
 get the older or new value depending on if the node that received the
 write was involved.  If you read at quorum and you wrote at CL=QUORUM, then you
 may get the newer value or the older value depending on who gets there
 first, so to speak.

 In your scenario, if the read repair read from R2 just before the write is
 applied, you get the old value.  If it read from R2 just after the write
 was applied, it gets the new value.  BOTH of these met the consistency
 constraint.  A better example to clear this up may be the following...  If
 you read a value at CL=QUORUM, and you have a write 20ms later, you get
 the old value, right?  And it met the consistency level, right?  NOW, what
 about if the write is 1ms later?  What if the write is .1ms later?
 It still met the consistency level, right?  If it is .1ms before, you
 get the new value as it repairs first with the new node.

 It is just when programming, your read may get the newer value or older
 value and generally if you write the code in a way that works, this
 concept works out great in most cases(in some cases, you need to think a
 bit differently and solve it other ways).

 I hope that clears it up

 Later,
 Dean

 On 10/24/12 8:02 AM, shankarpnsn shankarp...@gmail.com wrote:

 Hiller, Dean wrote
   in general it is okay to get the older or newer value.  If you are reading
   2 rows however instead of one, that may change.
  
  This is certainly interesting, as it could mean that the user could see a
  value that never met the required consistency. For instance with 3 replicas
  R1,R2,R3 and a quorum consistency, assume that R1 is initiating a read
  (becomes the coordinator) - notices a conflict with R2 (assume R1 has a more
  recent value) and initiates a read repair with its value. Meanwhile R2 and
  R3 have seen two different writes with newer values than what was computed
  by the read repair. If R1 were to respond back to the user with the value
  that was computed at the time of read repair, wouldn't it be a value that
  never met the consistency constraint? I was thinking if this should trigger
  another round of repair that tries to reach the consistency constraint with
  a newer value or time-out, which is the expected case when you don't meet
  the required consistency. Please let me know if I'm missing something here.
 
 
 




Re: What does ReadRepair exactly do?

2012-10-24 Thread Hiller, Dean
I guess one more thing is I completely ignore your second write, mainly because
I assume it comes after we already read. So let's say your current state is

node1 = val1 node2 = val1 node3 = val1

You do a write at quorum of val=2 which is IN the middle!!!

node1 = val1 node2 = val2 node3 = val1  (NOTICE the write is not complete yet)

If you read from node1 and node3, you get val1.  If you read from node1 and 
node2, you get val2 as a read repair will happen.

Ie. You always get the older value or newer value.

If you have two writes come in like so

node1 = val1 node2 = val2 and node3= val3

Well, I think you can figure it out when you do a read ;).  If your read quorum
reads from node1 and node3, you get val3, etc. etc.

This is basically how it works… If your scenario is a web page, a user simply
hits the refresh button and sees the values changing.
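
You can watch this from cassandra-cli too (a hedged sketch; the CF and key
names are made up and assumed to exist already):

  consistencylevel as QUORUM;
  set MyCF['arow']['col'] = 'val2';
  get MyCF['arow'];
  # at QUORUM the read blocks until 2 of the 3 replicas answer, so repeated
  # gets return either the old value or the new one, never an un-written value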

Later,
Dean

From: Manu Zhang owenzhang1...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, October 24, 2012 8:26 AM
To: user@cassandra.apache.org
Subject: Re: What does ReadRepair exactly do?

And we don't send read request to all of the three replicas (R1, R2, R3) if 
CL=QUORUM; just 2 of them depending on proximity

On Wed, Oct 24, 2012 at 10:20 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
The user will meet the required consistency unless you encounter some kind
of bug in cassandra.  You will either get the older value or the newer
value. If you read quorum, and maybe a write CL=1 just happened, you may
get the older or new value depending on if the node that received the
write was involved.  If you read at quorum and you wrote at CL=QUORUM, then you
may get the newer value or the older value depending on who gets there
first, so to speak.

In your scenario, if the read repair read from R2 just before the write is
applied, you get the old value.  If it read from R2 just after the write
was applied, it gets the new value.  BOTH of these met the consistency
constraint.  A better example to clear this up may be the following...  If
you read a value at CL=QUORUM, and you have a write 20ms later, you get
the old value, right?  And it met the consistency level, right?  NOW, what
about if the write is 1ms later?  What if the write is .1ms later?
It still met the consistency level, right?  If it is .1ms before, you
get the new value as it repairs first with the new node.

It is just when programming, your read may get the newer value or older
value and generally if you write the code in a way that works, this
concept works out great in most cases(in some cases, you need to think a
bit differently and solve it other ways).

I hope that clears it up

Later,
Dean

On 10/24/12 8:02 AM, shankarpnsn shankarp...@gmail.com wrote:

Hiller, Dean wrote
 in general it is okay to get the older or newer value.  If you are reading
 2 rows however instead of one, that may change.

This is certainly interesting, as it could mean that the user could see a
value that never met the required consistency. For instance with 3 replicas
R1,R2,R3 and a quorum consistency, assume that R1 is initiating a read
(becomes the coordinator) - notices a conflict with R2 (assume R1 has a more
recent value) and initiates a read repair with its value. Meanwhile R2 and
R3 have seen two different writes with newer values than what was computed
by the read repair. If R1 were to respond back to the user with the value
that was computed at the time of read repair, wouldn't it be a value that
never met the consistency constraint? I was thinking if this should trigger
another round of repair that tries to reach the consistency constraint with
a newer value or time-out, which is the expected case when you don't meet
the required consistency. Please let me know if I'm missing something here.







Re: What does ReadRepair exactly do?

2012-10-24 Thread Manu Zhang
oh, it would clarify a lot if you go and read the source code; the method is
o.a.c.service.StorageProxy.fetchRows if I remember it correctly

On Wed, Oct 24, 2012 at 10:26 PM, Manu Zhang owenzhang1...@gmail.com wrote:

 And we don't send read request to all of the three replicas (R1, R2, R3)
  if CL=QUORUM; just 2 of them depending on proximity


 On Wed, Oct 24, 2012 at 10:20 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

 The user will meet the required consistency unless you encounter some kind
 of bug in cassandra.  You will either get the older value or the newer
 value. If you read quorum, and maybe a write CL=1 just happened, you may
 get the older or new value depending on if the node that received the
  write was involved.  If you read at quorum and you wrote at CL=QUORUM, then you
  may get the newer value or the older value depending on who gets there
  first, so to speak.

 In your scenario, if the read repair read from R2 just before the write is
 applied, you get the old value.  If it read from R2 just after the write
 was applied, it gets the new value.  BOTH of these met the consistency
 constraint.  A better example to clear this up may be the following...  If
  you read a value at CL=QUORUM, and you have a write 20ms later, you get
 the old value, right?  And it met the consistency level, right?  NOW, what
  about if the write is 1ms later?  What if the write is .1ms later?
 It still met the consistency level, right?  If it is .1ms before, you
 get the new value as it repairs first with the new node.

 It is just when programming, your read may get the newer value or older
 value and generally if you write the code in a way that works, this
 concept works out great in most cases(in some cases, you need to think a
 bit differently and solve it other ways).

 I hope that clears it up

 Later,
 Dean

 On 10/24/12 8:02 AM, shankarpnsn shankarp...@gmail.com wrote:

 Hiller, Dean wrote
   in general it is okay to get the older or newer value.  If you are reading
   2 rows however instead of one, that may change.
  
  This is certainly interesting, as it could mean that the user could see a
  value that never met the required consistency. For instance with 3 replicas
  R1,R2,R3 and a quorum consistency, assume that R1 is initiating a read
  (becomes the coordinator) - notices a conflict with R2 (assume R1 has a more
  recent value) and initiates a read repair with its value. Meanwhile R2 and
  R3 have seen two different writes with newer values than what was computed
  by the read repair. If R1 were to respond back to the user with the value
  that was computed at the time of read repair, wouldn't it be a value that
  never met the consistency constraint? I was thinking if this should trigger
  another round of repair that tries to reach the consistency constraint with
  a newer value or time-out, which is the expected case when you don't meet
  the required consistency. Please let me know if I'm missing something here.
 
 
 





Re: What does ReadRepair exactly do?

2012-10-24 Thread shankarpnsn
Hiller, Dean wrote
 I guess one more thing is I completely ignore your second write, mainly
 because I assume it comes after we already read. So let's say your
 current state is
 
 node1 = val1 node2 = val1 node3 = val1
 
 You do a write at quorum of val=2 which is IN the middle!!!
 
 node1 = val1 node2 = val2 node3 = val1  (NOTICE the write is not complete
 yet)
 
 If you read from node1 and node3, you get val1.  If you read from node1
 and node2, you get val2 as a read repair will happen.
 
 Ie. You always get the older value or newer value.
 
 If you have two writes come in like so
 
 node1 = val1 node2 = val2 and node3= val3
 
 Well, I think you can figure it out when you do a read ;).  If your read
 quorum reads from node1 and node3, you get val3, etc. etc.
 
 This is basically how it works… If your scenario is a web page, a user
 simply hits the refresh button and sees the values changing.
 
 Later,
 Dean

Thanks for the example Dean. This definitely clears things up when you have
an overlap between the read and the write, and one comes after the other.
I'm still missing how read repairs behave. Just extending your example for
the following case:

1. node1 = val1 node2 = val1 node3 = val1

2. You do a write operation (W1) at quorum of val=2
node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)

3. Now with a read (R1) from node1 and node2, a read repair will be
initiated that needs to write val2 on node 1.  
node1 = val1; node2 = val2; node3 = val1  (read repair val2 is not complete
yet)

4. Say, in the meantime, node 1 receives a write of val4; read repair for R1
now arrives at node 1 but sees a newer value, val4.
node1 = val4; node2 = val2; node3 = val1  (write val4 is not complete, read
repair val2 not complete)

In this case, for read R1, the value val2 does not have a quorum. Would read
R1 return val2 or val4?


Zhang, Manu wrote
 And we don't send read request to all of the three replicas (R1, R2, R3)
 if CL=QUORUM; just 2 of them depending on proximity

Thanks Zhang. But this again seems a strange thing to do, since one
(say R2) of the 2 close replicas (say R1, R2) might be down, resulting in a
read failure while there are still enough replicas (R1 and R3) live to
satisfy the read.





Re: What does ReadRepair exactly do?

2012-10-24 Thread Hiller, Dean
Thanks Zhang. But, this again seems a little strange thing to do, since one
(say R2) of the 2 close replicas (say R1,R2) might be down, resulting in a
read failure while there are still enough number of replicas (R1 and R3)
live to satisfy a read.


He means in the case where all 3 nodes are live… if a node is down,
naturally it redirects to the other node and still succeeds because it
found 2 nodes even with one node down (feel free to test this live though!)


Thanks for the example Dean. This definitely clears things up when you have
an overlap between the read and the write, and one comes after the other.
I'm still missing how read repairs behave. Just extending your example for
the following case:

1. node1 = val1 node2 = val1 node3 = val1

2. You do a write operation (W1) at quorum of val=2
node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)

3. Now with a read (R1) from node1 and node2, a read repair will be
initiated that needs to write val2 on node 1.
node1 = val1; node2 = val2; node3 = val1  (read repair val2 is not complete
yet)

4. Say, in the meantime, node 1 receives a write of val4; read repair for R1
now arrives at node 1 but sees a newer value, val4.
node1 = val4; node2 = val2; node3 = val1  (write val4 is not complete, read
repair val2 not complete)

In this case, for read R1, the value val2 does not have a quorum. Would read
R1 return val2 or val4?

 
At this point, as Manu suggests, you need to look at the code, but most
likely what happens is they lock that row, receive the write in memory (i.e.
not losing it) and return to the client, caching it so that as soon as read-repair
is over, it will write that next value.  I.e. your client would receive
val2, and val4 would be the value in the database right after you received
val2.  I.e. when a client interacts with cassandra and you have tons of
writes to a row, val1, val2, val3, val4 in a short time period, just like
a normal database, your client may get one of those 4 values depending on
where the read gets inserted in the order of the writes… same as a normal
RDBMS.  The only thing you don't have is the atomic nature with other rows.

NOTICE: they would not have to cache val4 very long, and if a newer write
came in, they would just replace it with that newer val and cache that one
instead, so it would not be a queue… but this is all just a guess… read the
code if you really want to know.



Zhang, Manu wrote
 And we don't send read request to all of the three replicas (R1, R2, R3)
 if CL=QUORUM; just 2 of them depending on proximity








get_paged_slice with SlicePredicate

2012-10-24 Thread Scott Fines
Hello all,

I'm playing around with the get_paged_slice thrift call, and I noticed
that it was always returning everything in the row; there's no mechanism
for specifying a SlicePredicate. Was that intentional? If so, is there a
different way that I can limit what I get back? I'd like to page over many
rows, but only have data that is contained in a SlicePredicate returned.

Thanks for your help,

Scott Fines



Re: Java 7 support?

2012-10-24 Thread Voodoo
And if you want a competitive edge, use it, tune it, take full advantage of the 
better version (7), and DON'T share. See the problem with not assigning this as
a first-class task for the ASF team?

Sent from my iPad

On Oct 23, 2012, at 11:12 PM, Eric Evans eev...@acunu.com wrote:

 On Tue, Oct 16, 2012 at 7:54 PM, Rob Coli rc...@palominodb.com wrote:
 On Tue, Oct 16, 2012 at 4:45 PM, Edward Sargisson
 edward.sargis...@globalrelay.net wrote:
 The Datastax documentation says that Java 7 is not recommended[1]. However,
 Java 6 is due to EOL in Feb 2013 so what is the reasoning behind that
 comment?
 
 I've asked this approximate question here a few times, with no
 official response. The reason I ask is that in addition to Java 7 not
 being recommended, in Java 7 OpenJDK becomes the reference JVM, and
 OpenJDK is also not recommended.
 
 From other channels, I have conjectured that the current advice on
 Java 7 is it 'works' but is not as extensively tested (and definitely
 not as commonly deployed) as Java 6.
 
 That sounds about right.  The best way to change the status quo would
 be to use Java 7, report any bugs you find, and share your
 experiences.
 
 -- 
 Eric Evans
 Acunu | http://www.acunu.com | @acunu


Java 7 support?

2012-10-24 Thread Edward Capriolo
We have been using cassandra and java7 for months. No problems. A key
concept of java is portable binaries. There are sometimes wrinkles with
upgrades. If you hit one, undo the upgrade and restart.

On Tuesday, October 23, 2012, Eric Evans eev...@acunu.com wrote:
 On Tue, Oct 16, 2012 at 7:54 PM, Rob Coli rc...@palominodb.com wrote:
 On Tue, Oct 16, 2012 at 4:45 PM, Edward Sargisson
 edward.sargis...@globalrelay.net wrote:
 The Datastax documentation says that Java 7 is not recommended[1].
However,
 Java 6 is due to EOL in Feb 2013 so what is the reasoning behind that
 comment?

 I've asked this approximate question here a few times, with no
 official response. The reason I ask is that in addition to Java 7 not
 being recommended, in Java 7 OpenJDK becomes the reference JVM, and
 OpenJDK is also not recommended.

 From other channels, I have conjectured that the current advice on
 Java 7 is it 'works' but is not as extensively tested (and definitely
 not as commonly deployed) as Java 6.

 That sounds about right.  The best way to change the status quo would
 be to use Java 7, report any bugs you find, and share your
 experiences.

 --
 Eric Evans
 Acunu | http://www.acunu.com | @acunu



Re: constant CMS GC using CPU time

2012-10-24 Thread Rob Coli
On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com wrote:
 The nodes with the most data used the most memory.  All nodes are affected
 eventually not just one.  The GC was on-going even when the nodes were not
 compacting or running a heavy application load -- even when the main app was
 paused constant the GC continued.

This sounds very much like my heap is so consumed by (mostly) bloom
filters that I am in steady state GC thrash.

Do you have heap graphs which show a healthy sawtooth GC cycle which
then more or less flatlines?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Hinted Handoff storage inflation

2012-10-24 Thread Mattias Larsson

I'm testing various scenarios in a multi data center configuration. The setup 
is 10 Cassandra 1.1.5 nodes configured into two data centers, 5 nodes in each 
DC (RF DC1:3,DC2:3, write consistency LOCAL_QUORUM). I have a synthetic random 
data generator that I can run, and each run adds roughly 1GiB of data to each 
node per run,

DC   Rack  Status  State   Load        Effective-Ownership
DC1  RAC1  Up      Normal  1010.71 MB  60.00%
DC2  RAC1  Up      Normal  1009.08 MB  60.00%
DC1  RAC1  Up      Normal  1.01 GB     60.00%
DC2  RAC1  Up      Normal  1 GB        60.00%
DC1  RAC1  Up      Normal  1.01 GB     60.00%
DC2  RAC1  Up      Normal  1014.45 MB  60.00%
DC1  RAC1  Up      Normal  1.01 GB     60.00%
DC2  RAC1  Up      Normal  1.01 GB     60.00%
DC1  RAC1  Up      Normal  1.01 GB     60.00%
DC2  RAC1  Up      Normal  1.01 GB     60.00%

Now, if I kill all the nodes in DC2, and run the data generator again, I would 
expect roughly 2GiB to be added to each node in DC1 (local replicas + hints to 
other data center), instead I get this:

DC   Rack  Status  State   Load        Effective-Ownership
DC1  RAC1  Up      Normal  17.56 GB    60.00%
DC2  RAC1  Down    Normal  1009.08 MB  60.00%
DC1  RAC1  Up      Normal  17.47 GB    60.00%
DC2  RAC1  Down    Normal  1 GB        60.00%
DC1  RAC1  Up      Normal  17.22 GB    60.00%
DC2  RAC1  Down    Normal  1014.45 MB  60.00%
DC1  RAC1  Up      Normal  16.94 GB    60.00%
DC2  RAC1  Down    Normal  1.01 GB     60.00%
DC1  RAC1  Up      Normal  17.26 GB    60.00%
DC2  RAC1  Down    Normal  1.01 GB     60.00%

Checking the sstables on a node reveals this,

-bash-3.2$ du -hs HintsColumnFamily/
16G HintsColumnFamily/
-bash-3.2$

So it seems that what I would have expected to be 1GiB of hints is much larger 
in reality, a 15x-16x inflation. This has a huge impact on write performance as 
well.

If I bring DC2 up again, eventually the load will drop down and even out to 
2GiB across the entire cluster.

I'm wondering if this inflation is intended or if it is possibly a bug or 
something I'm doing wrong? Assuming this inflation is correct, what is the best 
way to deal with temporary connectivity issues with a second data center? Write 
performance is paramount in my use case. A 2x-3x overhead is doable, but not 
15x-16x.

Thanks,
/dml




Re: constant CMS GC using CPU time

2012-10-24 Thread Bryan Talbot
On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli rc...@palominodb.com wrote:

 On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com
 wrote:
  The nodes with the most data used the most memory.  All nodes are
 affected
  eventually not just one.  The GC was on-going even when the nodes were
 not
  compacting or running a heavy application load -- even when the main app
 was
  paused constant the GC continued.

 This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.


Yes, I think that was at least part of the issue.




 Do you have heap graphs which show a healthy sawtooth GC cycle which
 then more or less flatlines?



I didn't save any graphs but that is what they would look like.  I was
using jstat to monitor gc activity.
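
For anyone curious, an invocation along these lines (the pid is a placeholder):

  jstat -gcutil <cassandra-pid> 5000
  # healthy: O (old gen %) sawtooths down after each CMS cycle
  # thrashing: O pinned near 100 while FGC climbs every sample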

-Bryan


Re: compression

2012-10-24 Thread aaron morton
Can you try restarting the node? That would reload the CF metadata and reset
the compression settings.
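
A hedged sketch (assumes the Debian/Ubuntu packaging; adjust for your init
system):

  nodetool -h localhost drain   # flush memtables and stop accepting writes first
  sudo service cassandra restart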

Sorry that's not very helpful but it's all I can think of for now. 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 11:41 PM, Tamar Fraenkel ta...@tok-media.com wrote:

 Hi!
 I tried again, I see the scrub action in cassandra logs
  INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,108 
 CompactionManager.java (line 476) Scrubbing 
 SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-339-Data.db')
  INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,184 
 CompactionManager.java (line 658) Scrub of 
 SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-339-Data.db') 
 complete: 54 rows in new sstable and 0 empty (tombstoned) rows dropped
  INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,185 
 CompactionManager.java (line 476) Scrubbing 
 SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-340-Data.db')
  INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,914 
 CompactionManager.java (line 658) Scrub of 
 SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-340-Data.db') 
 complete: 7037 rows in new sstable and 0 empty (tombstoned) rows dropped
 
 I don't see any CompressionInfo.db files and compression ratio is still 0.0 
 on this node only, on other nodes it is almost 0.5...
 
 Any idea?
 
 Thanks,
 
 Tamar Fraenkel 
 Senior Software Engineer, TOK Media 
 
 
 ta...@tok-media.com
 Tel:   +972 2 6409736 
 Mob:  +972 54 8356490 
 Fax:   +972 2 5612956 
 
 
 
 
 
 On Wed, Sep 26, 2012 at 3:40 AM, aaron morton aa...@thelastpickle.com wrote:
 Check the logs on  nodes 2 and 3 to see if the scrub started. The logs on 1 
 will be a good help with that. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 24/09/2012, at 10:31 PM, Tamar Fraenkel ta...@tok-media.com wrote:
 
 Hi!
 I ran 
 UPDATE COLUMN FAMILY cf_name WITH 
 compression_options={sstable_compression:SnappyCompressor, 
 chunk_length_kb:64};
 
 I then ran on all my nodes (3)
 sudo nodetool -h localhost scrub tok cf_name
 
 I have replication factor 3. The size of the data on disk was cut in half in 
 the first node, and in JMX I can see that indeed the compression ratio
 is 0.46. But on nodes 2 and 3 nothing happened. In the jmx I can see that 
 compression ratio is 0 and the size of the files of disk stayed the same.
 
 In cli 
 
 ColumnFamily: cf_name
   Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
   Default column value validator: 
 org.apache.cassandra.db.marshal.UTF8Type
   Columns sorted by: 
 org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
   Row cache size / save period in seconds / keys to save : 0.0/0/all
   Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider
   Key cache size / save period in seconds: 20.0/14400
   GC grace seconds: 864000
   Compaction min/max thresholds: 4/32
   Read repair chance: 1.0
   Replicate on write: true
   Bloom Filter FP chance: default
   Built indexes: []
   Compaction Strategy: 
 org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
   Compression Options:
 chunk_length_kb: 64
 sstable_compression: 
 org.apache.cassandra.io.compress.SnappyCompressor
 
 Can anyone help?
 Thanks
 
 Tamar Fraenkel 
 Senior Software Engineer, TOK Media 
 
 
 
 ta...@tok-media.com
 Tel:   +972 2 6409736 
 Mob:  +972 54 8356490 
 Fax:   +972 2 5612956 
 
 
 
 
 
 On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel ta...@tok-media.com wrote:
 Thanks all, that helps. Will start with one - two CFs and let you know the 
 effect
 
 
 Tamar Fraenkel 
 Senior Software Engineer, TOK Media 
 
 
 
 ta...@tok-media.com
 Tel:   +972 2 6409736 
 Mob:  +972 54 8356490 
 Fax:   +972 2 5612956 
 
 
 
 
 
 On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
  As well, your unlimited column names may all have the same prefix, right?
  Like accounts.rowkey56, accounts.rowkey78, etc., so the accounts prefix
  gets a ton of compression then.
 
 Later,
 Dean
 
  From: Tyler Hobbs ty...@datastax.com
  Reply-To: user@cassandra.apache.org
  Date: Sunday, September 23, 2012 11:46 AM
  To: user@cassandra.apache.org
 Subject: Re: compression
 
  column metadata, you're still likely to get a reasonable amount of 
 compression.  This is especially true if there is some amount of repetition 
 in the column names, values, or TTLs in wide rows.  Compression will almost 
 always be beneficial unless you're already somehow 

Re: Hinted Handoff runs every ten minutes

2012-10-24 Thread aaron morton
Thanks. 
I thought it had been addressed before but couldn't find the ticket.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 12:56 AM, Brandon Williams dri...@gmail.com wrote:

 On Sun, Oct 21, 2012 at 6:44 PM, aaron morton aa...@thelastpickle.com wrote:
 I *think* this may be ghost rows which have not being compacted.
 
 You would be correct in the case of 1.0.8:
 https://issues.apache.org/jira/browse/CASSANDRA-3955
 
 -Brandon



Re: Hinted Handoff storage inflation

2012-10-24 Thread aaron morton
Hints store the columns, row key, KS name and CF id(s) for each mutation to
each node, whereas an executed mutation will store the most recent columns
collated with others under the same row key. So depending on the type of
mutation, hints will take up more space.

The worst case would be lots of overwrites. After that, writing a small amount
of data to many rows would result in a lot of the serialised space being 
devoted to row keys, KS name and CF id.

16GB is a lot though. What was the write workload like?
You can get an estimate on the number of keys in the Hints CF using nodetool 
cfstats. Also some metrics in the JMX will tell you how many hints are stored. 
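
e.g. (illustrative; the exact cfstats layout varies by version):

  nodetool -h localhost cfstats | grep -A 3 'HintsColumnFamily'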

 This has a huge impact on write performance as well.
Yup. Hints are added to the same Mutation thread pool as normal mutations. They 
are processed async to the mutation request but they still take resources to 
store. 

You can adjust how long hints are collected for with max_hint_window_in_ms in the
yaml file. 
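
For example (a sketch; the value shown is only an example, check your yaml for
the shipped default):

  # cassandra.yaml
  max_hint_window_in_ms: 10800000   # stop collecting hints for a node down longer than this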

How long did the test run for?


Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 11:26 AM, Mattias Larsson mlars...@yahoo-inc.com wrote:

 
 I'm testing various scenarios in a multi data center configuration. The setup 
 is 10 Cassandra 1.1.5 nodes configured into two data centers, 5 nodes in each 
 DC (RF DC1:3,DC2:3, write consistency LOCAL_QUORUM). I have a synthetic 
 random data generator that I can run, and each run adds roughly 1GiB of data 
 to each node per run,
 
 DC   Rack  Status  State   Load        Effective-Ownership
 DC1  RAC1  Up      Normal  1010.71 MB  60.00%
 DC2  RAC1  Up      Normal  1009.08 MB  60.00%
 DC1  RAC1  Up      Normal  1.01 GB     60.00%
 DC2  RAC1  Up      Normal  1 GB        60.00%
 DC1  RAC1  Up      Normal  1.01 GB     60.00%
 DC2  RAC1  Up      Normal  1014.45 MB  60.00%
 DC1  RAC1  Up      Normal  1.01 GB     60.00%
 DC2  RAC1  Up      Normal  1.01 GB     60.00%
 DC1  RAC1  Up      Normal  1.01 GB     60.00%
 DC2  RAC1  Up      Normal  1.01 GB     60.00%
 
 Now, if I kill all the nodes in DC2, and run the data generator again, I 
 would expect roughly 2GiB to be added to each node in DC1 (local replicas + 
 hints to other data center), instead I get this:
 
 DC   Rack  Status  State   Load        Effective-Ownership
 DC1  RAC1  Up      Normal  17.56 GB    60.00%
 DC2  RAC1  Down    Normal  1009.08 MB  60.00%
 DC1  RAC1  Up      Normal  17.47 GB    60.00%
 DC2  RAC1  Down    Normal  1 GB        60.00%
 DC1  RAC1  Up      Normal  17.22 GB    60.00%
 DC2  RAC1  Down    Normal  1014.45 MB  60.00%
 DC1  RAC1  Up      Normal  16.94 GB    60.00%
 DC2  RAC1  Down    Normal  1.01 GB     60.00%
 DC1  RAC1  Up      Normal  17.26 GB    60.00%
 DC2  RAC1  Down    Normal  1.01 GB     60.00%
 
 Checking the sstables on a node reveals this,
 
 -bash-3.2$ du -hs HintsColumnFamily/
 16G   HintsColumnFamily/
 -bash-3.2$
 
 So it seems that what I would have expected to be 1GiB of hints is much 
 larger in reality, a 15x-16x inflation. This has a huge impact on write 
 performance as well.
 
 If I bring DC2 up again, eventually the load will drop down and even out to 
 2GiB across the entire cluster.
 
 I'm wondering if this inflation is intended or if it is possibly a bug or 
 something I'm doing wrong? Assuming this inflation is correct, what is the 
 best way to deal with temporary connectivity issues with a second data 
 center? Write performance is paramount in my use case. A 2x-3x overhead is 
 doable, but not 15x-16x.
 
 Thanks,
 /dml
 
 



Re: Java 7 support?

2012-10-24 Thread Andrey V. Panov
Are you using openJDK or Oracle JDK? I know java7 should be based on
openJDK since 7, but still not sure.

On 25 October 2012 05:42, Edward Capriolo edlinuxg...@gmail.com wrote:

 We have been using cassandra and java7 for months. No problems. A key
 concept of java is portable binaries. There are sometimes wrinkles with
 upgrades. If you hit one undo the upgrade and restart.



Re: Java 7 support?

2012-10-24 Thread Peter Schuller
FWIW, we're using openjdk7 on most of our clusters. For those where we
are still on openjdk6, it's not because of an issue - just haven't
gotten to rolling out the upgrade yet.

We haven't had any issues that I recall with upgrading the JDK.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)