Re: Nodes go down periodically
"Is it only one node at a time that goes down, and at widely dispersed times?"

It is a two-node cluster, so both nodes consider the other node down at the same time. These are the times from the last few days:

INFO [GossipTasks:1] 2016-02-19 05:06:21,087 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-19 14:33:38,424 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 07:21:25,626 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 11:34:46,766 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 08:00:07,518 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 10:36:58,788 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 07:10:40,304 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 08:59:05,392 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 12:22:59,562 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
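One way to check whether the DOWN events listed in this message follow a schedule is to compute the gaps between them and count them by hour of day. The following is a minimal sketch, assuming the log line layout quoted in this thread; the function names are hypothetical and the sample is shortened to three of the quoted lines.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern taken from the Gossiper lines quoted in this thread (Cassandra 2.0.17).
# Assumption: other versions may format these lines differently.
DOWN_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) Gossiper\.java "
    r"\(line \d+\) InetAddress \S+ is now DOWN"
)

def parse_down_times(log_text):
    """Return a datetime for every 'is now DOWN' line found in log_text."""
    times = []
    for stamp, millis in DOWN_RE.findall(log_text):
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        times.append(ts.replace(microsecond=int(millis) * 1000))
    return times

def gaps_in_hours(times):
    """Hours elapsed between consecutive DOWN events."""
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]

def events_per_hour_of_day(times):
    """Count events by hour of day, to expose any daily pattern."""
    return Counter(t.hour for t in times)

SAMPLE = """\
INFO [GossipTasks:1] 2016-02-19 05:06:21,087 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-19 14:33:38,424 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 07:21:25,626 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN
"""

times = parse_down_times(SAMPLE)
print([round(g, 2) for g in gaps_in_hours(times)])
print(sorted(events_per_hour_of_day(times).items()))
```

If the events cluster around the same hours every day, a scheduled job (backup, compaction, cron task, hypervisor maintenance) becomes a more likely suspect than random network loss.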
Re: Nodes go down periodically
If you can, do a few short runs (maybe 10m records each; delete the default schema between executions) of the Cassandra stress test against your production cluster (replication=3, force quorum to 3). Look for max latency in the tens of SECONDS. If your devops team is running a monitoring tool that looks at the network, look for timeouts, retries, errors, lost packets, etc. during the run. Worst case, you need to do netstat runs against the relevant NIC, e.g. every 10 seconds on the CassStress node, and look for jumps in these counts. If monitoring is enabled, look at the monitor's results for ALL of your nodes. At least one is having some issues.

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872
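The "sample the NIC counters every 10 seconds and look for jumps" idea can be sketched as follows. This assumes a Linux host where the counters are read from /proc/net/dev rather than from netstat output; the function names and the sample text are hypothetical.

```python
def parse_proc_net_dev(text, iface):
    """Extract error and drop counters for one interface from /proc/net/dev text.

    Assumes the usual Linux layout: 8 receive fields followed by 8 transmit
    fields after the 'iface:' prefix.
    """
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            f = [int(x) for x in rest.split()]
            return {"rx_errs": f[2], "rx_drop": f[3],
                    "tx_errs": f[10], "tx_drop": f[11]}
    raise KeyError(iface)

def counter_jumps(samples, min_delta=1):
    """Given time-ordered (label, counters) samples, report counters that grew.

    Returns a list of (label, counter_name, delta) for every increase of at
    least min_delta between two consecutive samples.
    """
    jumps = []
    for (_, prev), (label, cur) in zip(samples, samples[1:]):
        for key in cur:
            delta = cur[key] - prev[key]
            if delta >= min_delta:
                jumps.append((label, key, delta))
    return jumps

# Hypothetical snapshot, for illustration only.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000 10 0 2 0 0 0 0 2000 20 1 0 0 0 0 0
"""

print(parse_proc_net_dev(SAMPLE, "eth0"))
```

In practice one would read /proc/net/dev every 10 seconds during the stress run, label each sample with its timestamp, and compare the reported jumps against the DOWN times in the Cassandra log.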
Re: Nodes go down periodically
The reality of modern distributed systems is that connectivity between nodes is never guaranteed and distributed software must be able to cope with occasional absence of connectivity. GC and network connectivity are the two issues that a lot of us are most familiar with. There may be others - but most technical problems on a node would be clearly logged on that node. If you see a lapse of connectivity no more than once or twice a day, consider yourselves lucky.

Is it only one node at a time that goes down, and at widely dispersed times?

How many nodes?

-- Jack Krupansky
Re: Nodes go down periodically
Hi,

Version is 2.0.17.
Yes, these are VMs in the cloud, though I'm fairly certain they are on a LAN rather than WAN. They are both in the same data centre physically. The phi_convict_threshold is set to default. I'd rather find the root cause of the problem than just hiding it by not convicting a node if it isn't responding though. If pings are <2 ms without a single ping missed in several days, I highly doubt that network is the reason for the downtime.

Best regards,
Joel
RE: Nodes go down periodically
You didn’t mention version, but I saw this kind of thing very often in the 1.1 line. Often this is connected to network flakiness. Are these VMs? In the cloud? Connected over a WAN? You mention that ping seems fine. Take a look at the phi_convict_threshold in cassandra.yaml. You may need to increase it to reduce the UP/DOWN flapping behavior.

Sean Durity
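To make the effect of phi_convict_threshold concrete, here is a simplified sketch of a phi-accrual style calculation. It assumes heartbeat gaps follow an exponential distribution with a known mean; it is not Cassandra's actual failure detector code, which estimates the mean from observed arrival intervals and has details this model ignores. The function names are hypothetical.

```python
import math

def phi(seconds_since_last_heartbeat, mean_interval_seconds):
    """Suspicion level under an exponential model of heartbeat gaps.

    P(still no heartbeat after t) = exp(-t / mean), and phi = -log10(P),
    which simplifies to t / (mean * ln 10).
    """
    return seconds_since_last_heartbeat / (mean_interval_seconds * math.log(10))

def silence_needed(threshold, mean_interval_seconds):
    """Seconds of silence after which phi first reaches the threshold."""
    return threshold * mean_interval_seconds * math.log(10)

# Gossip runs once per second and the default phi_convict_threshold is 8.
# The 12 below is only an example of a raised value.
for threshold in (8, 12):
    print(threshold, round(silence_needed(threshold, 1.0), 1))
```

Under these assumptions, raising the threshold only lengthens the silence tolerated before a node is marked DOWN. It does not explain why the heartbeats stopped arriving in the first place.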
Re: Nodes go down periodically
Hi,

Thanks for your reply.

I have debug logging on and see no GC pauses that are that long. GC pauses are all well below 1s and 99 times out of 100 below 100ms.

Do I need to enable GC log options to see the pauses?

I see plenty of these lines:
DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line 118) GC for ParNew: 24 ms for 1 collections

as well as a few CMS GC log lines.

Best regards,
Joel
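Rather than scanning the GCInspector lines by eye, the collections above a chosen duration can be pulled out of the log with a short script. This is a minimal sketch that assumes the line layout quoted in this message. The function name is hypothetical, and the second sample line is invented for illustration, not taken from the real log.

```python
import re

# Pattern based on the GCInspector line quoted in this thread.
GC_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) GCInspector\.java "
    r"\(line \d+\) GC for (\w+): (\d+) ms for (\d+) collections"
)

def slow_collections(log_text, min_ms):
    """Return (timestamp, collector, milliseconds) for GCs lasting >= min_ms."""
    hits = []
    for stamp, collector, millis, _count in GC_RE.findall(log_text):
        if int(millis) >= min_ms:
            hits.append((stamp, collector, int(millis)))
    return hits

SAMPLE = """\
DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line 118) GC for ParNew: 24 ms for 1 collections
DEBUG [ScheduledTasks:1] 2016-02-22 10:44:10,102 GCInspector.java (line 118) GC for ConcurrentMarkSweep: 1250 ms for 1 collections
"""  # the second line is hypothetical

print(slow_collections(SAMPLE, min_ms=1000))
```

Comparing the timestamps returned for a large min_ms against the DOWN times shows quickly whether any logged collection coincides with an outage.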
Re: Nodes go down periodically
Hi,

Those are probably GC pauses. Memory tuning is probably needed. Check whether the parameters that you have already customised make sense.

http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html

Hannu
Nodes go down periodically
Our nodes go down periodically, around 1-2 times each day. Downtime is from <1 second to 30 or so seconds.

INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /109.74.13.67 is now DOWN
INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java (line 978) InetAddress /109.74.13.67 is now UP

I find nothing odd in the logs around the same time. I logged a ping with timestamp and checked during the same time and saw nothing weird (ping is less than 2ms at all times).

Does anyone have any suggestions as to why this might happen?

Best regards,
Joel
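The DOWN and UP lines quoted above are about 23.4 seconds apart. To measure every outage in a log the same way, each DOWN line can be paired with the next UP line for the same address. This is a minimal sketch, assuming the Gossiper line layout shown in this message; the function name is hypothetical.

```python
import re
from datetime import datetime

# Pattern based on the two Gossiper lines quoted in this message.
STATE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) Gossiper\.java "
    r"\(line \d+\) InetAddress (\S+) is now (DOWN|UP)"
)

def outages(log_text):
    """Pair each DOWN with the next UP per address.

    Returns a list of (address, down_time, duration_in_seconds).
    """
    down_since = {}
    result = []
    for stamp, millis, addr, state in STATE_RE.findall(log_text):
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(microsecond=int(millis) * 1000)
        if state == "DOWN":
            # Keep the earliest DOWN if several are logged before an UP.
            down_since.setdefault(addr, ts)
        elif addr in down_since:
            start = down_since.pop(addr)
            result.append((addr, start, (ts - start).total_seconds()))
    return result

SAMPLE = """\
INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /109.74.13.67 is now DOWN
INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java (line 978) InetAddress /109.74.13.67 is now UP
"""

for addr, start, seconds in outages(SAMPLE):
    print(addr, start, seconds)
```

The start time and duration of each outage give the exact windows to look up in the timestamped ping log.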