Re: electricity outage problem
Hi,

we did a full restart of the cluster, but nodetool status still gives inconsistent info from different nodes: some nodes appear UP from one node but DOWN from another, and, as said, the log still shows the message "received an invalid gossip generation for peer /x.x.x.x". The Cassandra version is 2.1.2. We want to execute the purge operation as explained here:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_gossip_purge.html
but we can't find the peers folder. Should we do it via CQL by deleting the content of the peers table? Should we do it on all nodes?

thanks

2016-01-12 17:42 GMT+01:00 Jack Krupansky:
> [earlier messages snipped]
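[A note on the missing folder: in 2.1 the saved gossip state is kept in the system keyspace's peers table, and on disk that table's directory carries a UUID suffix (system/peers-<uuid>), which is likely why a plain "peers" folder is not found. The following is only a hedged sketch of the purge under that assumption, with the default packaging path; verify against your data_file_directories before use.]

```shell
#!/bin/sh
# Sketch only -- run on one node at a time, with Cassandra STOPPED on that
# node, and bring the seed nodes back up first, as the linked page says.

purge_saved_gossip() {
  # $1 = data directory (data_file_directories in cassandra.yaml),
  #      e.g. /var/lib/cassandra/data
  for d in "$1"/system/peers-*; do   # in 2.1 the directory name carries a
    [ -d "$d" ] || continue          # table-UUID suffix, hence no plain 'peers'
    echo "clearing saved gossip state under $d"
    rm -rf "$d"/*                    # SSTables of the system.peers table
  done
}
# Usage (with the node stopped): purge_saved_gossip /var/lib/cassandra/data
```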
Re: electricity outage problem
Nodes need about a 60-90 second delay before they can start accepting connections as a seed node. A seed node also needs time to accept a node starting up and syncing to the other nodes (on 10 gigabit the max is only 1 or 2 new nodes; on 1 gigabit it can handle at least 3-4 new nodes connecting). In a large cluster (500 nodes) I see this weird condition where nodetool status shows overlapping subsets of nodes, and the problem does not go away even after an hour on a 10 gigabit network.

*“Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming ‘Wow! What a Ride!’” - Hunter Thompson*
Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Fri, Jan 15, 2016 at 9:17 AM, Adil wrote:
> [earlier messages snipped]
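[To see why different nodes show different subsets as UP, it can help to compare the generation each node advertises. `nodetool gossipinfo` prints one block per endpoint containing a `generation:` line; the exact output format below is assumed from memory of 2.1 and should be verified on your build.]

```shell
# Hedged diagnostic sketch: list "<endpoint> <generation>" pairs as seen
# from the node this is run on. Assumes gossipinfo blocks look like:
#   /10.0.0.3
#     generation:1452612345
generations() {
  nodetool gossipinfo | awk '
    /^\//         { ep = $1 }                        # endpoint header line
    /generation:/ { split($0, a, ":"); print ep, a[2] }
  '
}
# Run on every node and diff the outputs; a node that disagrees with the
# others about a generation is the one rejecting gossip from that peer.
```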
Re: electricity outage problem
Our case is not about accepting connections: some nodes receive a gossip generation number greater than the local one. I looked at the peers and local tables and can't find where the local one is stored.

2016-01-15 17:54 GMT+01:00 daemeon reiydelle:
> [earlier messages snipped]
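[If memory of the 2.1 schema serves, the node's own generation sits in system.local (gossip_generation column), not in system.peers, which would explain not finding it there. Rather than deleting rows from system tables by hand, an alternative often suggested is to restart each node with its saved ring state ignored, so gossip is rebuilt from the seeds. A hedged sketch; the flag is a standard Cassandra startup option, but verify against your version's docs:]

```shell
# 1. Inspect the locally stored generation (read-only, on a live node):
#      cqlsh -e "SELECT gossip_generation FROM system.local;"
#
# 2. To have a node discard its saved ring/gossip state on restart and
#    relearn it from the seed nodes, add to conf/cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
```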
Re: electricity outage problem
This happens when there is insufficient time for nodes coming up to join a network. It takes a few seconds for a node to come up, e.g. your seed node. If you tell a node to join a cluster, you can get this scenario because of high network utilization as well. I wait 90 seconds after the first (i.e. my first seed) node comes up before starting the next one. Any nodes that are seeds need some 60 seconds, so the additional 30 seconds is a buffer. Additional nodes each wait 60 seconds before joining (although this is a parallel tree for large clusters).

On Tue, Jan 12, 2016 at 6:56 AM, Adil wrote:
> [original question snipped]
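[The staggered startup described above can be sketched as a script. The host IPs and the start command are hypothetical placeholders, not taken from the thread; substitute your own.]

```shell
#!/bin/sh
# Bring seeds up first, one at a time, with settling delays between starts.
SEED_NODES="10.0.0.1 10.0.0.2"               # placeholder seed IPs
OTHER_NODES="10.0.0.3 10.0.0.4 10.0.0.5"     # placeholder non-seed IPs
START_CMD="${START_CMD:-echo would start}"   # e.g. a wrapper that runs
                                             # 'service cassandra start' via ssh
SEED_DELAY="${SEED_DELAY:-90}"   # seeds need ~60s; the extra 30s is a buffer
NODE_DELAY="${NODE_DELAY:-60}"   # each further node waits 60s before joining

staggered_start() {
  for h in $SEED_NODES; do
    $START_CMD "$h"
    sleep "$SEED_DELAY"
  done
  for h in $OTHER_NODES; do
    $START_CMD "$h"
    sleep "$NODE_DELAY"
  done
}
# Invoke with: staggered_start
```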
electricity outage problem
Hi,

we have two DCs with 5 nodes in each cluster. Yesterday there was an electricity outage that took all nodes down. We restarted the clusters, but when we run nodetool status on DC1 it reports that some nodes are DN; the strange thing is that running the command from a different node in DC1 doesn't report the same nodes as down. We have noticed this message in the log: "received an invalid gossip generation for peer". Does anyone know how to resolve this problem? Should we purge the gossip?

thanks

Adil
Re: electricity outage problem
Sometimes you may have to clear out the saved Gossip state:

https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_gossip_purge.html

Note the instruction about bringing up the seed nodes first. Normally seed nodes are only relevant when initially joining a node to a cluster (and then the Gossip state is persisted locally), but if you clear the persisted Gossip state, the seed nodes will again be needed to find the rest of the cluster.

I'm not sure whether a power outage is the same as stopping and restarting an instance (AWS) in terms of whether the restarted instance retains its current public IP address.

-- Jack Krupansky

On Tue, Jan 12, 2016 at 10:02 AM, daemeon reiydelle wrote:
> [earlier message snipped]