[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node2-zookeeper.log.gz
node1-zookeeper.log.gz
zoo.cfg

 zookeeper fails to start - broken snapshot?
 ---

 Key: ZOOKEEPER-713
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-713
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.2.2
 Environment: debian lenny; ia64; xen virtualization
Reporter: Lukasz Osipiuk
 Attachments: node1-zookeeper.log.gz, node2-zookeeper.log.gz, zoo.cfg


 Hi guys,
 The following is not a bug report but rather a question - but as I am 
 attaching large files, I am posting it here rather than on the mailing list.
 Today we had a major failure in our production environment. The machines in 
 the zookeeper cluster went wild and all clients got disconnected.
 We tried to restart the whole zookeeper cluster, but it got stuck in the 
 leader election phase.
 Calling the stat command on any machine in the cluster resulted in a 
 'ZooKeeperServer not running' message.
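 For reference, stat is ZooKeeper's four-letter-word status probe sent to 
 the client port. A typical invocation - the host name here is illustrative, 
 and the port is whatever clientPort in the attached zoo.cfg says:
 
 echo stat | nc node1 2181
 
 A healthy server answers with its version, mode, and connection statistics.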
 In one of the logs I noticed an 'Invalid snapshot' message, which disturbed 
 me a bit.
 We did not manage to make the cluster work again with its data. We deleted 
 all version-2 directories on all nodes, and then the cluster started up 
 without problems.
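 For the record, the wipe-and-restart amounts to the following on each node. 
 The init script name and dataDir path are guesses, not our exact commands:
 
 /etc/init.d/zookeeper stop
 rm -rf /var/lib/zookeeper/version-2   # dataDir as configured in zoo.cfg
 /etc/init.d/zookeeper start
 
 This discards all snapshots and transaction logs, so it only made sense 
 because our data is rebuildable.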
 Is it possible that the snapshot/log data got corrupted in a way which made 
 the cluster unable to start?
 Fortunately we could rebuild the data we store in zookeeper, as we use it 
 only for locks and most of the nodes are ephemeral.
 I am attaching the contents of the version-2 directory from all nodes, 
 together with the server logs.
 The source problem occurred some time before 15:00. The first cluster 
 restart happened at 15:03.
 At some point later we experimented with deleting the version-2 directory, 
 so I would not read too much into the later restarts; they can be 
 misleading due to our actions.
 I am also attaching zoo.cfg. Maybe something is wrong there.
 As I now look into the logs, I see a read timeout during the initialization 
 phase after 20 seconds (initLimit=10, tickTime=2000).
 Maybe all I have to do is increase one or the other. Which one? Are there 
 any downsides to increasing tickTime?
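 The arithmetic behind that timeout: followers get initLimit * tickTime = 
 10 * 2000 ms = 20 s to connect to and sync with the leader, which matches 
 the 20-second read timeout in the logs. Raising initLimit widens only that 
 window, while raising tickTime also rescales everything derived from it, 
 such as the session timeout bounds. So the more targeted change in zoo.cfg 
 would be (value illustrative only):
 
 tickTime=2000   # unchanged
 initLimit=20    # doubles the init window to 40 s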
 Best regards, Łukasz Osipiuk

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node2-version-2.tgz-aa




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node1-version-2.tgz-ab




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: (was: node2-version-2.tgz-aa)




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node2-version-2.tgz-ab




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: (was: node2-version-2.tgz-ab)




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node2-version-2.tgz-ab




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node2-version-2.tgz-aa




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node2-version-2.tgz-ac




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node3-version-2.tgz-ab




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node3-version-2.tgz-aa




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Attachment: node3-version-2.tgz-ac




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

2010-03-18 Thread Lukasz Osipiuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-

Description: 
(unchanged except for the following postscript, appended at the end)

PS. Due to the attachment size limit I used split. To untar, use:
cat nodeX-version-2.tgz-* | tar -xz
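
A sketch of how the pieces fit together; the chunk size is a guess, and 
only the reassembly line above is exact:

# hypothetical: how each node's archive was likely produced
tar -cz version-2 | split -b 10M - node1-version-2.tgz-
# split's default suffixes give node1-version-2.tgz-aa, -ab, -ac, ...
# reassemble and extract:
cat node1-version-2.tgz-* | tar -xz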

