Hello,

Sorry again. We found yet another weird thing in this: if we stop nodes with systemctl or a plain kill (SIGTERM), the problem occurs, but if we kill -9 (SIGKILL), it does not.
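A generic demonstration (not Cassandra-specific) of why the stop method could matter: a TERM handler runs, but KILL bypasses handlers entirely. The working assumption here, which would need confirming against the Cassandra version in use, is that Cassandra's JVM shutdown hook runs on SIGTERM (systemctl stop / plain kill) and announces the shutdown to peers via gossip, whereas kill -9 gives peers no such announcement and they must detect the failure themselves.

```shell
# Case 1: SIGTERM -- the trap fires, mimicking a clean shutdown.
sh -c 'trap "echo clean-shutdown; exit 0" TERM; sleep 60 >/dev/null & wait' &
pid=$!
sleep 1
kill -TERM "$pid"
wait "$pid"                      # prints "clean-shutdown", exit status 0

# Case 2: SIGKILL -- uncatchable, the trap never fires.
sh -c 'trap "echo clean-shutdown; exit 0" TERM; sleep 60 >/dev/null & wait' &
pid=$!
sleep 1
kill -KILL "$pid"
wait "$pid"
echo "exit status after SIGKILL: $?"    # 137 = 128 + signal 9
```

If the gossip-announcement assumption holds, it would explain why the two stop methods leave the surviving node with different views of the cluster.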
Thanks,
Hiro

On Wed, Apr 24, 2019 at 11:31 PM Hiroyuki Yamada <mogwa...@gmail.com> wrote:
> Sorry, I didn't write the version and the configurations.
> I've tested with C* 3.11.4, and the configurations are mostly set to
> default except for the replication factor and listen_address for proper
> networking.
>
> Thanks,
> Hiro
>
> On Wed, Apr 24, 2019 at 5:12 PM Hiroyuki Yamada <mogwa...@gmail.com> wrote:
>> Hello Ben,
>>
>> Thank you for the quick reply.
>> I haven't tried that case, but it doesn't recover even if I stop the
>> stress.
>>
>> Thanks,
>> Hiro
>>
>> On Wed, Apr 24, 2019 at 3:36 PM Ben Slater <ben.sla...@instaclustr.com> wrote:
>>> Is it possible that stress is overloading node 1 so it's not recovering
>>> state properly when node 2 comes up? Have you tried running with a lower
>>> load (say 2 or 3 threads)?
>>>
>>> Cheers
>>> Ben
>>>
>>> ---
>>> Ben Slater
>>> Chief Product Officer, Instaclustr
>>>
>>> On Wed, 24 Apr 2019 at 16:28, Hiroyuki Yamada <mogwa...@gmail.com> wrote:
>>>> Hello,
>>>>
>>>> I faced a weird issue when recovering a cluster after two nodes are
>>>> stopped.
>>>> It is easily reproducible and looks like a bug or an issue to fix,
>>>> so let me write down the steps to reproduce.
>>>>
>>>> === STEPS TO REPRODUCE ===
>>>> * Create a 3-node cluster with RF=3
>>>>   - node1 (seed), node2, node3
>>>> * Start requests to the cluster with cassandra-stress (it continues
>>>>   until the end)
>>>>   - what we did: cassandra-stress mixed cl=QUORUM duration=10m
>>>>     -errors ignore -node node1,node2,node3 -rate threads\>=16
>>>>     threads\<=256
>>>> * Stop node3 normally (with systemctl stop)
>>>>   - the system is still available because a quorum of nodes is
>>>>     still available
>>>> * Stop node2 normally (with systemctl stop)
>>>>   - the system is NOT available after it's stopped
>>>>   - the client gets `UnavailableException: Not enough replicas
>>>>     available for query at consistency QUORUM`
>>>>   - the client gets errors right away (within a few ms)
>>>>   - so far this is all expected
>>>> * Wait for 1 minute
>>>> * Bring node2 back up
>>>>   - the issue happens here
>>>>   - the client gets `ReadTimeoutException` or `WriteTimeoutException`
>>>>     depending on whether the request is a read or a write, even after
>>>>     node2 is up
>>>>   - the client gets errors after about 5000 ms or 2000 ms, which are
>>>>     the request timeouts for write and read requests
>>>>   - what node1 reports with `nodetool status` and what node2 reports
>>>>     are not consistent (node2 thinks node1 is down)
>>>>   - it takes a very long time to recover from this state
>>>> === END STEPS TO REPRODUCE ===
>>>>
>>>> Is this supposed to happen?
>>>> If we don't start cassandra-stress, it's all fine.
>>>>
>>>> Some workarounds we found to recover the state are the following:
>>>> * Restarting node1; it recovers its state right after it's restarted
>>>> * Setting a lower value for dynamic_snitch_reset_interval_in_ms (to
>>>>   60000 or something)
>>>>
>>>> I don't think either of them is a really good solution.
>>>> Can anyone explain what is going on and what is the best way to
>>>> prevent it or recover from it?
>>>>
>>>> Thanks,
>>>> Hiro
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>>> For additional commands, e-mail: user-h...@cassandra.apache.org
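The availability arithmetic behind the quoted repro steps can be sketched with plain quorum math (quorum = floor(RF/2) + 1); this is pure arithmetic, not Cassandra's actual replica-placement or failure-detection logic:

```python
def quorum(rf: int) -> int:
    """Replicas required for a QUORUM read/write at replication factor rf."""
    return rf // 2 + 1

def available(live_replicas: int, rf: int) -> bool:
    """Whether a QUORUM request can be served with this many live replicas."""
    return live_replicas >= quorum(rf)

RF = 3
assert quorum(RF) == 2
assert available(3, RF)      # all three nodes up
assert available(2, RF)      # node3 stopped: quorum still met
assert not available(1, RF)  # node2 also stopped: UnavailableException
assert available(2, RF)      # node2 back: quorum should be met again,
                             # yet the thread reports timeouts instead
```

The fast UnavailableException (a few ms) is consistent with the coordinator rejecting requests up front because gossip says too few replicas are alive; the timeouts after node2 returns, together with the inconsistent `nodetool status` views, suggest the nodes' gossip states disagree about who is up, so requests are routed and then time out rather than failing fast.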