Re: Rebuild to a new DC fails every time
Thanks for the tips, Alain.

The cluster is entirely healthy. However, the connection between the DCs is a VPN managed by a third party, so it may well be flaky. Even so, I would expect the rebuild job to recover from connection timeout/reset errors without manual intervention.

In the end we opted for restore from snapshot + repair to bring up the node in the new DC. We'll see how that goes.

Regards,

Martin

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
Re: Rebuild to a new DC fails every time
Hello Martin.

Did you solve your issue?

I would say this exception could indeed be due to 'streaming_socket_timeout_in_ms'. Make sure the value is large enough, or upgrade to a newer version that implements the keep-alive, which is an interesting thing to try. That said, if you are in the middle of adding a DC, it might not be the best moment for an upgrade. It is clear to me that using a keep-alive is better here, so if an upgrade is a good fit for you it could definitely help.

Another cause I can think of is a network issue of some kind: a flaky cross-DC connection, or a node going down, either outright or just bouncing because of GC or some other reason. I believe events of this kind are not yet handled well by the streaming process.

Is the cluster healthy overall? Do you have pending or dropped messages of some kind, GC pressure, log warnings and errors, or any other trouble?

Let us know how it goes :).

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
Re: Rebuild to a new DC fails every time
None of the files is listed more than once in the logs:

java.lang.RuntimeException: Transfer of file /fs3/cassandra/data//event_group-3b5782d08e4411e6842917253f111990/mc-116042-big-Data.db already completed or aborted (perhaps session failed?).
java.lang.RuntimeException: Transfer of file /fs0/cassandra/data//event_group-3b5782d08e4411e6842917253f111990/mc-111370-big-Data.db already completed or aborted (perhaps session failed?).
java.lang.RuntimeException: Transfer of file /fs3/cassandra/data//event_alert-13d78e3f11e6a6cbe1698349da4d/mc-8659-big-Data.db already completed or aborted (perhaps session failed?).
java.lang.RuntimeException: Transfer of file /fs4/cassandra/data//event_alert-13d78e3f11e6a6cbe1698349da4d/mc-9133-big-Data.db already completed or aborted (perhaps session failed?).
java.lang.RuntimeException: Transfer of file /fs2/cassandra/data//event_alert-13d78e3f11e6a6cbe1698349da4d/mc-3997-big-Data.db already completed or aborted (perhaps session failed?).
java.lang.RuntimeException: Transfer of file /fs1/cassandra/data///event_group-3b5782d08e4411e6842917253f111990/mc-152979-big-Data.db already completed or aborted (perhaps session failed?).
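For anyone who wants to repeat this check on their own logs, here is a minimal sketch in Python. It assumes the error text has exactly the "Transfer of file ... already completed or aborted" shape quoted in this thread; the function names and the `SAMPLE` string are illustrative and not part of any Cassandra tooling.

```python
import re
from collections import Counter

# Captures the sstable path from the sender-side error quoted in this thread.
# \s+ also matches a line break, so a message wrapped after "file" still matches.
TRANSFER_ERROR = re.compile(
    r"Transfer of file\s+(\S+)\s+already completed or aborted"
)


def count_failed_transfers(log_text):
    """Map each sstable path to the number of times it appears in the error."""
    return Counter(TRANSFER_ERROR.findall(log_text))


def repeated_files(log_text):
    """Return only the paths that appear in the error more than once."""
    counts = count_failed_transfers(log_text)
    return {path: n for path, n in counts.items() if n > 1}


# Two of the exceptions posted in this thread, used as sample input.
SAMPLE = """\
java.lang.RuntimeException: Transfer of file /fs3/cassandra/data//event_alert-13d78e3f11e6a6cbe1698349da4d/mc-8659-big-Data.db already completed or aborted (perhaps session failed?).
java.lang.RuntimeException: Transfer of file /fs4/cassandra/data//event_alert-13d78e3f11e6a6cbe1698349da4d/mc-9133-big-Data.db already completed or aborted (perhaps session failed?).
"""

if __name__ == "__main__":
    # An empty dict means no file was reported more than once.
    print(repeated_files(SAMPLE))
```

To run it against a real log, read the file into a string and pass it to `repeated_files`. An empty result is consistent with each file failing only once, which points away from the same file being streamed twice.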
Re: Rebuild to a new DC fails every time
If you're on 3.9 it's likely unrelated, as streaming_socket_timeout_in_ms is 48 hours. It appears rebuild is trying to stream the same file twice. Are there other exceptions in the logs related to the file, or can you find out whether it has previously been sent by the same session? Search the logs for the file that failed and post back any exceptions.
Re: Rebuild to a new DC fails every time
Is this something that can be resolved by CASSANDRA-11841?

Thanks,

Martin

On Thu, Dec 21, 2017 at 3:02 PM, Martin Mačura wrote:
> Hi all,
> we are trying to add a new datacenter to the existing cluster, but the
> 'nodetool rebuild' command always fails after a couple of hours.
>
> We're on Cassandra 3.9.
>
> Example 1:
>
> 172.24.16.169 INFO [STREAM-IN-/172.25.16.125:55735] 2017-12-13 23:55:38,840 StreamResultFuture.java:174 - [Stream #b8faf130-e092-11e7-bab5-0d4fb7c90e72 ID#0] Prepare completed. Receiving 0 files(0.000KiB), sending 9844 files(885.587GiB)
> 172.25.16.125 INFO [STREAM-IN-/172.24.16.169:7000] 2017-12-13 23:55:38,858 StreamResultFuture.java:174 - [Stream #b8faf130-e092-11e7-bab5-0d4fb7c90e72 ID#0] Prepare completed. Receiving 9844 files(885.587GiB), sending 0 files(0.000KiB)
>
> 172.24.16.169 ERROR [STREAM-IN-/172.25.16.125:55735] 2017-12-14 04:28:09,064 StreamSession.java:533 - [Stream #b8faf130-e092-11e7-bab5-0d4fb7c90e72] Streaming error occurred on session with peer 172.25.16.125
> 172.24.16.169 java.io.IOException: Connection reset by peer
>
> 172.24.16.169 ERROR [STREAM-OUT-/172.25.16.125:49412] 2017-12-14 07:26:26,832 StreamSession.java:533 - [Stream #b8faf130-e092-11e7-bab5-0d4fb7c90e72] Streaming error occurred on session with peer 172.25.16.125
> 172.24.16.169 java.lang.RuntimeException: Transfer of file -13d78e3f11e6a6cbe1698349da4d/mc-8659-big-Data.db already completed or aborted (perhaps session failed?).
> 172.25.16.125 ERROR [STREAM-OUT-/172.24.16.169:7000] 2017-12-14 07:26:50,004 StreamSession.java:533 - [Stream #b8faf130-e092-11e7-bab5-0d4fb7c90e72] Streaming error occurred on session with peer 172.24.16.169
> 172.25.16.125 java.io.IOException: Connection reset by peer
>
> Example 2:
>
> 172.24.16.169 INFO [STREAM-IN-/172.25.16.125:35202] 2017-12-18 03:24:31,423 StreamResultFuture.java:174 - [Stream #95d36300-e3d4-11e7-a90b-2b89506ad2af ID#0] Prepare completed. Receiving 0 files(0.000KiB), sending 12312 files(895.973GiB)
> 172.25.16.125 INFO [STREAM-IN-/172.24.16.169:7000] 2017-12-18 03:24:31,441 StreamResultFuture.java:174 - [Stream #95d36300-e3d4-11e7-a90b-2b89506ad2af ID#0] Prepare completed. Receiving 12312 files(895.973GiB), sending 0 files(0.000KiB)
>
> 172.24.16.169 ERROR [STREAM-IN-/172.25.16.125:35202] 2017-12-18 06:39:42,049 StreamSession.java:533 - [Stream #95d36300-e3d4-11e7-a90b-2b89506ad2af] Streaming error occurred on session with peer 172.25.16.125
> 172.24.16.169 java.io.IOException: Connection reset by peer
>
> 172.24.16.169 ERROR [STREAM-OUT-/172.25.16.125:42744] 2017-12-18 09:25:36,188 StreamSession.java:533 - [Stream #95d36300-e3d4-11e7-a90b-2b89506ad2af] Streaming error occurred on session with peer 172.25.16.125
> 172.24.16.169 java.lang.RuntimeException: Transfer of file -3b5782d08e4411e6842917253f111990/mc-152979-big-Data.db already completed or aborted (perhaps session failed?).
> 172.25.16.125 ERROR [STREAM-OUT-/172.24.16.169:7000] 2017-12-18 09:25:59,447 StreamSession.java:533 - [Stream #95d36300-e3d4-11e7-a90b-2b89506ad2af] Streaming error occurred on session with peer 172.24.16.169
> 172.25.16.125 java.io.IOException: Connection timed out
>
> Datacenter: PRIMARY
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  172.24.16.169  918.31 GiB  256     100.0%            bc4a980b-cca6-4ca2-b32f-f8206d48e14c  RAC1
> UN  172.24.16.170  908.76 GiB  256     100.0%            37b2742e-c83a-4341-896f-09d244810e69  RAC1
> UN  172.24.16.171  908.44 GiB  256     100.0%            6dc2b9d8-75dd-48f8-858c-53b1af42e8fb  RAC1
> Datacenter: SECONDARY
> =
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  172.25.16.125  27.48 GiB   256     100.0%            1e1669eb-cfd2-4718-a073-558946a8c947  RAC2
> UN  172.25.16.124  28.24 GiB   256     100.0%            896d9894-10c8-4269-9476-5ddab3c8abe9  RAC2
>
> Any ideas?
>
> Thanks,
>
> Martin
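The timestamps in the two examples show each session running for several hours before the first connection reset, and several more before the sender reports the transfer error. A minimal sketch for working out those intervals in Python, assuming the timestamp layout seen in the quoted log lines; the `elapsed` helper and the `EXAMPLES` table are illustrative names, not part of any Cassandra tooling.

```python
from datetime import datetime

# Timestamp layout used in the quoted log lines, e.g. "2017-12-13 23:55:38,840".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed(start, end):
    """Return the timedelta between two log timestamps given as strings."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return t1 - t0


# (prepare completed, first connection reset, transfer error on the sender),
# copied from the 172.24.16.169 lines of each example in the original post.
EXAMPLES = {
    "Example 1": (
        "2017-12-13 23:55:38,840",
        "2017-12-14 04:28:09,064",
        "2017-12-14 07:26:26,832",
    ),
    "Example 2": (
        "2017-12-18 03:24:31,423",
        "2017-12-18 06:39:42,049",
        "2017-12-18 09:25:36,188",
    ),
}

if __name__ == "__main__":
    for name, (prepared, reset, transfer_error) in EXAMPLES.items():
        print(name)
        print("  prepare -> first reset:       ", elapsed(prepared, reset))
        print("  first reset -> transfer error:", elapsed(reset, transfer_error))
```

In both runs the first reset arrives three to four and a half hours after the prepare step, and the transfer error follows nearly three hours later.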