Quick drive-by observation:

> Did not get replies from all endpoints.. Check the
> logs on the repair participants for further details

> dropping message of type HINT_REQ due to error
> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The
> channel this output stream was writing to has been closed

> Caused by: io.netty.channel.unix.Errors$NativeIoException:
> writeAddress(..) failed: Connection timed out

> java.lang.RuntimeException: Did not get replies from all endpoints.

These all point to the same shaped problem: for whatever reason, the coordinator of this repair didn't receive replies from the replicas executing it. Could be that they're dead, could be they took too long, could be they never got the start message, etc. Distributed operations are tricky like that.

Logs on the replicas doing the actual repairs should give you more insight; this is a pretty low level generic set of errors that basically amounts to "we didn't hear back from the other participants in time so we timed out."

On Fri, Aug 4, 2023, at 12:02 PM, Surbhi Gupta wrote:
> Can you please try to do nodetool describecluster from every node of the
> cluster?
>
> One time I noticed issue when nodetool status shows all nodes UN but
> describecluster was not.
>
> Thanks
> Surbhi
>
> On Fri, Aug 4, 2023 at 8:59 AM Joe Obernberger <joseph.obernber...@gmail.com>
> wrote:
>> Hi All - been using reaper to do repairs, but it has hung. I tried to run:
>> nodetool repair -pr
>> on each of the nodes, but they all fail with some form of this error:
>>
>> error: Repair job has failed with the error message: Repair command #521
>> failed with error Did not get replies from all endpoints.. Check the
>> logs on the repair participants for further details
>> -- StackTrace --
>> java.lang.RuntimeException: Repair job has failed with the error
>> message: Repair command #521 failed with error Did not get replies from
>> all endpoints..
>> Check the logs on the repair participants for further details
>>     at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
>>     at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>>     at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633)
>>     at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555)
>>     at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474)
>>     at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108)
>>     at java.base/java.lang.Thread.run(Thread.java:829)
>>
>> Using version 4.1.2-1
>> nodetool status
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address         Load        Tokens  Owns  Host ID                               Rack
>> UN  172.16.100.45   505.66 GiB  250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
>> UN  172.16.100.251  380.75 GiB  200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
>> UN  172.16.100.35   479.2 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
>> UN  172.16.100.252  248.69 GiB  200     ?     8f0d392f-0750-44e2-91a5-b30708ade8e4  rack1
>> UN  172.16.100.249  411.53 GiB  200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
>> UN  172.16.100.38   333.26 GiB  200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
>> UN  172.16.100.36   405.33 GiB  200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
>> UN  172.16.100.39   437.74 GiB  200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
>> UN  172.16.100.248  344.4 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
>> UN  172.16.100.44   409.36 GiB  200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
>> UN  172.16.100.37   236.08 GiB  120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
>> UN  172.16.20.16    975 GiB     500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
>> UN  172.16.100.34   340.77 GiB  200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
>> UN  172.16.100.42   974.86 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
>>
>> Note: Non-system keyspaces don't have the same replication settings,
>> effective ownership information is meaningless
>>
>> Debug log has:
>>
>> DEBUG [ScheduledTasks:1] 2023-08-04 11:56:04,955 MigrationCoordinator.java:264 - Pulling unreceived schema versions...
>> INFO [HintsDispatcher:11344] 2023-08-04 11:56:21,369 HintsDispatchExecutor.java:318 - Finished hinted handoff of file 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297-1690426370160-2.hints to endpoint /172.16.20.16:7000: 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297, partially
>> WARN [Messaging-OUT-/172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES] 2023-08-04 11:56:21,916 OutboundConnection.java:491 - /172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES-[no-channel] dropping message of type HINT_REQ due to error
>> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel this output stream was writing to has been closed
>>     at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200)
>>     at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
>>     at org.apache.cassandra.net.AsyncChannelOutputPlus.waitForSpace(AsyncChannelOutputPlus.java:140)
>>     at org.apache.cassandra.net.AsyncChannelOutputPlus.beginFlush(AsyncChannelOutputPlus.java:97)
>>     at org.apache.cassandra.net.AsyncMessageOutputPlus.doFlush(AsyncMessageOutputPlus.java:100)
>>     at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:122)
>>     at org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:139)
>>     at org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:77)
>>     at org.apache.cassandra.net.Message$Serializer.serializePost40(Message.java:844)
>>     at org.apache.cassandra.net.Message$Serializer.serialize(Message.java:702)
>>     at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:984)
>>     at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:690)
>>     at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:958)
>>     at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
>>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>     at java.base/java.lang.Thread.run(Thread.java:829)
>> Caused by: io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) failed: Connection timed out
>> INFO [Messaging-EventLoop-3-16] 2023-08-04 11:56:21,918 OutboundConnection.java:1153 - /172.16.100.34:7000(/172.16.100.34:59198)->/172.16.20.16:7000-LARGE_MESSAGES-2fc2c5b9 successfully connected, version = 12, framing = CRC, encryption = unencrypted
>> ERROR [Repair-Task:437] 2023-08-04 11:56:28,592 RepairRunnable.java:160 - Repair 30675c00-32df-11ee-a7d8-05183c68b0d0 failed:
>> java.lang.RuntimeException: Did not get replies from all endpoints.
>>     at org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:721)
>>     at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:654)
>>     at org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:400)
>>     at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:279)
>>     at org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:248)
>>     at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
>>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>>     at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
>>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>     at java.base/java.lang.Thread.run(Thread.java:829)
>> INFO [Repair-Task:437] 2023-08-04 11:56:28,594 RepairRunnable.java:223 - [repair #30675c00-32df-11ee-a7d8-05183c68b0d0]Repair command #522 finished with error
>>
>> What to do?
>> Thanks!
>>
>> -Joe
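
One thing that may be worth ruling out, given the "Connection timed out" on /172.16.100.34:7000->/172.16.20.16:7000 in the debug log quoted above: whether each node can open a TCP connection to the other nodes' storage port. Below is a minimal Python sketch, assuming port 7000 as it appears in that log. The helper names `can_connect` and `unreachable` are made up for illustration and are not part of any Cassandra tooling. Note that the same log also shows a later successful reconnect, so a single passing check would not rule out intermittent timeouts.

```python
import socket


def can_connect(host: str, port: int = 7000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Port 7000 is an assumption taken from the addresses in the quoted log.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable(hosts, port: int = 7000, timeout: float = 3.0):
    """Return the subset of `hosts` that could not be reached on `port`."""
    return [h for h in hosts if not can_connect(h, port, timeout)]


# Example usage (run from each node in turn, with the addresses from nodetool status):
#   print(unreachable(["172.16.20.16", "172.16.100.42", "172.16.100.34"]))
```

Running this from every node, rather than from one, matters here for the same reason as the describecluster suggestion: a node can look fine from one peer and be unreachable from another.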