Quick drive-by observation:
> Did not get replies from all endpoints.. Check the 
> logs on the repair participants for further details

> dropping message of type HINT_REQ due to error
> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The
> channel this output stream was writing to has been closed

> Caused by: io.netty.channel.unix.Errors$NativeIoException:
> writeAddress(..) failed: Connection timed out

> java.lang.RuntimeException: Did not get replies from all endpoints.

These all point to the same kind of problem: for whatever reason, the 
coordinator of this repair didn't receive replies from the replicas executing 
it. It could be that they're dead, that they took too long, or that they 
never received the start message. Distributed operations are tricky like that.

Logs on the replicas doing the actual repairs should give you more insight; 
this is a fairly low-level, generic set of errors that basically amounts to "we 
didn't hear back from the other participants in time, so we timed out."
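
If it helps narrow down which participant to look at first, here is a rough Python sketch that tallies the "dropping message" warnings per destination address. It assumes the log text looks like the debug log excerpt quoted below (a `Messaging-OUT-/src:port->/dst:port-` thread name on each WARN entry); the function name `count_dropped_peers` is made up for illustration and is not part of Cassandra or nodetool.

```python
import re
from collections import Counter

# Matches the thread-name portion of an outbound messaging warning, e.g.
# [Messaging-OUT-/172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES]
OUTBOUND = re.compile(
    r"Messaging-OUT-/(?P<src>[\d.]+):\d+->/(?P<dst>[\d.]+):\d+-"
)


def count_dropped_peers(log_text):
    """Count 'dropping message' warnings per destination address.

    Hypothetical helper: assumes IPv4 peers and the log layout shown
    in this thread.
    """
    # Pasted logs are often hard-wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    counts = Counter()
    # Split on the WARN level marker so each chunk starts at one warning.
    for chunk in flat.split("WARN ")[1:]:
        if "dropping message of type" not in chunk:
            continue
        match = OUTBOUND.search(chunk)
        if match:
            counts[match.group("dst")] += 1
    return counts
```

Feeding it the contents of a node's debug.log and printing `count_dropped_peers(text).most_common()` would list the peers that node most often failed to reach; a peer that dominates the list is a reasonable place to start reading logs.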

On Fri, Aug 4, 2023, at 12:02 PM, Surbhi Gupta wrote:
> Can you please try to do nodetool describecluster from every node of the 
> cluster?
> 
> I once noticed an issue where nodetool status showed all nodes as UN, but 
> describecluster did not.
> 
> Thanks
> Surbhi
> 
> On Fri, Aug 4, 2023 at 8:59 AM Joe Obernberger <joseph.obernber...@gmail.com> 
> wrote:
>> Hi All - been using reaper to do repairs, but it has hung.  I tried to run:
>> nodetool repair -pr
>> on each of the nodes, but they all fail with some form of this error:
>> 
>> error: Repair job has failed with the error message: Repair command #521 
>> failed with error Did not get replies from all endpoints.. Check the 
>> logs on the repair participants for further details
>> -- StackTrace --
>> java.lang.RuntimeException: Repair job has failed with the error 
>> message: Repair command #521 failed with error Did not get replies from 
>> all endpoints.. Check the logs on the repair participants for further 
>> details
>>          at 
>> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
>>          at 
>> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>>          at 
>> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633)
>>          at 
>> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555)
>>          at 
>> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474)
>>          at 
>> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108)
>>          at java.base/java.lang.Thread.run(Thread.java:829)
>> 
>> Using version 4.1.2-1
>> nodetool status
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address         Load        Tokens  Owns  Host ID                               Rack
>> UN  172.16.100.45   505.66 GiB  250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
>> UN  172.16.100.251  380.75 GiB  200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
>> UN  172.16.100.35   479.2 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
>> UN  172.16.100.252  248.69 GiB  200     ?     8f0d392f-0750-44e2-91a5-b30708ade8e4  rack1
>> UN  172.16.100.249  411.53 GiB  200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
>> UN  172.16.100.38   333.26 GiB  200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
>> UN  172.16.100.36   405.33 GiB  200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
>> UN  172.16.100.39   437.74 GiB  200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
>> UN  172.16.100.248  344.4 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
>> UN  172.16.100.44   409.36 GiB  200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
>> UN  172.16.100.37   236.08 GiB  120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
>> UN  172.16.20.16    975 GiB     500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
>> UN  172.16.100.34   340.77 GiB  200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
>> UN  172.16.100.42   974.86 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
>> 
>> Note: Non-system keyspaces don't have the same replication settings, 
>> effective ownership information is meaningless
>> 
>> Debug log has:
>> 
>> 
>> DEBUG [ScheduledTasks:1] 2023-08-04 11:56:04,955 
>> MigrationCoordinator.java:264 - Pulling unreceived schema versions...
>> INFO  [HintsDispatcher:11344] 2023-08-04 11:56:21,369 
>> HintsDispatchExecutor.java:318 - Finished hinted handoff of file 
>> 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297-1690426370160-2.hints to endpoint 
>> /172.16.20.16:7000: 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297, partially
>> WARN 
>> [Messaging-OUT-/172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES] 
>> 2023-08-04 11:56:21,916 OutboundConnection.java:491 - 
>> /172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES-[no-channel] 
>> dropping message of type HINT_REQ due to error
>> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The 
>> channel this output stream was writing to has been closed
>>          at 
>> org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200)
>>          at 
>> org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
>>          at 
>> org.apache.cassandra.net.AsyncChannelOutputPlus.waitForSpace(AsyncChannelOutputPlus.java:140)
>>          at 
>> org.apache.cassandra.net.AsyncChannelOutputPlus.beginFlush(AsyncChannelOutputPlus.java:97)
>>          at 
>> org.apache.cassandra.net.AsyncMessageOutputPlus.doFlush(AsyncMessageOutputPlus.java:100)
>>          at 
>> org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:122)
>>          at 
>> org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:139)
>>          at 
>> org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:77)
>>          at 
>> org.apache.cassandra.net.Message$Serializer.serializePost40(Message.java:844)
>>          at 
>> org.apache.cassandra.net.Message$Serializer.serialize(Message.java:702)
>>          at 
>> org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:984)
>>          at 
>> org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:690)
>>          at 
>> org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:958)
>>          at 
>> org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
>>          at 
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>          at 
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>          at 
>> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>          at java.base/java.lang.Thread.run(Thread.java:829)
>> Caused by: io.netty.channel.unix.Errors$NativeIoException: 
>> writeAddress(..) failed: Connection timed out
>> INFO  [Messaging-EventLoop-3-16] 2023-08-04 11:56:21,918 
>> OutboundConnection.java:1153 - 
>> /172.16.100.34:7000(/172.16.100.34:59198)->/172.16.20.16:7000-LARGE_MESSAGES-2fc2c5b9
>>  
>> successfully connected, version = 12, framing = CRC, encryption = 
>> unencrypted
>> ERROR [Repair-Task:437] 2023-08-04 11:56:28,592 RepairRunnable.java:160 
>> - Repair 30675c00-32df-11ee-a7d8-05183c68b0d0 failed:
>> java.lang.RuntimeException: Did not get replies from all endpoints.
>>          at 
>> org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:721)
>>          at 
>> org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:654)
>>          at 
>> org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:400)
>>          at 
>> org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:279)
>>          at 
>> org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:248)
>>          at 
>> org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
>>          at 
>> org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>>          at 
>> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>>          at 
>> org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
>>          at 
>> org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>>          at 
>> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>>          at 
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>          at 
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>          at 
>> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>          at java.base/java.lang.Thread.run(Thread.java:829)
>> INFO  [Repair-Task:437] 2023-08-04 11:56:28,594 RepairRunnable.java:223 
>> - [repair #30675c00-32df-11ee-a7d8-05183c68b0d0]Repair command #522 
>> finished with error
>> 
>> What to do?
>> Thanks!
>> 
>> -Joe
