[ 
https://issues.apache.org/jira/browse/IGNITE-19419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Khitrin updated IGNITE-19419:
------------------------------------
    Summary: Internal queue overload under intensive SQL load  (was: Corrupted 
LogStorage under intensive SQL load)

> Internal queue overload under intensive SQL load
> ------------------------------------------------
>
>                 Key: IGNITE-19419
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19419
>             Project: Ignite
>          Issue Type: Bug
>          Components: sql
>    Affects Versions: 3.0.0-beta1
>            Reporter: Andrey Khitrin
>            Priority: Major
>              Labels: ignite-3, storage-engine
>         Attachments: ai3.jstack, logs.zip
>
>
> I tried to run a SQL benchmark (slightly modified [Benchbase TPC-C 
> scenario|https://github.com/cmu-db/benchbase/wiki/TPC-C]) against Ignite3 
> (commit hash b48ddcba7cd2bd3b9a053ae131c25b44a0400e27).
> Ignite3 was running in a single-node mode with -Xmx8G -Xms8G memory limit 
> (increased memory size comparing to IGNITE-19412). After some time it has 
> become unavailable to the client. Log is filled with error messages like the 
> following:
>  
> {code:java}
> 2023-05-03 17:51:57:705 +0300 
> [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][NodeImpl] Node 
> <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> got error: Error 
> [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk 
> queue overload.]].
> 2023-05-03 17:51:57:705 +0300 
> [WARNING][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable 
> error during the request type=ActionRequestImpl occurred (will be retried on 
> the randomly selected node): 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>       at 
> java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>       at 
> java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
>       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.TimeoutException
>       ... 7 more
> 2023-05-03 17:51:57:705 +0300 
> [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
> <e92d1429-965a-4cbd-af37-126844f352ca_part_5/node0> append [2268, 2268] 
> failed, status=Status[EIO<1014>: Corrupted LogStorage].
> 2023-05-03 17:51:57:705 +0300 
> [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][FSMCallerImpl] 
> FSMCaller already in error status, ignore new error
> Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, 
> disk queue overload.]]
>       at 
> org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:587)
>       at 
> org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.appendEntries(LogManagerImpl.java:337)
>       at 
> org.apache.ignite.raft.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1621)
>       at 
> org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:305)
>       at 
> org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:288)
>       at 
> org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:217)
>       at 
> org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:181)
>       at 
> com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> 2023-05-03 17:51:57:706 +0300 
> [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][ReplicatorGroupImpl] 
> Fail to find the next candidate.
> 2023-05-03 17:51:57:706 +0300 
> [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
> <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> append [2271, 2271] 
> failed, status=Status[EIO<1014>: Corrupted LogStorage].
> 2023-05-03 17:51:57:706 +0300 
> [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][StateMachineAdapter] 
> onLeaderStop: status=Status[EBADNODE<10009>: Raft node(leader or candidate) 
> is in error.].
> 2023-05-03 17:51:57:706 +0300 
> [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
> <30181391-2f04-40c0-8c35-a77592b0731f_part_5/node0> append [2285, 2285] 
> failed, status=Status[EIO<1014>: Corrupted LogStorage].
> {code}
> More errors could be observed in the attached log archive.
> Actually the log storage is not corrupted but the queue is overloaded. There 
> are some performance improvements in main JRaft repository ( 
> [https://github.com/sofastack/sofa-jraft] ) that may fix that, more exactly 
> it is commit 5de2fbbcabb70ddbefb06a1d3737821781c3e85c in sofa-jraft repo.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to