Thanks Igor for the suggestions.
However, in the logs the JVM pauses do not account for much time (around 500 milliseconds), so I assume GC is not the only cause of the slowdown; I suspect there are other reasons for this strange behavior.
Even with the WAL disabled, the import still slows down after a few files.
Is there any way to figure out why so many THREAD WAITs are happening? Because of them, SEVERE errors are being logged on the Ignite nodes. Below are some sample errors from the log file I shared. Please help me here.
[05:02:45,413][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=data-streamer-stripe-0, blockedFor=34s]
[05:02:45,414][WARNING][tcp-disco-msg-worker-#2][G] Thread [name="data-streamer-stripe-0-#9", id=22, state=RUNNABLE, blockCnt=0, waitCnt=4375]
[05:02:45,415][SEVERE][tcp-disco-msg-worker-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=data-streamer-stripe-0, igniteInstanceName=null, finished=false, heartbeatTs=1569214931358]]]
class org.apache.ignite.IgniteException: GridWorker [name=data-streamer-stripe-0, igniteInstanceName=null, finished=false, heartbeatTs=1569214931358]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
    at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
    at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Thread [name="sys-#367", id=468, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a4b4b27, ownerName=null, ownerId=-1]
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Thread [name="sys-#366", id=467, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a4b4b27, ownerName=null, ownerId=-1]
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Regards,
Favas
From: Igor Belyakov <[email protected]>
Sent: Monday, September 23, 2019 7:37 PM
To: [email protected]
Subject: Re: Ignite bulk data load issue
Hi,
According to the logs, I see a lot of "Possible too long JVM pause" messages; it looks like you have a problem with GC activity. You should check the GC logs and try to tune GC following this manual:
https://apacheignite.readme.io/docs/jvm-and-system-tuning
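For reference, a typical starting point from that manual looks like the following (a sketch for JDK 8; the heap size is an assumption and must be sized for your nodes, leaving room for Ignite's off-heap data region):

```
-server
-Xms10g -Xmx10g
-XX:+UseG1GC
-XX:+ScavengeBeforeFullGC
-XX:+DisableExplicitGC
-Xloggc:/var/log/ignite-gc.log
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
```

The GC log produced by the last three flags is what you would inspect to confirm or rule out long pauses.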
Also, it is suggested to disable the WAL during the initial data load to reduce the amount of disk I/O. See "Performance tips" at the bottom of the following page for more details:
https://apacheignite.readme.io/docs/data-loading
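For example, the WAL can be switched off per table for the duration of the load and re-enabled afterwards (assuming Ignite 2.4 or later; the table name is a placeholder):

```sql
-- Before the bulk load:
ALTER TABLE my_table NOLOGGING;

-- ... run the COPY commands ...

-- After the load; re-enabling the WAL forces a checkpoint:
ALTER TABLE my_table LOGGING;
```

Note that while the WAL is disabled the table's data is not crash-safe, so this is only appropriate for an initial load that can be repeated.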
Regards,
Igor
On Mon, Sep 23, 2019 at 1:07 PM Muhammed Favas
<[email protected]<mailto:[email protected]>>
wrote:
Hi,
I need help figuring out some issues identified during a bulk load of data into an Ignite cluster.
My cluster consists of 5 nodes, each with an 8-core CPU, 32 GB RAM, and 30 GB of disk. Ignite native persistence is enabled for the table.
I am trying to load data into my Ignite SQL table from CSV files using the COPY command. Each file contains 50 million records, and I have numerous files of the same size. At the beginning the load was quite fast, but after some time it became very slow, and now it is taking hours to load even a single file. Below are my observations:
* When I trigger the load for the first time after a pause, CPU usage is at a promising level and the data load rate is high.
* After loading 2-3 files, CPU usage drops below 1% and stays there indefinitely.
* If I stop the loading process for a while and restart it, it again performs well for some time, and then the same situation recurs.
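For reference, the load is driven by statements of this shape (a sketch; the file path, table, and column names here are placeholders, not my actual schema):

```sql
COPY FROM '/data/import/file_001.csv'
INTO my_table (id, name, value)
FORMAT CSV;
```

Internally this streams the parsed rows into the cluster, which is why the data-streamer stripes show up in the thread dumps below.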
When I checked the log file, I saw that certain THREAD WAITs are happening; I believe the CPU drop is caused by these waits. I have attached the entire log file.
Can someone help me figure out why this happens in Ignite? Or is there something wrong in my configuration? Below is my configuration file content:
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureDetectionTimeout" value="30000"/>

    <!-- Redefining maximum memory size for the cluster node usage. -->
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="checkpointReadLockTimeout" value="0"/>
            <!-- Set the page size to 4 KB. -->
            <property name="pageSize" value="#{4 * 1024}"/>
            <!-- Increase the WAL segment size to 1 GB (default is 64 MB). -->
            <property name="walSegmentSize" value="#{1L * 1024 * 1024 * 1024}"/>
            <!-- Set the number of WAL segments to 5 (default is 10). -->
            <property name="walSegments" value="5"/>
            <!-- Set the WAL history size to 5 (default is 20). -->
            <property name="walHistorySize" value="5"/>
            <property name="walCompactionEnabled" value="true"/>
            <property name="walCompactionLevel" value="6"/>
            <!-- Enable write throttling. -->
            <property name="writeThrottlingEnabled" value="true"/>
            <!-- Redefining the default region's settings. -->
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="persistenceEnabled" value="true"/>
                    <property name="name" value="Default_Region"/>
                    <!-- Set the size of the default region to 24 GB. -->
                    <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>
                    <!-- Increase the checkpoint buffer size to 2 GB. -->
                    <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>
                </bean>
            </property>
        </bean>
    </property>
    <property name="cacheConfiguration">
        <list>
            <!-- Partitioned cache example configuration. -->
            <bean class="org.apache.ignite.configuration.CacheConfiguration">
                <property name="name" value="default*"/>
                <property name="cacheMode" value="PARTITIONED"/>
                <property name="atomicityMode" value="TRANSACTIONAL"/>
                <property name="queryParallelism" value="8"/>
            </bean>
        </list>
    </property>
</bean>
Regards,
Favas