There are lots of "throttling" warnings. One possibility is simply that your
cluster is at its limit; faster or more disks might help, as might scaling
out. The other is that you've enabled write throttling.
Counter-intuitively, you might want to *dis*able that. Ignite will still
throttle writes, just using a different algorithm.
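
For reference, that's this property in the DataStorageConfiguration quoted
further down the thread (just a sketch of the one flag to flip; everything
else stays as you have it):

    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <!-- with this off, Ignite still throttles writes,
             just with a different algorithm -->
        <property name="writeThrottlingEnabled" value="false"/>
        <!-- rest of your existing dataStorageConfiguration unchanged -->
    </bean>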

On Tue, 31 Oct 2023 at 15:35, John Smith <java.dev....@gmail.com> wrote:

> I understand you have no time, and I have also followed that link. My nodes
> have 32GB and I have allocated 8GB for heap plus some for off-heap, so I'm
> definitely not hitting a ceiling where it needs to force a huge garbage
> collection.
>
> What I'm asking is: based on the config and stats I gave, do you see anything
> that sticks out in those configs (not the logs)?
>
> On Tue, Oct 31, 2023 at 10:42 AM Stephen Darlington <
> sdarling...@apache.org> wrote:
>
>> No, sorry, the issue is that I don't have the time to go through 25,000
>> lines of log file. As I said, your cluster had network or long JVM pause
>> issues, probably the latter:
>>
>> [21:37:12,517][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>> Possible too long JVM pause: 63356 milliseconds.
>>
>> Nodes are continually talking to one another, so no Ignite code being
>> executed for over a minute is going to be a *big* problem. You need to
>> tune your JVM. There are some hints in the documentation:
>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning
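>>
>> If you want to confirm whether that pause really is garbage collection (as
>> opposed to, say, swapping), GC logs are the quickest check. A sketch for the
>> Java 8 / G1 setup you posted (the log path is a placeholder):
>>
>>     JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/path/to/gc.log"
>>
>> The "application threads were stopped" entries will show whether that
>> 63-second gap was spent in GC or somewhere else.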
>>
>>
>> On Tue, 31 Oct 2023 at 13:16, John Smith <java.dev....@gmail.com> wrote:
>>
>>> Does any of this info help? I included more or less what we do, plus
>>> stats and configs.
>>>
>>> There are 9 caches, of which the biggest one is 5 million records
>>> (partitioned with 1 backup); the key is a String (11 chars) and the value
>>> an integer.
>>>
>>> The rest are replicated, and some partitioned, but at most a few thousand
>>> records each.
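>>>
>>> (Back-of-envelope, and the per-entry overhead figure is a guess: at roughly
>>> 200 bytes of overhead per entry on top of an 11-char key and an int,
>>> 5,000,000 entries comes out on the order of 1GB off-heap, so the data set
>>> itself is small.)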
>>>
>>> The nodes have 32GB; here is the output of free -m:
>>>
>>>               total        used        free      shared  buff/cache   available
>>> Mem:          32167        2521       26760           0        2885       29222
>>> Swap:          2047           0        2047
>>>
>>> And here are the node stats:
>>>
>>> Time of the snapshot: 2023-10-31 13:08:56
>>>
>>> +------------------------------------------------------------------------+
>>> | ID                          | e8044c1a-6e0d-4f94-9a04-0711a3d7fc6e     |
>>> | ID8                         | E8044C1A                                 |
>>> | Consistent ID               | b14350a9-6963-442c-9529-14f70f95a6d9     |
>>> | Node Type                   | Server                                   |
>>> | Order                       | 2660                                     |
>>> | Address (0)                 | xxxxxx                                   |
>>> | Address (1)                 | 127.0.0.1                                |
>>> | Address (2)                 | 0:0:0:0:0:0:0:1%lo                       |
>>> | OS info                     | Linux amd64 4.15.0-197-generic           |
>>> | OS user                     | ignite                                   |
>>> | Deployment mode             | SHARED                                   |
>>> | Language runtime            | Java Platform API Specification ver. 1.8 |
>>> | Ignite version              | 2.12.0                                   |
>>> | Ignite instance name        | xxxxxx                                   |
>>> | JRE information             | HotSpot 64-Bit Tiered Compilers          |
>>> | JVM start time              | 2023-09-29 14:50:39                      |
>>> | Node start time             | 2023-09-29 14:54:34                      |
>>> | Up time                     | 09:28:57.946                             |
>>> | CPUs                        | 4                                        |
>>> | Last metric update          | 2023-10-31 13:07:49                      |
>>> | Non-loopback IPs            | xxxxxx, xxxxxx                           |
>>> | Enabled MACs                | xxxxxx                                   |
>>> | Maximum active jobs         | 1                                        |
>>> | Current active jobs         | 0                                        |
>>> | Average active jobs         | 0.01                                     |
>>> | Maximum waiting jobs        | 0                                        |
>>> | Current waiting jobs        | 0                                        |
>>> | Average waiting jobs        | 0.00                                     |
>>> | Maximum rejected jobs       | 0                                        |
>>> | Current rejected jobs       | 0                                        |
>>> | Average rejected jobs       | 0.00                                     |
>>> | Maximum cancelled jobs      | 0                                        |
>>> | Current cancelled jobs      | 0                                        |
>>> | Average cancelled jobs      | 0.00                                     |
>>> | Total rejected jobs         | 0                                        |
>>> | Total executed jobs         | 2                                        |
>>> | Total cancelled jobs        | 0                                        |
>>> | Maximum job wait time       | 0ms                                      |
>>> | Current job wait time       | 0ms                                      |
>>> | Average job wait time       | 0.00ms                                   |
>>> | Maximum job execute time    | 11ms                                     |
>>> | Current job execute time    | 0ms                                      |
>>> | Average job execute time    | 5.50ms                                   |
>>> | Total busy time             | 5733919ms                                |
>>> | Busy time %                 | 0.21%                                    |
>>> | Current CPU load %          | 1.93%                                    |
>>> | Average CPU load %          | 4.35%                                    |
>>> | Heap memory initialized     | 504mb                                    |
>>> | Heap memory used            | 310mb                                    |
>>> | Heap memory committed       | 556mb                                    |
>>> | Heap memory maximum         | 8gb                                      |
>>> | Non-heap memory initialized | 2mb                                      |
>>> | Non-heap memory used        | 114mb                                    |
>>> | Non-heap memory committed   | 119mb                                    |
>>> | Non-heap memory maximum     | 0                                        |
>>> | Current thread count        | 125                                      |
>>> | Maximum thread count        | 140                                      |
>>> | Total started thread count  | 409025                                   |
>>> | Current daemon thread count | 15                                       |
>>> +------------------------------------------------------------------------+
>>>
>>> Data region metrics:
>>>
>>> +==========================================================================================================================+
>>> |       Name       | Page size |       Pages        |    Memory     |      Rates       | Checkpoint buffer | Large entries |
>>> +==========================================================================================================================+
>>> | Default_Region   | 0         | Total:  307665     | Total:  1gb   | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>> |                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
>>> |                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
>>> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
>>> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
>>> | metastoreMemPlc  | 0         | Total:  57         | Total:  228kb | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>> |                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
>>> |                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
>>> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
>>> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
>>> | sysMemPlc        | 0         | Total:  5          | Total:  20kb  | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>> |                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
>>> |                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
>>> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
>>> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
>>> | TxLog            | 0         | Total:  0          | Total:  0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>> |                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
>>> |                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
>>> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
>>> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
>>> | volatileDsMemPlc | 0         | Total:  0          | Total:  0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>> |                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
>>> |                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
>>> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
>>> +--------------------------------------------------------------------------------------------------------------------------+
>>>
>>> Server nodes config...
>>>
>>> if [ -z "$JVM_OPTS" ] ; then
>>>     JVM_OPTS="-Xms8g -Xmx8g -server -XX:MaxMetaspaceSize=256m"
>>> fi
>>>
>>> #
>>> # Uncomment the following GC settings if you see spikes in your
>>> # throughput due to Garbage Collection.
>>> #
>>> # JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>>> JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=256m"
>>>
>>> And we use this as our persistence config...
>>>
>>>       <property name="dataStorageConfiguration">
>>>         <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>>           <property name="writeThrottlingEnabled" value="true"/>
>>>
>>>           <!-- Redefining the default region's settings -->
>>>           <property name="defaultDataRegionConfiguration">
>>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>>               <property name="persistenceEnabled" value="true"/>
>>>
>>>               <property name="name" value="Default_Region"/>
>>>               <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>>>             </bean>
>>>           </property>
>>>         </bean>
>>>       </property>
>>>
>>> On Tue, Oct 31, 2023 at 5:27 AM Stephen Darlington <
>>> sdarling...@apache.org> wrote:
>>>
>>>> There's a lot going on in that log file. It makes it difficult to tell
>>>> what *the* issue is. You have lots of nodes leaving (and joining) the
>>>> cluster, including server nodes. You have lost partitions and long JVM
>>>> pauses. I suspect the real cause of this node shutting down was that it
>>>> became segmented.
>>>>
>>>> Chances are that either a genuine network issue or the long JVM pauses
>>>> (during which the nodes are not talking to each other) caused the cluster
>>>> to fall apart.
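>>>>
>>>> If the pauses can't be eliminated straight away, one stopgap (a sketch,
>>>> and no substitute for actually fixing the pauses) is to give nodes more
>>>> headroom before the cluster drops them, on the IgniteConfiguration:
>>>>
>>>>     <property name="failureDetectionTimeout" value="30000"/>
>>>>
>>>> The default is 10 seconds, so a node that goes silent for over a minute
>>>> will always be treated as failed.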
>>>>
>>>
