Re: Cache was inconsistent state

Evgenii Zhuravlev Mon, 11 May 2020 09:24:18 -0700

John,

Yes, client nodes should have this parameter too.


Evgenii

On Mon, 11 May 2020 at 07:54, John Smith <[email protected]> wrote:

> I mean, should both the preferIPv4Stack setting and the ZooKeeper discovery be
> configured on the "central" cluster as well as on all nodes specifically
> marked as client = true?
>
> On Mon, 11 May 2020 at 09:59, John Smith <[email protected]> wrote:
>
>> Should it also be set on the client nodes, the ones specifically
>> configured with setClient = true?
>>
>> On Fri, 8 May 2020 at 22:26, Evgenii Zhuravlev <[email protected]>
>> wrote:
>>
>>> John,
>>>
>>> It looks like a split-brain. They were in one cluster at first. I'm not
>>> sure what the reason for this was; it could be a network problem or
>>> something else.
>>>
>>> I saw in the logs that you use both IPv4 and IPv6. I would recommend using
>>> only one of them to avoid problems: just add
>>> -Djava.net.preferIPv4Stack=true
>>> to all nodes in the cluster.
>>>
>>> Also, to avoid split-brain situations, you can use ZooKeeper Discovery:
>>> https://apacheignite.readme.io/docs/zookeeper-discovery#failures-and-split-brain-handling
>>> or implement a segmentation resolver. More information about the latter
>>> can be found on the forum, for example, here:
>>> http://apache-ignite-users.70518.x6.nabble.com/split-brain-problem-and-GridSegmentationProcessor-td14590.html
>>>
>>> Evgenii
>>>
>>> On Fri, 8 May 2020 at 14:30, John Smith <[email protected]> wrote:
>>>
>>>> How though? It's the same cluster! We haven't changed anything;
>>>> this happened on its own...
>>>>
>>>> All I did was reboot the node and the cluster fixed itself.
>>>>
>>>> On Fri, 8 May 2020 at 15:32, Evgenii Zhuravlev <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> *Yes, it looks like they are in different clusters:*
>>>>> *Metrics from the node with a problem:*
>>>>> [15:17:28,668][INFO][grid-timeout-worker-#23%xxxxxx%][IgniteKernal%xxxxxx]
>>>>>
>>>>> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>>>>>     ^-- Node [id=5bbf262e, name=xxxxxx, uptime=93 days, 19:36:10.921]
>>>>>     ^-- H/N/C [hosts=3, nodes=4, CPUs=10]
>>>>>
>>>>> *Metrics from another node:*
>>>>> [15:17:05,635][INFO][grid-timeout-worker-#23%xxxxxx%][IgniteKernal%xxxxxx]
>>>>>
>>>>> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>>>>>     ^-- Node [id=dddefdcd, name=xxxxxx, uptime=19 days, 16:49:48.381]
>>>>>     ^-- H/N/C [hosts=6, nodes=7, CPUs=21]
>>>>>
>>>>> *The same topology version on the 2 nodes shows different sets of nodes:*
>>>>> [03:56:17,643][INFO][disco-event-worker-#42%xxxxxx%][GridDiscoveryManager]
>>>>> Topology snapshot [ver=1036, locNode=5bbf262e, servers=1, clients=3,
>>>>> state=ACTIVE, CPUs=10, offheap=10.0GB, heap=13.0GB]
>>>>> [03:56:17,643][INFO][disco-event-worker-#42%xxxxxx%][GridDiscoveryManager]
>>>>>   ^-- Baseline [id=0, size=3, online=1, offline=2]
>>>>>
>>>>> *And*
>>>>>
>>>>> [03:56:43,388][INFO][disco-event-worker-#42%xxxxxx%][GridDiscoveryManager]
>>>>> Topology snapshot [ver=1036, locNode=4394fdd4, servers=2, clients=2,
>>>>> state=ACTIVE, CPUs=15, offheap=20.0GB, heap=19.0GB]
>>>>> [03:56:43,389][INFO][disco-event-worker-#42%xxxxxx%][GridDiscoveryManager]
>>>>>   ^-- Baseline [id=0, size=3, online=2, offline=1]
>>>>>
>>>>> So, it's just 2 different clusters.
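
The comparison above can be automated when there are many log files to go through. Below is a minimal Python sketch, assuming each node's log is available as text and contains "Topology snapshot" entries in the format quoted in this thread. The function names and the dictionary keys are hypothetical and are not part of Ignite.

```python
import re
from collections import defaultdict

# Fields of a "Topology snapshot" log entry, in the format quoted in this thread.
SNAPSHOT_RE = re.compile(
    r"Topology snapshot \[ver=(?P<ver>\d+), locNode=(?P<node>\w+), "
    r"servers=(?P<servers>\d+), clients=(?P<clients>\d+)"
)


def parse_snapshots(log_text):
    """Yield (version, local node id, servers, clients) for each snapshot found."""
    # Entries may be wrapped across lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    for m in SNAPSHOT_RE.finditer(flat):
        yield int(m["ver"]), m["node"], int(m["servers"]), int(m["clients"])


def find_diverging_versions(logs_by_name):
    """Return {version: {(servers, clients): [node ids]}} for every topology
    version that was logged with more than one (servers, clients) view."""
    views = defaultdict(lambda: defaultdict(list))
    for text in logs_by_name.values():
        for ver, node, servers, clients in parse_snapshots(text):
            views[ver][(servers, clients)].append(node)
    return {ver: dict(v) for ver, v in views.items() if len(v) > 1}
```

Fed the two snapshots quoted above, `find_diverging_versions` reports version 1036 with two different views, (servers=1, clients=3) on node 5bbf262e and (servers=2, clients=2) on node 4394fdd4. A non-empty result is only a hint that the nodes may have split into separate clusters; the full logs still need to be checked.
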
>>>>>
>>>>> Best Regards,
>>>>> Evgenii
>>>>>
>>>>> On Fri, 8 May 2020 at 08:50, John Smith <[email protected]> wrote:
>>>>>
>>>>>> Hi Evgenii, here are the logs.
>>>>>>
>>>>>> https://www.dropbox.com/s/ke71qsoqg588kc8/ignite-logs.zip?dl=0
>>>>>>
>>>>>> On Fri, 8 May 2020 at 09:21, John Smith <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> OK, let me try to get them...
>>>>>>>
>>>>>>> On Thu., May 7, 2020, 1:14 p.m. Evgenii Zhuravlev, <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> It looks like the third server node was not part of this cluster
>>>>>>>> before the restart. Can you share full logs from all server nodes?
>>>>>>>>
>>>>>>>> Evgenii
>>>>>>>>
>>>>>>>> On Thu, 7 May 2020 at 09:11, John Smith <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi, running 2.7.0 on 3 nodes deployed on VMs running Ubuntu.
>>>>>>>>>
>>>>>>>>> I checked the state of the cluster by going
>>>>>>>>> to: /ignite?cmd=currentState
>>>>>>>>> And the response was: 
>>>>>>>>> {"successStatus":0,"error":null,"sessionToken":null,"response":true}
>>>>>>>>> I also checked: /ignite?cmd=size&cacheName=....
>>>>>>>>>
>>>>>>>>> 2 nodes were reporting 3 million records
>>>>>>>>> 1 node was reporting 2 million records.
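
The per-node size check described above can be scripted. Here is a rough Python sketch, assuming the REST API is reachable over HTTP on every server node and that replies use the JSON envelope shown above. The host names, port 8080 and cache name are placeholders, and the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def parse_rest_response(body):
    """Return the 'response' field of a REST reply, or raise if the call failed."""
    reply = json.loads(body)
    if reply.get("successStatus") != 0:
        raise RuntimeError(f"REST call failed: {reply.get('error')}")
    return reply["response"]


def fetch_cache_size(base_url, cache_name, timeout=5.0):
    """Ask one node for the cache size via /ignite?cmd=size&cacheName=..."""
    query = urllib.parse.urlencode({"cmd": "size", "cacheName": cache_name})
    with urllib.request.urlopen(f"{base_url}/ignite?{query}", timeout=timeout) as resp:
        return parse_rest_response(resp.read().decode("utf-8"))


def sizes_disagree(sizes_by_node):
    """True when the nodes do not all report the same size."""
    return len(set(sizes_by_node.values())) > 1


# Example (hypothetical hosts and cache name):
# nodes = ("http://node1:8080", "http://node2:8080", "http://node3:8080")
# sizes = {n: fetch_cache_size(n, "myCache") for n in nodes}
# print(sizes, "MISMATCH" if sizes_disagree(sizes) else "consistent")
```

Sizes can differ briefly while writes or TTL expiry are in progress, so a single mismatch is only a prompt to look at the topology, not proof of a problem.
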
>>>>>>>>>
>>>>>>>>> When I connected to Visor and ran the node command, the details
>>>>>>>>> were wrong: it showed only 2 server nodes and only 1 client, but 3
>>>>>>>>> server nodes actually exist and more clients are connected.
>>>>>>>>>
>>>>>>>>> So I rebooted the node that was claiming 2 million records instead
>>>>>>>>> of 3, and when I re-ran the node command it displayed all the proper
>>>>>>>>> nodes.
>>>>>>>>> Also, after the reboot all the nodes started reporting 2 million
>>>>>>>>> records instead of 3 million, so was there some sort of rebalancing or
>>>>>>>>> correction (the cache has a 90 day TTL)?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Before reboot
>>>>>>>>>
>>>>>>>>> +=============================================================================================================================+
>>>>>>>>> | # |       Node ID8(@), IP       |            Consistent ID            | Node Type | Up Time  | CPUs | CPU Load | Free Heap |
>>>>>>>>> +=============================================================================================================================+
>>>>>>>>> | 0 | xxxxxx(@n0), xxxxxx.69 | xxxxxx | Server    | 20:25:30 | 4    | 1.27 %   | 84.00 %   |
>>>>>>>>> | 1 | xxxxxx(@n1), xxxxxx.1 | xxxxxx | Client    | 13:12:01 | 3    | 0.67 %   | 74.00 %   |
>>>>>>>>> | 2 | xxxxxx(@n2), xxxxxx.63 | xxxxxx | Server    | 16:55:05 | 4    | 6.57 %   | 84.00 %   |
>>>>>>>>> +-----------------------------------------------------------------------------------------------------------------------------+
>>>>>>>>>
>>>>>>>>> After reboot
>>>>>>>>>
>>>>>>>>> +=============================================================================================================================+
>>>>>>>>> | # |       Node ID8(@), IP       |            Consistent ID            | Node Type | Up Time  | CPUs | CPU Load | Free Heap |
>>>>>>>>> +=============================================================================================================================+
>>>>>>>>> | 0 | xxxxxx(@n0), xxxxxx.69 | xxxxxx | Server    | 21:13:45 | 4    | 0.77 %   | 56.00 %   |
>>>>>>>>> | 1 | xxxxxx(@n1), xxxxxx.1 | xxxxxx | Client    | 14:00:17 | 3    | 0.77 %   | 56.00 %   |
>>>>>>>>> | 2 | xxxxxx(@n2), xxxxxx.63 | xxxxxx | Server    | 17:43:20 | 4    | 1.00 %   | 60.00 %   |
>>>>>>>>> | 3 | xxxxxx(@n3), xxxxxx.65 | xxxxxx | Client    | 01:42:45 | 4    | 4.10 %   | 56.00 %   |
>>>>>>>>> | 4 | xxxxxx(@n4), xxxxxx.65 | xxxxxx | Client    | 01:42:45 | 4    | 3.93 %   | 56.00 %   |
>>>>>>>>> | 5 | xxxxxx(@n5), xxxxxx.1 | xxxxxx | Client    | 16:59:53 | 2    | 0.67 %   | 91.00 %   |
>>>>>>>>> | 6 | xxxxxx(@n6), xxxxxx.79 | xxxxxx | Server    | 00:41:31 | 4    | 1.00 %   | 97.00 %   |
>>>>>>>>> +-----------------------------------------------------------------------------------------------------------------------------+
>>>>>>>>>
>>>>>>>>
