https://issues.apache.org/jira/browse/IGNITE-17835
On 2022/9/30 18:14, Вячеслав Коптилин wrote:
Hello,
In general, there are two possible ways to handle lost partitions for a
cluster that uses Ignite Native Persistence:
1.
- Return all failed nodes to the baseline topology.
- Call resetLostPartitions.
2.
- Stop all remaining nodes in the cluster.
- Start all nodes in the cluster (including the previously failed
nodes) and activate the cluster.
It's important to return all failed nodes to the topology before
calling resetLostPartitions; otherwise, the cluster could end up
serving stale data.
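For the first option, a minimal Java sketch (the cache name "City" and
the configuration file path are placeholders for illustration):

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class ResetLostPartitionsExample {
    public static void main(String[] args) {
        // Start a node using a hypothetical configuration file.
        Ignite ignite = Ignition.start("ignite-config.xml");

        IgniteCache<?, ?> cache = ignite.cache("City");

        // Partitions stay in the LOST state even after all failed
        // baseline owners have rejoined the topology.
        System.out.println("Lost before reset: " + cache.lostPartitions());

        // Explicitly clear the lost state once all owners are back.
        ignite.resetLostPartitions(Collections.singleton("City"));

        System.out.println("Lost after reset: " + cache.lostPartitions());
    }
}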
If some owners cannot be returned to the topology for some reason,
they should be excluded from the baseline before attempting to reset
the lost partition state; otherwise, a ClusterTopologyCheckedException
will be thrown with the message "Cannot reset lost partitions because
no baseline nodes are online [cache=someCache, partition=someLostPart]",
indicating that safe recovery is not possible.
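If the failed owners are gone for good, the baseline can first be
shrunk to the server nodes that are still online; a sketch along the
same lines (assuming baseline auto-adjust is disabled and the baseline
is managed manually):

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCluster;
import org.apache.ignite.Ignition;

public class ShrinkBaselineExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("ignite-config.xml"); // hypothetical path

        IgniteCluster cluster = ignite.cluster();

        // Rebuild the baseline from the currently online server nodes,
        // dropping the permanently failed owners from it.
        cluster.setBaselineTopology(cluster.forServers().nodes());

        // Now the lost state can be reset without the exception above.
        ignite.resetLostPartitions(Collections.singleton("City"));
    }
}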
In your particular case, the cache has no backups, so returning the
node that holds a lost partition should not lead to data
inconsistencies.
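As a side note, giving the cache at least one backup avoids the lost
state on a single node failure in the first place; a configuration
sketch (names are illustrative):

import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

public class CityCacheConfig {
    public static CacheConfiguration<Object, Object> create() {
        CacheConfiguration<Object, Object> cfg = new CacheConfiguration<>("City");

        cfg.setCacheMode(CacheMode.PARTITIONED);

        // One backup copy per partition: losing a single node no longer
        // marks any partition as lost.
        cfg.setBackups(1);

        // Fail reads and writes against lost partitions instead of
        // silently serving stale data.
        cfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

        return cfg;
    }
}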
This particular case can be detected and automatically "resolved". I
will file a JIRA ticket to address this improvement.
Thanks,
Slava.
Mon, Sep 26, 2022 at 16:51, 38797715 <[email protected]>:
Hello,
Start two nodes with native persistence enabled, and then activate the
cluster. Create a table with no backups, with SQL as follows:
CREATE TABLE City (
    ID INT,
    Name VARCHAR,
    CountryCode CHAR(3),
    District VARCHAR,
    Population INT,
    PRIMARY KEY (ID, CountryCode)
) WITH "template=partitioned, affinityKey=CountryCode,
    CACHE_NAME=City, KEY_TYPE=demo.model.CityKey,
    VALUE_TYPE=demo.model.City";
INSERT INTO City(ID, Name, CountryCode, District, Population)
VALUES (1,'Kabul','AFG','Kabol',1780000);
INSERT INTO City(ID, Name, CountryCode, District, Population)
VALUES (2,'Qandahar','AFG','Qandahar',237500);
Then execute SELECT COUNT(*) FROM City; the result is normal.
Then kill one node and execute SELECT COUNT(*) FROM City; again:
Failed to execute query because cache partition has been lostPart
[cacheName=City, part=0]
This is also normal.
Next, start the node that was shut down before and execute
SELECT COUNT(*) FROM City; again:
Failed to execute query because cache partition has been lostPart
[cacheName=City, part=0]
At this time, all partitions have been recovered and all baseline
nodes are ONLINE. Why is this error still reported? It is very
confusing. Executing the reset_lost_partitions operation at this
point seems redundant. Are there any special considerations here?
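For reference, the lost state can be observed directly from the Java
API; a minimal sketch (the configuration path is a placeholder):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class CheckLostPartitions {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("ignite-config.xml"); // hypothetical path

        // Even after the killed node rejoins, this still prints the lost
        // partition set (e.g. [0]); it only becomes empty after
        // resetLostPartitions or a full cluster restart.
        System.out.println(ignite.cache("City").lostPartitions());
    }
}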
If the whole cluster is restarted at this point, then SELECT
COUNT(*) FROM City; works normally. This state is the same as the
previous one, but the behavior is different.