https://issues.apache.org/jira/browse/IGNITE-17835

On 2022/9/30 18:14, Вячеслав Коптилин wrote:
Hello,

In general, there are two possible ways to handle lost partitions for a cluster that uses Ignite Native Persistence (a Java sketch of option 1 follows the list):
1.
   - Return all failed nodes to baseline topology.
   - Call resetLostPartitions

2.
   - Stop all remaining nodes in the cluster.
   - Start all nodes in the cluster (including the previously failed nodes) and activate the cluster.
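
For option 1, a minimal sketch using the public Java API could look like this (the cache name "City" is taken from your example; the class name and the client configuration file are placeholders):

    import java.util.Collections;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class ResetLostPartitionsSketch {
        public static void main(String[] args) {
            // The configuration file name is a placeholder.
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                // Precondition: all failed baseline nodes have already
                // rejoined the topology.
                // Reset the LOST state of the affected cache's partitions.
                ignite.resetLostPartitions(Collections.singleton("City"));
            }
        }
    }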

It’s important to return all failed nodes to the topology before calling resetLostPartitions; otherwise, the cluster could end up serving stale data.

If some owners cannot be returned to the topology for some reason, they should be excluded from the baseline before attempting to reset the lost partition state; otherwise, a ClusterTopologyCheckedException will be thrown with the message "Cannot reset lost partitions because no baseline nodes are online [cache=someCache, partition=someLostPart]", indicating that safe recovery is not possible.
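
If the failed owners are gone for good, the baseline can be shrunk to the nodes that are currently online before resetting. A sketch, assuming baseline auto-adjust is disabled (the class and method names are placeholders):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCluster;

    public class ExcludeFailedOwnersSketch {
        // Shrink the baseline to the current topology, dropping offline nodes.
        public static void excludeOfflineBaselineNodes(Ignite ignite) {
            IgniteCluster cluster = ignite.cluster();

            // Setting the baseline to the current topology version keeps only
            // the server nodes that are online right now.
            cluster.setBaselineTopology(cluster.topologyVersion());
        }
    }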

In your particular case, the cache has no backups, so returning the node that holds a lost partition should not lead to data inconsistencies. This particular case can be detected and automatically "resolved". I will file a JIRA ticket to address this improvement.

Thanks,
Slava.

пн, 26 сент. 2022 г. в 16:51, 38797715 <[email protected]>:

    Hello,

    Start two nodes with Ignite Native Persistence enabled, and then activate the cluster.
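
    (For reference, a minimal sketch of such a node setup and activation,
    assuming the default data region; the class name is a placeholder:)

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.ClusterState;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class PersistentNodeSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Enable Ignite Native Persistence for the default data region.
            DataStorageConfiguration storage = new DataStorageConfiguration();
            storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
            cfg.setDataStorageConfiguration(storage);

            Ignite ignite = Ignition.start(cfg);

            // A persistent cluster starts inactive; activate it once all
            // baseline nodes are up (call this on one node only).
            ignite.cluster().state(ClusterState.ACTIVE);
        }
    }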

    Create a table with no backups; the SQL is as follows:

    CREATE TABLE City (
      ID INT,
      Name VARCHAR,
      CountryCode CHAR(3),
      District VARCHAR,
      Population INT,
      PRIMARY KEY (ID, CountryCode)
    ) WITH "template=partitioned, affinityKey=CountryCode,
    CACHE_NAME=City, KEY_TYPE=demo.model.CityKey,
    VALUE_TYPE=demo.model.City";

    INSERT INTO City(ID, Name, CountryCode, District, Population)
    VALUES (1,'Kabul','AFG','Kabol',1780000);
    INSERT INTO City(ID, Name, CountryCode, District, Population)
    VALUES (2,'Qandahar','AFG','Qandahar',237500);

    Then execute SELECT COUNT(*) FROM City;

    The result is normal.

    Then kill one of the nodes.

    Then execute SELECT COUNT(*) FROM City; it fails with:

    Failed to execute query because cache partition has been lostPart
    [cacheName=City, part=0]

    This is also normal, since the cache has no backups and the partition's only owner is offline.

    Next, restart the node that was shut down earlier.

    Then execute SELECT COUNT(*) FROM City; again. It still fails with:

    Failed to execute query because cache partition has been lostPart
    [cacheName=City, part=0]

    At this time, all partitions have been recovered and all baseline
    nodes are ONLINE. Why is this error still reported? It is very
    confusing. Executing the reset_lost_partitions operation at this
    point seems redundant. Are there any special considerations here?
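
    (For reference, the LOST state can be inspected programmatically.
    A minimal sketch against the City cache; the class name and the
    client configuration file are placeholders:)

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class LostPartitionsCheckSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                IgniteCache<?, ?> cache = ignite.cache("City");

                // If this prints a non-empty set, queries touching those
                // partitions fail until the LOST state is reset.
                System.out.println("Lost partitions: " + cache.lostPartitions());
            }
        }
    }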

    If the whole cluster is restarted at this point and SELECT
    COUNT(*) FROM City; is executed again, it works normally. The
    cluster is in the same state as before, but the behavior is
    different.



