https://issues.apache.org/jira/browse/IGNITE-17835

On 2022/9/30 18:14, Вячеслав Коптилин wrote:
Hello,

In general, there are two possible ways to handle lost partitions for a cluster that uses Ignite Native Persistence (a Java sketch of option 1 follows the list):
1.
   - Return all failed nodes to baseline topology.
   - Call resetLostPartitions

2.
   - Stop all remaining nodes in the cluster.
   - Start all nodes in the cluster (including the previously failed nodes) and activate the cluster.
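
For option 1, a minimal sketch using the public Java API could look like this (the cache name "City" is taken from your example; the class name and the client configuration file are placeholders):

    import java.util.Collections;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class ResetLostPartitionsSketch {
        public static void main(String[] args) {
            // The configuration file name is a placeholder.
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                // Precondition: all failed baseline nodes have already
                // rejoined the topology.
                // Reset the LOST state of the affected cache's partitions.
                ignite.resetLostPartitions(Collections.singleton("City"));
            }
        }
    }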

It’s important to return all failed nodes to the topology before calling resetLostPartitions; otherwise, the cluster could end up serving stale data.

If some owners cannot be returned to the topology for some reason, they should be excluded from the baseline before attempting to reset the lost partition state; otherwise, a ClusterTopologyCheckedException will be thrown with the message "Cannot reset lost partitions because no baseline nodes are online [cache=someCache, partition=someLostPart]", indicating that safe recovery is not possible.
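
If the failed owners are gone for good, the baseline can be shrunk to the nodes that are currently online before resetting. A sketch, assuming baseline auto-adjust is disabled (the class and method names are placeholders):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCluster;

    public class ExcludeFailedOwnersSketch {
        // Shrink the baseline to the current topology, dropping offline nodes.
        public static void excludeOfflineBaselineNodes(Ignite ignite) {
            IgniteCluster cluster = ignite.cluster();

            // Setting the baseline to the current topology version keeps only
            // the server nodes that are online right now.
            cluster.setBaselineTopology(cluster.topologyVersion());
        }
    }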

In your particular case, the cache has no backups, so returning the node that holds a lost partition should not lead to data inconsistencies. This particular case can be detected and automatically "resolved". I will file a JIRA ticket to address this improvement.

Thanks,
Slava.

пн, 26 сент. 2022 г. в 16:51, 38797715 <[email protected]>:

    Hello,

    Start two nodes with Ignite Native Persistence enabled, and then activate the cluster.
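
    (For reference, a minimal sketch of such a node setup and activation,
    assuming the default data region; the class name is a placeholder:)

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.ClusterState;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class PersistentNodeSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Enable Ignite Native Persistence for the default data region.
            DataStorageConfiguration storage = new DataStorageConfiguration();
            storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
            cfg.setDataStorageConfiguration(storage);

            Ignite ignite = Ignition.start(cfg);

            // A persistent cluster starts inactive; activate it once all
            // baseline nodes are up (call this on one node only).
            ignite.cluster().state(ClusterState.ACTIVE);
        }
    }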

    Create a table with no backups; the SQL is as follows:

    CREATE TABLE City (
      ID INT,
      Name VARCHAR,
      CountryCode CHAR(3),
      District VARCHAR,
      Population INT,
      PRIMARY KEY (ID, CountryCode)
    ) WITH "template=partitioned, affinityKey=CountryCode,
    CACHE_NAME=City, KEY_TYPE=demo.model.CityKey,
    VALUE_TYPE=demo.model.City";

    INSERT INTO City(ID, Name, CountryCode, District, Population)
    VALUES (1,'Kabul','AFG','Kabol',1780000);
    INSERT INTO City(ID, Name, CountryCode, District, Population)
    VALUES (2,'Qandahar','AFG','Qandahar',237500);

    Then execute SELECT COUNT(*) FROM City;

    The result is normal.

    Then kill one of the nodes.

    Then execute SELECT COUNT(*) FROM City; it fails with:

    Failed to execute query because cache partition has been lostPart
    [cacheName=City, part=0]

    This is also normal, since the cache has no backups and the partition's only owner is offline.

    Next, restart the node that was shut down earlier.

    Then execute SELECT COUNT(*) FROM City; again. It still fails with:

    Failed to execute query because cache partition has been lostPart
    [cacheName=City, part=0]

    At this time, all partitions have been recovered and all baseline
    nodes are ONLINE. Why is this error still reported? It is very
    confusing. Executing the reset_lost_partitions operation at this
    point seems redundant. Are there any special considerations here?
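
    (For reference, the LOST state can be inspected programmatically.
    A minimal sketch against the City cache; the class name and the
    client configuration file are placeholders:)

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class LostPartitionsCheckSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                IgniteCache<?, ?> cache = ignite.cache("City");

                // If this prints a non-empty set, queries touching those
                // partitions fail until the LOST state is reset.
                System.out.println("Lost partitions: " + cache.lostPartitions());
            }
        }
    }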

    If the whole cluster is restarted at this point and SELECT
    COUNT(*) FROM City; is executed again, it works normally. The
    cluster is in the same state as before, but the behavior is
    different.



