Re: "Idle verify" to "Online verify"

2019-05-07 Thread Alexei Scherbakov
I've linked additional related/dependent tickets to IGNITE-10078; they require investigation as soon as the main contribution is accepted. Tue, 7 May 2019 at 18:10, Alexei Scherbakov : > Anton, > > It's ready for review, look for the Patch Available status. > Yes, atomic caches are not fixed by this

Re: "Idle verify" to "Online verify"

2019-05-07 Thread Alexei Scherbakov
Anton, It's ready for review; look for the Patch Available status. Yes, atomic caches are not fixed by this contribution. See [1] [1] https://issues.apache.org/jira/browse/IGNITE-11797 Tue, 7 May 2019 at 17:30, Anton Vinogradov : > Alexei, > > Got it. > Could you please let me know once the PR

Re: "Idle verify" to "Online verify"

2019-05-07 Thread Anton Vinogradov
Alexei, Got it. Could you please let me know once the PR is ready for review? Currently I have some questions, but possibly they are caused by the non-final PR (e.g. why does the atomic counter still ignore misses?). On Tue, May 7, 2019 at 4:43 PM Alexei Scherbakov < alexey.scherbak...@gmail.com> wrote: >

Re: "Idle verify" to "Online verify"

2019-05-07 Thread Alexei Scherbakov
Anton, 1) Extended counters will indeed answer the question of whether a partition can be safely restored to a synchronized state on all owners. The only condition is that one of the owners has no missed updates. If not, the partition must be moved to the LOST state, see [1],
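
A minimal sketch of the recovery decision described above, assuming hypothetical OwnerState and Decision types (names are illustrative, not the actual Ignite API):

    import java.util.List;

    // Illustrative sketch only: a partition can be restored if at least one owner
    // has no missed updates; otherwise it must be marked LOST.
    public final class PartitionRecoveryDecision {

        public enum Decision { RESTORE_FROM_CLEAN_OWNER, MARK_LOST }

        public static final class OwnerState {
            final String nodeId;
            final boolean hasMissedUpdates;

            OwnerState(String nodeId, boolean hasMissedUpdates) {
                this.nodeId = nodeId;
                this.hasMissedUpdates = hasMissedUpdates;
            }
        }

        /** If at least one owner has no gaps in its update counter, the others can be rebalanced from it. */
        public static Decision decide(List<OwnerState> owners) {
            boolean cleanOwnerExists = owners.stream().anyMatch(o -> !o.hasMissedUpdates);

            return cleanOwnerExists ? Decision.RESTORE_FROM_CLEAN_OWNER : Decision.MARK_LOST;
        }
    }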

Re: "Idle verify" to "Online verify"

2019-05-06 Thread Anton Vinogradov
Ivan, 1) I've checked the PR [1] and it looks like it does not solve the issue either. AFAICS, the main goal here (in the PR) is to produce PartitionUpdateCounter#sequential, which can be false for all backups; which backup should win in that case? Is there any IEP or other design page for this

Re: "Idle verify" to "Online verify"

2019-05-06 Thread Ivan Rakov
Anton, Automatic quorum-based partition drop may work as a partial workaround for IGNITE-10078, but the discussed approach surely doesn't replace the IGNITE-10078 activity. We still don't know what to do when a quorum can't be reached (2 partitions have hash X, 2 have hash Y), and keeping extended

Re: "Idle verify" to "Online verify"

2019-05-06 Thread Anton Vinogradov
Ivan, just to make sure ... The discussed case will fully solve the issue [1] if we also add some strategy to reject partitions with missed updates (updateCnt==Ok, Hash!=Ok). For example, we may use the Quorum strategy, where the majority wins. Sounds correct? [1]
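
A hedged sketch of the "majority wins" resolution over per-owner partition hashes; it deliberately returns no winner when there is no strict majority (e.g. 2 owners report hash X and 2 report hash Y), which is the unresolved case raised elsewhere in this thread. The class and method names are illustrative, not Ignite code:

    import java.util.List;
    import java.util.Map;
    import java.util.Optional;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    // Illustrative sketch: pick the partition hash reported by a strict majority of owners.
    public final class QuorumHashResolver {

        public static Optional<Long> majorityHash(List<Long> ownerHashes) {
            // Count how many owners reported each hash value.
            Map<Long, Long> votes = ownerHashes.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

            // A hash wins only if more than half of the owners reported it; ties yield empty.
            return votes.entrySet().stream()
                .filter(e -> e.getValue() * 2 > ownerHashes.size())
                .map(Map.Entry::getKey)
                .findFirst();
        }
    }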

Re: "Idle verify" to "Online verify"

2019-04-30 Thread Anton Vinogradov
Ivan, Thanks for the detailed explanation. I'll try to implement a PoC to check the idea. On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov wrote: > > But how to keep this hash? > I think we can just adopt the way partition update counters are stored. > Update counters are: > 1) Kept and updated in

Re: "Idle verify" to "Online verify"

2019-04-29 Thread Ivan Rakov
But how to keep this hash? I think we can just adopt the way partition update counters are stored. Update counters are: 1) Kept and updated in heap, see IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during regular cache operations, no page replacement latency issues) 2)
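
A minimal sketch of the idea of keeping a per-partition hash in heap next to the update counter, in the spirit of CacheDataStoreImpl#pCntr. The class, field and method names are hypothetical, and an order-independent XOR fold is assumed so that updates applied in different orders on primary and backups converge to the same value:

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative sketch only: a rolling partition hash updated on every cache operation.
    public final class PartitionHashHolder {
        private final AtomicLong updateCntr = new AtomicLong();
        private final AtomicLong partHash = new AtomicLong();

        /** Called on every update; folds the entry's (key, version) hash into the partition hash. */
        public void onUpdate(int keyHash, long entryVersion) {
            long entryHash = keyHash * 31L + entryVersion;

            updateCntr.incrementAndGet();
            partHash.accumulateAndGet(entryHash, (cur, h) -> cur ^ h);
        }

        /** Called on remove: XOR is its own inverse, so folding the old entry hash again cancels it out. */
        public void onRemove(int keyHash, long oldEntryVersion) {
            long entryHash = keyHash * 31L + oldEntryVersion;

            partHash.accumulateAndGet(entryHash, (cur, h) -> cur ^ h);
        }

        public long hash() {
            return partHash.get();
        }
    }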

Re: "Idle verify" to "Online verify"

2019-04-29 Thread Anton Vinogradov
Ivan, thanks for the analysis! >> With a pre-calculated partition hash value, we can automatically detect inconsistent partitions on every PME. Great idea; it seems this covers all broken sync cases. It will check alive nodes immediately in case the primary failed and will check a rejoining node
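
A hedged sketch of what the PME-time (Partition Map Exchange) check could look like: every owner reports its pre-calculated per-partition hashes and the coordinator flags partitions whose owners disagree. The types here are simplified placeholders, not the actual exchange protocol:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch: detect partitions whose owners report different hashes.
    public final class PmeConsistencyCheck {

        /** ownerHashes: for each owner node, a map of partitionId -> pre-calculated partition hash. */
        public static Set<Integer> inconsistentPartitions(List<Map<Integer, Long>> ownerHashes) {
            Map<Integer, Long> firstSeen = new HashMap<>();
            Set<Integer> broken = new HashSet<>();

            for (Map<Integer, Long> perOwner : ownerHashes) {
                for (Map.Entry<Integer, Long> e : perOwner.entrySet()) {
                    Long prev = firstSeen.putIfAbsent(e.getKey(), e.getValue());

                    if (prev != null && !prev.equals(e.getValue()))
                        broken.add(e.getKey()); // at least two owners disagree on this partition
                }
            }

            return broken;
        }
    }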

Re: "Idle verify" to "Online verify"

2019-04-29 Thread Ivan Rakov
Hi Anton, Thanks for sharing your ideas. I think your approach should work in general. I'll just share my concerns about possible issues that may come up. 1) Equality of update counters doesn't imply equality of partition contents under load. For every update, the primary node generates update
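
A contrived example of the concern above, under the assumption that a backup applied the same number of updates but with a diverged value: the counters match while the contents do not, which is why a content hash is needed in addition to the counter.

    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative demo only: equal update counters, different partition contents.
    public final class CounterVsContentDemo {
        public static void main(String[] args) {
            Map<Integer, String> primary = new TreeMap<>();
            Map<Integer, String> backup = new TreeMap<>();

            // Both sides apply two updates, so both update counters reach 2 ...
            primary.put(1, "A");
            primary.put(2, "B");
            backup.put(1, "A");
            backup.put(2, "B-stale"); // ... but one update was applied with a diverged value.

            long primaryCntr = 2, backupCntr = 2;

            System.out.println("counters equal: " + (primaryCntr == backupCntr));                  // true
            System.out.println("contents equal: " + (primary.hashCode() == backup.hashCode()));    // false
        }
    }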

"Idle verify" to "Online verify"

2019-04-29 Thread Anton Vinogradov
Igniters and especially Ivan Rakov, "Idle verify" [1] is a really cool tool for making sure that the cluster is consistent. 1) But it requires operations to be paused during the cluster check. On some clusters, this check takes hours (3-4 hours in the cases I saw). I've checked the code of "idle verify"
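
A conceptual sketch of what "idle verify" computes per partition: a hash folded over all (key, version) pairs, compared across owners afterwards. The types are simplified placeholders, not Ignite classes; the point is that the scan is only meaningful while no concurrent updates are running, which is exactly the limitation discussed here.

    import java.util.Map;

    // Illustrative sketch: order-independent hash over a partition's (key, version) pairs.
    public final class PartitionHashCalculator {

        /** entries: partition content as keyHash -> entry version (simplified). */
        public static long partitionHash(Map<Integer, Long> entries) {
            long hash = 0;

            for (Map.Entry<Integer, Long> e : entries.entrySet())
                hash ^= e.getKey() * 31L + e.getValue(); // fold each (key, version) pair

            return hash;
        }
    }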