Sigh, this has all the hallmarks of a thread safety issue or race condition.
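I'll also try the DEBUG logging Denis suggested below. A log4j fragment along these lines should do it; note that the package name is my guess at where the rebalance/preloader internals live, so please correct me if a different module is the right one:

```xml
<!-- Assumed package for the rebalance/preloader internals; verify against your Ignite version -->
<category name="org.apache.ignite.internal.processors.cache.distributed.dht.preloader">
    <level value="DEBUG"/>
</category>
```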
I had a perfect testcase that replicated the problem 100% of the time, but only when running on distinct nodes (it never occurs on the same box), with 2 distinct caches, and with Ignite 1.5; I just expanded the testcase I posted initially. Typically I'd be missing the last 10-20 elements in the cache. I was about 2 seconds from reporting an issue, and then I switched to yesterday's 1.7-SNAPSHOT version and it went away. Unfortunately 1.7-SNAPSHOT exhibits the same behaviour with my production data; it just broke my testcase :( Presumably I just need to tweak the cache sizes or element counts to hit some kind of non-sweet spot, and then it will probably fail on my machine too.

The testcase always worked on a single box, which led me to think about socket-related issues. But it also required 2 caches to fail, which led me to think about race conditions, like the rebalance terminating once the first node finishes.

I'm no stranger to reading bug reports like this myself, and I must admit this one seems pretty tough to diagnose.

Kristian

2016-06-17 14:57 GMT+02:00 Denis Magda <[email protected]>:

> Hi Kristian,
>
> Your test looks absolutely correct to me. However, I didn't manage to
> reproduce this issue on my side either.
>
> Alex G., do you have any ideas on what could be the reason for that? Can
> you recommend that Kristian enable DEBUG/TRACE log levels for particular
> modules? Perhaps advanced logging will let us pinpoint the issue that
> happens in Kristian's environment.
>
> —
> Denis
>
> On Jun 17, 2016, at 10:02 AM, Kristian Rosenvold <[email protected]>
> wrote:
>
> For Ignite 1.5, 1.6 and 1.7-SNAPSHOT, I see the same behaviour. Since
> REPLICATED caches seem to be broken on 1.6 and beyond, I am testing
> this on 1.5:
>
> I can reliably start two nodes and get consistent, correct results;
> let's say each node has 1.5 million elements in a given cache.
>
> Once I start a third or fourth node in the same cluster, it
> consistently gets a random, incorrect number of elements in the same
> cache, typically 1.1 million or so.
>
> I tried to create a testcase to reproduce this on my local machine
> (https://github.com/krosenvold/ignite/commit/4fb3f20f51280d8381e331b7bcdb2bae95b76b95),
> but it fails to reproduce the problem.
>
> I have two nodes in 2 different datacenters, so there will invariably
> be some differences in latencies/response times between the existing 2
> nodes and the newly started node.
>
> This sounds like some kind of timing-related bug; any tips? Is there
> any way I can skew the timing in the testcase?
>
> Kristian
