Hi Kristian,

Could you share the source of the class that has the inconsistent equals/hashCode implementation? Perhaps we will be able to detect this case internally somehow and print a warning.

—
Denis
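(For readers landing on this thread later: a minimal sketch of the kind of broken key being discussed, using a made-up class rather than Kristian's actual code, could look like the following. The two methods disagree on which fields identify the key, so two keys that are equal per equals() can produce different hash codes, and hash-based structures such as cache partitioning can no longer find or reconcile the entries.)

import java.util.Objects;

// Hypothetical example, not Kristian's class: equals() says two keys with
// the same id are equal, but hashCode() also mixes in 'version', so "equal"
// keys can produce different hash codes. In a distributed cache this means
// lookups, rebalancing and replication can silently miss entries, because
// the same logical key may map to different hash buckets on different nodes.
public final class BrokenCacheKey {
    private final String id;
    private final int version;

    public BrokenCacheKey(String id, int version) {
        this.id = id;
        this.version = version;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o)
            return true;
        if (!(o instanceof BrokenCacheKey))
            return false;
        return id.equals(((BrokenCacheKey) o).id);   // 'version' is ignored here...
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, version);            // ...but included here: contract broken.
    }
}

A consistent version simply uses the same field set in both methods, for example equals() and hashCode() both over (id, version), or both over id alone.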
> On Jun 17, 2016, at 10:27 PM, Kristian Rosenvold <[email protected]> wrote:
>
> This whole issue was caused by an inconsistent equals/hashCode on a cache key, which apparently has the capability of stopping replication dead in its tracks. Nailing this one after 3-4 days of a very nagging "select is broken" feeling was great. You guys helping us here might want to be particularly aware of this, since it undeniably gives a newbie the impression that Ignite is broken when it's actually my code :)
>
> Thanks for the help!
>
> Kristian
>
>
> 2016-06-17 20:00 GMT+02:00 Alexey Goncharuk <[email protected]>:
>> Kristian,
>>
>> Are you sure you are using the latest 1.7-SNAPSHOT for your production data? Did you build the binaries yourself? Can you confirm the commit# of the binaries you are using? The issue you are reporting seems to be the same as IGNITE-3305 and, since the fix was committed only a couple of days ago, it might not have made it into the nightly snapshot yet.
>>
>> 2016-06-17 9:06 GMT-07:00 Kristian Rosenvold <[email protected]>:
>>>
>>> Sigh, this has all the hallmarks of a thread safety issue or race condition.
>>>
>>> I had a perfect testcase that replicated the problem 100% of the time, but only when running on distinct nodes (it never occurs on the same box), with 2 distinct caches and with Ignite 1.5; I just expanded the testcase I posted initially. Typically I'd be missing the last 10-20 elements in the cache. I was about 2 seconds from reporting an issue, and then I switched to yesterday's 1.7-SNAPSHOT version and it went away. Unfortunately, 1.7-SNAPSHOT exhibits the same behaviour with my production data; it just broke my testcase :( Presumably I just need to tweak the cache sizes or element counts to hit some kind of non-sweet spot, and then it probably fails on my machine too.
>>>
>>> The testcase always worked on a single box, which led me to think about socket-related issues. But it also required 2 caches to fail, which led me to think about race conditions like the rebalance terminating once the first node finishes.
>>>
>>> I'm no stranger to reading bug reports like this myself, and I must admit this seems pretty tough to diagnose.
>>>
>>> Kristian
>>>
>>>
>>> 2016-06-17 14:57 GMT+02:00 Denis Magda <[email protected]>:
>>>> Hi Kristian,
>>>>
>>>> Your test looks absolutely correct to me. However, I didn't manage to reproduce this issue on my side either.
>>>>
>>>> Alex G., do you have any ideas on what could be the reason for that? Can you recommend which DEBUG/TRACE log levels Kristian should enable for particular modules? Advanced logging will probably let us pinpoint the issue that happens in Kristian's environment.
>>>>
>>>> —
>>>> Denis
>>>>
>>>> On Jun 17, 2016, at 10:02 AM, Kristian Rosenvold <[email protected]> wrote:
>>>>
>>>> For Ignite 1.5, 1.6 and 1.7-SNAPSHOT, I see the same behaviour. Since REPLICATED caches seem to be broken on 1.6 and beyond, I am testing this on 1.5:
>>>>
>>>> I can reliably start two nodes and get consistent, correct results; let's say each node has 1.5 million elements in a given cache.
>>>>
>>>> Once I start a third or fourth node in the same cluster, it consistently gets a random, incorrect number of elements in the same cache, typically 1.1 million or so.
>>>>
>>>> I tried to create a testcase to reproduce this on my local machine
>>>> (https://github.com/krosenvold/ignite/commit/4fb3f20f51280d8381e331b7bcdb2bae95b76b95),
>>>> but it fails to reproduce the problem.
>>>>
>>>> I have two nodes in 2 different datacenters, so there will invariably be some differences in latencies/response times between the existing 2 nodes and the newly started node.
>>>>
>>>> This sounds like some kind of timing-related bug, any tips? Is there any way I can skew the timing in the testcase?
>>>>
>>>>
>>>> Kristian
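(Not the linked testcase, but a rough sketch of the kind of check being described, with a made-up cache name and counts. As Kristian notes, the problem only shows up across separate machines, so a single-JVM run like this may well pass; it is only meant to show the shape of the verification.)

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CachePeekMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

// Sketch: start a few nodes, fill a REPLICATED cache from one of them, then
// compare every node's local entry count against the expected total.
public class ReplicatedCountCheck {
    public static void main(String[] args) {
        int nodeCnt = 3;
        int entries = 100_000;

        Ignite[] nodes = new Ignite[nodeCnt];

        for (int i = 0; i < nodeCnt; i++) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setGridName("node-" + i); // distinct names allow several nodes in one JVM
            nodes[i] = Ignition.start(cfg);
        }

        CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("testCache");
        ccfg.setCacheMode(CacheMode.REPLICATED);

        IgniteCache<Integer, String> cache = nodes[0].getOrCreateCache(ccfg);

        for (int i = 0; i < entries; i++)
            cache.put(i, "value-" + i);

        // Every node of a REPLICATED cache should end up holding all entries
        // locally; a node reporting fewer is the symptom described above
        // (e.g. 1.1 million out of 1.5 million).
        for (int i = 0; i < nodeCnt; i++) {
            int localSize = nodes[i].cache("testCache").localSize(CachePeekMode.ALL);
            System.out.println("node-" + i + " local size: " + localSize + " (expected " + entries + ")");
        }

        for (Ignite node : nodes)
            node.close();
    }
}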

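(On Denis's suggestion of DEBUG/TRACE logging: one possible way to wire that up, assuming the ignite-log4j module is on the classpath. The config file name and the category mentioned in the comment are placeholders, not an official recommendation.)

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.logger.log4j.Log4JLogger;

// Start a node with a dedicated log4j configuration. The referenced XML file
// would set DEBUG (or TRACE) on categories such as
// org.apache.ignite.internal.processors.cache to capture rebalancing and
// replication details while reproducing the lost-entries scenario.
public class DebugLoggingNode {
    public static void main(String[] args) throws Exception {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setGridLogger(new Log4JLogger("config/ignite-log4j-debug.xml"));

        try (Ignite ignite = Ignition.start(cfg)) {
            // ... run the scenario that loses entries ...
        }
    }
}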