Hi Kristian,

Could you share the source of the class that has the inconsistent equals/hashCode 
implementation? We may be able to detect cases like yours internally 
and print a warning.
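
For reference, a key of this kind could look roughly like the sketch below. It is a hypothetical illustration, not Kristian's actual class: the names KeyContractDemo, BrokenKey, FixedKey and isConsistent are made up, and a plain java.util.HashMap stands in for a cache only to show the contract violation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyContractDemo {
    /** Hypothetical broken key: equals() ignores 'version', but hashCode() includes it. */
    static final class BrokenKey {
        final long id;
        final int version;

        BrokenKey(long id, int version) {
            this.id = id;
            this.version = version;
        }

        @Override public boolean equals(Object o) {
            return o instanceof BrokenKey && ((BrokenKey) o).id == id;
        }

        @Override public int hashCode() {
            // Inconsistent: hashes a field that equals() does not compare.
            return Objects.hash(id, version);
        }
    }

    /** Fixed variant: equals() and hashCode() are derived from the same field. */
    static final class FixedKey {
        final long id;
        final int version;

        FixedKey(long id, int version) {
            this.id = id;
            this.version = version;
        }

        @Override public boolean equals(Object o) {
            return o instanceof FixedKey && ((FixedKey) o).id == id;
        }

        @Override public int hashCode() {
            return Long.hashCode(id);
        }
    }

    /** Contract check for one pair: equal objects must report equal hash codes. */
    static boolean isConsistent(Object a, Object b) {
        return !a.equals(b) || a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        Map<Object, String> map = new HashMap<>();
        map.put(new BrokenKey(1L, 1), "broken");
        map.put(new FixedKey(1L, 1), "fixed");

        // The lookup keys are equal to the stored keys according to equals().
        System.out.println(map.get(new BrokenKey(1L, 2))); // null: hash codes differ
        System.out.println(map.get(new FixedKey(1L, 2)));  // fixed

        System.out.println(isConsistent(new BrokenKey(1L, 1), new BrokenKey(1L, 2))); // false
        System.out.println(isConsistent(new FixedKey(1L, 1), new FixedKey(1L, 2)));   // true
    }
}
```

A pairwise check like isConsistent is also the sort of test a warning could be based on, assuming two keys that are expected to be equal are available for comparison.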

—
Denis

> On Jun 17, 2016, at 10:27 PM, Kristian Rosenvold <[email protected]> 
> wrote:
> 
> This whole issue was caused by inconsistent equals/hashCode on a cache
> key, which apparently has the capability of stopping replication dead
> in its tracks. Nailing this one after 3-4 days of a very nagging
> "select is broken" feeling was great. You guys helping us here might
> want to be particularly aware of this, since it undeniably gives a newbie an
> impression that ignite is broken while it's my code :)
> 
> Thanks for the help!
> 
> Kristian
> 
> 
> 2016-06-17 20:00 GMT+02:00 Alexey Goncharuk <[email protected]>:
>> Kristian,
>> 
>> Are you sure you are using the latest 1.7-SNAPSHOT for your production data?
>> Did you build binaries yourself? Can you confirm the commit# of the binaries
>> you are using? The issue you are reporting seems to be the same as
>> IGNITE-3305 and, since the fix was committed only a couple of days ago, it
>> might not have made it into the nightly snapshot yet.
>> 
>> 2016-06-17 9:06 GMT-07:00 Kristian Rosenvold <[email protected]>:
>>> 
>>> Sigh, this has all the hallmarks of a thread safety issue or race
>>> condition.
>>> 
>>> I had a perfect testcase that replicated the problem 100% of the time,
>>> but only when running on distinct nodes (never occurs on same box)
>>> with 2 distinct caches and with ignite 1.5; I just expanded the
>>> testcase I posted initially. Typically I'd be missing the last 10-20
>>> elements in the cache. I was about 2 seconds from reporting an issue
>>> and then I switched to yesterday's 1.7-SNAPSHOT version and it went
>>> away. Unfortunately 1.7-SNAPSHOT exhibits the same behaviour with my
>>> production data, it just broke my testcase :( Presumably I just need to
>>> tweak the cache sizes or element counts to hit some kind of non-sweet
>>> spot, and then it probably fails on my machine.
>>> 
>>> The testcase always worked on a single box, which led me to think
>>> about socket-related issues. But it also required 2 caches to fail,
>>> which led me to think about race conditions like the rebalance
>>> terminating once the first node finishes.
>>> 
>>> I'm no stranger to reading bug reports like this myself, and I must
>>> admit this seems pretty tough to diagnose.
>>> 
>>> Kristian
>>> 
>>> 
>>> 2016-06-17 14:57 GMT+02:00 Denis Magda <[email protected]>:
>>>> Hi Kristian,
>>>> 
>>>> Your test looks absolutely correct to me. However, I didn’t manage to
>>>> reproduce this issue on my side either.
>>>> 
>>>> Alex G., do you have any ideas on what the reason could be? Can you
>>>> recommend that Kristian enable DEBUG/TRACE log levels for particular
>>>> modules? More detailed logging will probably let us pinpoint the issue
>>>> that happens in Kristian’s environment.
>>>> 
>>>> —
>>>> Denis
>>>> 
>>>> On Jun 17, 2016, at 10:02 AM, Kristian Rosenvold <[email protected]>
>>>> wrote:
>>>> 
>>>> For ignite 1.5, 1.6 and 1.7-SNAPSHOT, I see the same behaviour. Since
>>>> REPLICATED caches seem to be broken on 1.6 and beyond, I am testing
>>>> this on 1.5:
>>>> 
>>>> I can reliably start two nodes and get consistent correct results,
>>>> let's say each node has 1.5 million elements in a given cache.
>>>> 
>>>> Once I start a third or fourth node in the same cluster, it
>>>> consistently gets a random incorrect number of elements in the same
>>>> cache, typically 1.1 million or so.
>>>> 
>>>> I tried to create a testcase to reproduce this on my local machine
>>>> (https://github.com/krosenvold/ignite/commit/4fb3f20f51280d8381e331b7bcdb2bae95b76b95),
>>>> but this fails to reproduce the problem.
>>>> 
>>>> I have two nodes in 2 different datacenters, so there will invariably
>>>> be some differences in latencies/response times between the existing 2
>>>> nodes and the newly started node.
>>>> 
>>>> This sounds like some kind of timing-related bug. Any tips? Is there
>>>> any way I can skew the timing in the testcase?
>>>> 
>>>> 
>>>> Kristian
>>>> 
>>>> 
>> 
>> 
