We think the issue may concern the transportability of the hashCode across nodes, because the hashCode in question included the hash code of a class (in other words, this.getClass().hashCode() as opposed to the more robust this.getClass().getName().hashCode()).
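To make the difference concrete, here is a minimal, hypothetical sketch (the EventKey class and its fields are invented for illustration, not our actual key): getClass().hashCode() falls back to the identity hash of the Class object, which generally differs from one JVM to another, while getClass().getName().hashCode() is the deterministic String hash and is the same on every node. The main method is a crude way to check the cluster-wide consistency I ask about below, by running it on two nodes and comparing the output.

import java.io.Serializable;
import java.util.Objects;

// Hypothetical cache key, purely for illustration of the two hashCode variants.
public class EventKey implements Serializable {

    private final String type;
    private final long id;

    public EventKey(String type, long id) {
        this.type = type;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o)
            return true;
        if (o == null || getClass() != o.getClass())
            return false;
        EventKey other = (EventKey) o;
        return id == other.id && Objects.equals(type, other.type);
    }

    // JVM-dependent variant: Class does not override hashCode(), so this is the
    // identity hash of the Class object and two JVMs (i.e. two cluster nodes)
    // will generally compute different values for the same logical key.
    public int jvmDependentHashCode() {
        return 31 * getClass().hashCode() + Objects.hash(type, id);
    }

    // Stable variant: String.hashCode() is defined by a fixed formula over the
    // characters, so the class-name component is identical on every node.
    @Override
    public int hashCode() {
        return 31 * getClass().getName().hashCode() + Objects.hash(type, id);
    }

    // Crude consistency check: run this on two different JVMs/nodes and compare.
    // The name-based value should match; the identity-based one usually will not.
    public static void main(String[] args) {
        EventKey key = new EventKey("order", 42L);
        System.out.println("identity-based : " + key.jvmDependentHashCode());
        System.out.println("name-based     : " + key.hashCode());
    }
}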
Does Ignite require the hashCode of a key to be cluster-wide consistent? (This would actually be a violation of the Javadoc contract for hashCode, which states "This integer need not remain consistent from one execution of an application to another execution of the same application." But it should be possible to actually test for this if it is a constraint required by Ignite.) If this does not appear to be the problem, I can supply the code in question.

Kristian

2016-06-23 10:05 GMT+02:00 Denis Magda <[email protected]>:
> Hi Kristian,
>
> Could you share the source of a class that has inconsistent equals/hashCode
> implementation? Probably we will be able to detect your case internally
> somehow and print a warning.
>
> —
> Denis
>
>> On Jun 17, 2016, at 10:27 PM, Kristian Rosenvold <[email protected]>
>> wrote:
>>
>> This whole issue was caused by inconsistent equals/hashCode on a cache
>> key, which apparently has the capability of stopping replication dead
>> in its tracks. Nailing this one after 3-4 days of a very nagging
>> "select is broken" feeling was great. You guys helping us here might
>> want to be particularly aware of this, since it undeniably gives a newbie
>> the impression that Ignite is broken when it is actually my code :)
>>
>> Thanks for the help!
>>
>> Kristian
>>
>>
>> 2016-06-17 20:00 GMT+02:00 Alexey Goncharuk <[email protected]>:
>>> Kristian,
>>>
>>> Are you sure you are using the latest 1.7-SNAPSHOT for your production data?
>>> Did you build the binaries yourself? Can you confirm the commit# of the
>>> binaries you are using? The issue you are reporting seems to be the same as
>>> IGNITE-3305 and, since the fix was committed only a couple of days ago, it
>>> might not have made it into the nightly snapshot.
>>>
>>> 2016-06-17 9:06 GMT-07:00 Kristian Rosenvold <[email protected]>:
>>>>
>>>> Sigh, this has all the hallmarks of a thread-safety issue or race
>>>> condition.
>>>>
>>>> I had a perfect testcase that replicated the problem 100% of the time,
>>>> but only when running on distinct nodes (it never occurs on the same box),
>>>> with 2 distinct caches and with Ignite 1.5; I just expanded the
>>>> testcase I posted initially. Typically I'd be missing the last 10-20
>>>> elements in the cache. I was about 2 seconds from reporting an issue,
>>>> and then I switched to yesterday's 1.7-SNAPSHOT version and it went
>>>> away. Unfortunately 1.7-SNAPSHOT exhibits the same behaviour with my
>>>> production data; it just broke my testcase :( Presumably I just need to
>>>> tweak the cache sizes or element counts to hit some kind of non-sweet
>>>> spot, and then it probably fails on my machine too.
>>>>
>>>> The testcase always worked on a single box, which led me to think
>>>> about socket-related issues. But it also required 2 caches to fail,
>>>> which led me to think about race conditions like the rebalance
>>>> terminating once the first node finishes.
>>>>
>>>> I'm no stranger to reading bug reports like this myself, and I must
>>>> admit this seems pretty tough to diagnose.
>>>>
>>>> Kristian
>>>>
>>>>
>>>> 2016-06-17 14:57 GMT+02:00 Denis Magda <[email protected]>:
>>>>> Hi Kristian,
>>>>>
>>>>> Your test looks absolutely correct to me. However, I didn't manage to
>>>>> reproduce this issue on my side either.
>>>>>
>>>>> Alex G., do you have any ideas on what could be the reason for that? Can
>>>>> you recommend that Kristian enable DEBUG/TRACE log levels for particular
>>>>> modules? Probably advanced logging will let us pinpoint the issue that
>>>>> happens in Kristian's environment.
>>>>>
>>>>> —
>>>>> Denis
>>>>>
>>>>> On Jun 17, 2016, at 10:02 AM, Kristian Rosenvold <[email protected]>
>>>>> wrote:
>>>>>
>>>>> For Ignite 1.5, 1.6 and 1.7-SNAPSHOT, I see the same behaviour. Since
>>>>> REPLICATED caches seem to be broken on 1.6 and beyond, I am testing
>>>>> this on 1.5:
>>>>>
>>>>> I can reliably start two nodes and get consistent, correct results;
>>>>> let's say each node has 1.5 million elements in a given cache.
>>>>>
>>>>> Once I start a third or fourth node in the same cluster, it
>>>>> consistently gets a random, incorrect number of elements in the same
>>>>> cache, typically 1.1 million or so.
>>>>>
>>>>> I tried to create a testcase to reproduce this on my local machine
>>>>> (https://github.com/krosenvold/ignite/commit/4fb3f20f51280d8381e331b7bcdb2bae95b76b95),
>>>>> but it fails to reproduce the problem.
>>>>>
>>>>> I have two nodes in 2 different datacenters, so there will invariably
>>>>> be some differences in latencies/response times between the existing 2
>>>>> nodes and the newly started node.
>>>>>
>>>>> This sounds like some kind of timing-related bug, any tips? Is there
>>>>> any way I can skew the timing in the testcase?
>>>>>
>>>>>
>>>>> Kristian
>>>>>
>>>>>
>>>
>>>
>
