I assume the use case of passing the hashCode is to be able to put the object directly into a hashmap bucket without constructing its state.

Would it be realistic to do something like this: the first time *ever* an object arrives for a given cache (or when the first object arrives while the cache is still empty), reconstruct the object and ask for its hashCode. If it mismatches the one transmitted over the wire, complain violently. A non-portable hashCode would in most cases be revealed immediately.
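Roughly this shape, in pseudo-Java (the class and method names are made up; this is not a claim about how Ignite's internals actually look, and I'm assuming the receiving node sees both the serialized key and the sender's hashCode):

import java.util.concurrent.atomic.AtomicBoolean;

// Rough sketch only -- NOT actual Ignite internals, just the shape of the check.
public class FirstKeyHashCodeCheck {

    // Pay the deserialization cost only once per cache.
    private final AtomicBoolean verified = new AtomicBoolean();

    public void verifyFirstKey(int wireHashCode, Object deserializedKey) {
        if (verified.compareAndSet(false, true)) {
            int localHashCode = deserializedKey.hashCode();

            if (localHashCode != wireHashCode)
                // "Complain violently": a non-portable hashCode is almost
                // always caught on the very first key that arrives.
                throw new IllegalStateException("hashCode of key " + deserializedKey
                    + " differs between nodes: sender=" + wireHashCode
                    + ", receiver=" + localHashCode
                    + "; cache keys must have a cluster-wide consistent hashCode");
        }
    }
}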
Kristian

2016-06-23 11:20 GMT+02:00 Denis Magda <[email protected]>:
> It seems that this.getClass().hashCode() executed on different VMs can
> produce different results (but it will always produce the same result on a
> single VM, which doesn't violate the JVM specification). Ignite requires
> the hashCode of a key to be consistent cluster-wide, so Ignite has an even
> stronger requirement than the JVM spec.
>
> —
> Denis
>
>> On Jun 23, 2016, at 11:30 AM, Kristian Rosenvold <[email protected]>
>> wrote:
>>
>> We think the issue may concern the portability of the hashCode across
>> nodes, because the hashCode in question included the hashCode of a
>> class (in other words this.getClass().hashCode() as opposed to the
>> more robust this.getClass().getName().hashCode()).
>>
>> Does Ignite require the hashCode of a key to be cluster-wide consistent?
>>
>> (This would actually be a violation of the javadoc contract for
>> hashCode, which states "This integer need not remain consistent from
>> one execution of an application to another execution of the same
>> application." But it should be possible to actually test for this if it
>> is a constraint required by Ignite.)
>>
>> If this does not appear to be the problem, I can supply the code in question.
>>
>> Kristian
>>
>>
>>
>> 2016-06-23 10:05 GMT+02:00 Denis Magda <[email protected]>:
>>> Hi Kristian,
>>>
>>> Could you share the source of a class that has an inconsistent
>>> equals/hashCode implementation? Perhaps we will be able to detect your
>>> case internally somehow and print a warning.
>>>
>>> —
>>> Denis
>>>
>>>> On Jun 17, 2016, at 10:27 PM, Kristian Rosenvold <[email protected]>
>>>> wrote:
>>>>
>>>> This whole issue was caused by an inconsistent equals/hashCode on a
>>>> cache key, which apparently has the capability of stopping replication
>>>> dead in its tracks. Nailing this one after 3-4 days of a very nagging
>>>> "select is broken" feeling was great. You guys helping us here might
>>>> want to be particularly aware of this, since it undeniably gives a
>>>> newbie the impression that Ignite is broken while it's really my code :)
>>>>
>>>> Thanks for the help!
>>>>
>>>> Kristian
>>>>
>>>>
>>>> 2016-06-17 20:00 GMT+02:00 Alexey Goncharuk <[email protected]>:
>>>>> Kristian,
>>>>>
>>>>> Are you sure you are using the latest 1.7-SNAPSHOT for your production
>>>>> data? Did you build the binaries yourself? Can you confirm the commit#
>>>>> of the binaries you are using? The issue you are reporting seems to be
>>>>> the same as IGNITE-3305 and, since the fix was committed only a couple
>>>>> of days ago, it might not have made it into the nightly snapshot.
>>>>>
>>>>> 2016-06-17 9:06 GMT-07:00 Kristian Rosenvold <[email protected]>:
>>>>>>
>>>>>> Sigh, this has all the hallmarks of a thread-safety issue or race
>>>>>> condition.
>>>>>>
>>>>>> I had a perfect testcase that replicated the problem 100% of the time,
>>>>>> but only when running on distinct nodes (it never occurs on the same
>>>>>> box), with 2 distinct caches and with Ignite 1.5; I just expanded the
>>>>>> testcase I posted initially. Typically I'd be missing the last 10-20
>>>>>> elements in the cache. I was about 2 seconds from reporting an issue,
>>>>>> and then I switched to yesterday's 1.7-SNAPSHOT version and it went
>>>>>> away. Unfortunately 1.7-SNAPSHOT exhibits the same behaviour with my
>>>>>> production data; it just broke my testcase :( Presumably I just need
>>>>>> to tweak the cache sizes or element counts to hit some kind of
>>>>>> non-sweet spot, and then it probably fails on my machine too.
>>>>>>
>>>>>> The testcase always worked on a single box, which led me to think
>>>>>> about socket-related issues. But it also required 2 caches to fail,
>>>>>> which led me to think about race conditions such as the rebalance
>>>>>> terminating once the first node finishes.
>>>>>>
>>>>>> I'm no stranger to reading bug reports like this myself, and I must
>>>>>> admit this one seems pretty tough to diagnose.
>>>>>>
>>>>>> Kristian
>>>>>>
>>>>>>
>>>>>> 2016-06-17 14:57 GMT+02:00 Denis Magda <[email protected]>:
>>>>>>> Hi Kristian,
>>>>>>>
>>>>>>> Your test looks absolutely correct to me. However, I didn't manage
>>>>>>> to reproduce this issue on my side either.
>>>>>>>
>>>>>>> Alex G., do you have any ideas on what the reason could be? Can you
>>>>>>> recommend that Kristian enable DEBUG/TRACE log levels for particular
>>>>>>> modules? Perhaps advanced logging will let us pinpoint the issue
>>>>>>> that happens in Kristian's environment.
>>>>>>>
>>>>>>> —
>>>>>>> Denis
>>>>>>>
>>>>>>> On Jun 17, 2016, at 10:02 AM, Kristian Rosenvold <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> For Ignite 1.5, 1.6 and 1.7-SNAPSHOT, I see the same behaviour. Since
>>>>>>> REPLICATED caches seem to be broken on 1.6 and beyond, I am testing
>>>>>>> this on 1.5:
>>>>>>>
>>>>>>> I can reliably start two nodes and get consistent, correct results;
>>>>>>> let's say each node has 1.5 million elements in a given cache.
>>>>>>>
>>>>>>> Once I start a third or fourth node in the same cluster, it
>>>>>>> consistently gets a random, incorrect number of elements in the same
>>>>>>> cache, typically 1.1 million or so.
>>>>>>>
>>>>>>> I tried to create a testcase to reproduce this on my local machine
>>>>>>> (https://github.com/krosenvold/ignite/commit/4fb3f20f51280d8381e331b7bcdb2bae95b76b95),
>>>>>>> but it fails to reproduce the problem.
>>>>>>>
>>>>>>> I have two nodes in 2 different datacenters, so there will invariably
>>>>>>> be some differences in latencies/response times between the existing
>>>>>>> 2 nodes and the newly started node.
>>>>>>>
>>>>>>> This sounds like some kind of timing-related bug, any tips? Is there
>>>>>>> any way I can skew the timing in the testcase?
>>>>>>>
>>>>>>>
>>>>>>> Kristian
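PS: in case it helps anyone else hitting this, the key pattern discussed above looks roughly like the following. The class and field are invented for illustration; it is not the actual key from our code:

import java.util.Objects;

// Illustrative only -- not the real class from this thread.
public class CacheKey {

    private final String id;

    public CacheKey(String id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CacheKey && Objects.equals(id, ((CacheKey) o).id);
    }

    @Override
    public int hashCode() {
        // Broken for a distributed cache: Class.hashCode() is the default
        // identity hash, so it can differ between JVMs for the same class:
        //   return 31 * getClass().hashCode() + id.hashCode();

        // Portable: the class *name* hashes to the same value on every JVM.
        return 31 * getClass().getName().hashCode() + id.hashCode();
    }
}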
