Re: Looks like a bug in ServerImpl.joinTopology()

2018-05-04 Thread Anton Vinogradov
Alex,

I checked reproducer you presented, please fix it according to the
following rules

1) Never use System.out.println() as a part of reproducer, use assetrions
if necessary.
2) Reproducer should be small. As small as possible.
3) Try to make reproducer clear. As crear as possible.
In case following code can be simplified - it should be.
if (flag && latch.getCount() > 0) {
   fut.onDone(this);
3.1) Newer use variable's names like 'flag', this makes reviewers nervous.

пт, 27 апр. 2018 г. в 17:40, Александр Меньшиков :

> Yakov, thank you for the advice.
>
> The thread.sleep is not enough, but some latch + future give me a way to
> the
> reproducer.
>
> I have created PR [1] into my master, for showing a test and modification
> of
> ServerImpl which help me to slow down execution inside a danger section.
>
> A code of test a bit long, but basically it about two parts:
>
> In the first part, I randomly start and stop nodes to get a moment when
> a server is starting to execute the dangerous code which I described in the
> first message.
>
> In the second part, I'm waiting while the first part produces this
> situation
> and after that, I call public method of ServerImpl which fails with an
> exception:
>
> java.lang.AssertionError: Invalid node order: TcpDiscoveryNode
> [id=f6bf048d-378b-4960-94cb-84e3d332, addrs=[127.0.0.1], sockAddrs=[/
> 127.0.0.1:47502], discPort=47502, order=0, intOrder=2,
> lastExchangeTime=1524836605995, loc=false,
> ver=2.5.0#20180426-sha1:34e22396, isClient=false]
> at
>
> org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:52)
> at
>
> org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:49)
> at
> org.apache.ignite.internal.util.lang.GridFunc.isAll(GridFunc.java:2014)
> at
>
> org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9679)
> at
>
> org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9652)
> at
>
> org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.nodes(TcpDiscoveryNodesRing.java:590)
> at
>
> org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.visibleRemoteNodes(TcpDiscoveryNodesRing.java:164)
> at
>
> org.apache.ignite.spi.discovery.tcp.ServerImpl.getRemoteNodes(ServerImpl.java:304)
>
> As I told in the first message the problem arises because of the current
> code
> changes local node internal order and breaks sorting in
> TcpDiscoveryNodesRing.nodes collection.
>
> Is this reproducer convince enough?
>
> [1] Reproducer: https://github.com/SharplEr/ignite/pull/10/files
>
>
>
> 2018-02-13 20:17 GMT+03:00 Yakov Zhdanov :
>
> > Alex, you can alter ServerImpl and insert a latch or thread.sleep(xxx)
> > anywhere you like to show the incorrect behavior you describe.
> >
> > --Yakov
> >
>


Re: Looks like a bug in ServerImpl.joinTopology()

2018-04-28 Thread Александр Меньшиков
+ Anton Vinogradov

2018-04-27 17:40 GMT+03:00 Александр Меньшиков :

> Yakov, thank you for the advice.
>
> The thread.sleep is not enough, but some latch + future give me a way to
> the
> reproducer.
>
> I have created PR [1] into my master, for showing a test and modification
> of
> ServerImpl which help me to slow down execution inside a danger section.
>
> A code of test a bit long, but basically it about two parts:
>
> In the first part, I randomly start and stop nodes to get a moment when
> a server is starting to execute the dangerous code which I described in the
> first message.
>
> In the second part, I'm waiting while the first part produces this
> situation
> and after that, I call public method of ServerImpl which fails with an
> exception:
>
> java.lang.AssertionError: Invalid node order: TcpDiscoveryNode
> [id=f6bf048d-378b-4960-94cb-84e3d332, addrs=[127.0.0.1], sockAddrs=[/
> 127.0.0.1:47502], discPort=47502, order=0, intOrder=2, 
> lastExchangeTime=1524836605995,
> loc=false, ver=2.5.0#20180426-sha1:34e22396, isClient=false]
> at org.apache.ignite.spi.discovery.tcp.internal.
> TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:52)
> at org.apache.ignite.spi.discovery.tcp.internal.
> TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:49)
> at org.apache.ignite.internal.util.lang.GridFunc.isAll(
> GridFunc.java:2014)
> at org.apache.ignite.internal.util.IgniteUtils.arrayList(
> IgniteUtils.java:9679)
> at org.apache.ignite.internal.util.IgniteUtils.arrayList(
> IgniteUtils.java:9652)
> at org.apache.ignite.spi.discovery.tcp.internal.
> TcpDiscoveryNodesRing.nodes(TcpDiscoveryNodesRing.java:590)
> at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.
> visibleRemoteNodes(TcpDiscoveryNodesRing.java:164)
> at org.apache.ignite.spi.discovery.tcp.ServerImpl.
> getRemoteNodes(ServerImpl.java:304)
>
> As I told in the first message the problem arises because of the current
> code
> changes local node internal order and breaks sorting in
> TcpDiscoveryNodesRing.nodes collection.
>
> Is this reproducer convince enough?
>
> [1] Reproducer: https://github.com/SharplEr/ignite/pull/10/files
>
>
>
>
> 2018-02-13 20:17 GMT+03:00 Yakov Zhdanov :
>
>> Alex, you can alter ServerImpl and insert a latch or thread.sleep(xxx)
>> anywhere you like to show the incorrect behavior you describe.
>>
>> --Yakov
>>
>
>


Re: Looks like a bug in ServerImpl.joinTopology()

2018-04-27 Thread Александр Меньшиков
Yakov, thank you for the advice.

The thread.sleep is not enough, but some latch + future give me a way to the
reproducer.

I have created PR [1] into my master, for showing a test and modification of
ServerImpl which help me to slow down execution inside a danger section.

A code of test a bit long, but basically it about two parts:

In the first part, I randomly start and stop nodes to get a moment when
a server is starting to execute the dangerous code which I described in the
first message.

In the second part, I'm waiting while the first part produces this situation
and after that, I call public method of ServerImpl which fails with an
exception:

java.lang.AssertionError: Invalid node order: TcpDiscoveryNode
[id=f6bf048d-378b-4960-94cb-84e3d332, addrs=[127.0.0.1], sockAddrs=[/
127.0.0.1:47502], discPort=47502, order=0, intOrder=2,
lastExchangeTime=1524836605995, loc=false,
ver=2.5.0#20180426-sha1:34e22396, isClient=false]
at
org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:52)
at
org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:49)
at
org.apache.ignite.internal.util.lang.GridFunc.isAll(GridFunc.java:2014)
at
org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9679)
at
org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9652)
at
org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.nodes(TcpDiscoveryNodesRing.java:590)
at
org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.visibleRemoteNodes(TcpDiscoveryNodesRing.java:164)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl.getRemoteNodes(ServerImpl.java:304)

As I told in the first message the problem arises because of the current
code
changes local node internal order and breaks sorting in
TcpDiscoveryNodesRing.nodes collection.

Is this reproducer convince enough?

[1] Reproducer: https://github.com/SharplEr/ignite/pull/10/files



2018-02-13 20:17 GMT+03:00 Yakov Zhdanov :

> Alex, you can alter ServerImpl and insert a latch or thread.sleep(xxx)
> anywhere you like to show the incorrect behavior you describe.
>
> --Yakov
>


Re: Looks like a bug in ServerImpl.joinTopology()

2018-02-13 Thread Yakov Zhdanov
Alex, you can alter ServerImpl and insert a latch or thread.sleep(xxx)
anywhere you like to show the incorrect behavior you describe.

--Yakov


Looks like a bug in ServerImpl.joinTopology()

2018-02-13 Thread Александр Меньшиков
Hello.

I saw such code in `ServerImpl.joinTopology()`


locNode.order(1);

locNode.internalOrder(1);

spi.gridStartTime = U.currentTimeMillis();

locNode.visible(true);

ring.clear();

ring.topologyVersion(1);



And it looks like a bug because the `locNode` is contained inside the
`ring` (`TcpDiscoveryNodesRing.locNode` which also be inside a `
TcpDiscoveryNodesRing.nodes` collection) and every operation with the `
TcpDiscoveryNodesRing.nodes` is executed under a read-write lock. And not
without a reason. `locNode.order` used inside the `
TcpDiscoveryNodesRing.nodes` for sorting (it's TreeSet) and such violation
of thread safety can destroy collection navigation.

The `TcpDiscoveryNode.internalOrder` is volatile and `ring.clear()` line
resets the`TcpDiscoveryNodesRing.nodes` collection, so that issue is
hidden. But if another thread would execute finding operation on the
collection after `locNode.internalOrder(1)`, but before `ring.clear()` the
issue will appear.

But it's hard to create fair reproducer for this situation.

Am I right about that and should create an issue in Jira or I just miss
something?