Hi Joe,
It's nice to hear from you. Please see below.
On 11/3/2015 3:48 PM, [email protected] wrote:
Sorry for the delayed response. Thanks for opening the JIRA bug; I had
also noticed there is another one being actively worked on about
rebalancing being slow.
1) Yep, before dropping the port range it took several minutes before
everyone joined the topology. Remember, I can't use multicast, so I have
a single IP configured that every node has to talk to for discovery.
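In Java terms, the discovery part of my setup is roughly equivalent to
this sketch (the address and port range below are placeholders, not my
real values):

import java.util.Arrays;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class DiscoverySketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Static IP finder instead of multicast: every node contacts this one address.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Arrays.asList("10.0.0.1:47500..47509"));

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);

        cfg.setDiscoverySpi(discoSpi);
        Ignition.start(cfg);
    }
}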
1a) The underlying network is FDR InfiniBand. All throughput and
latency numbers are as expected with both IB-based benchmarks. I've
also run sockperf between nodes to get socket/IP performance and it
was as expected (it takes a pretty big hit in both throughput and
latency, but that is normal with the IP stack). I don't have the
numbers handy, but I believe sockperf showed about 2.2 GBytes/s
throughput for any single point-to-point connection.
1b) The cluster has a shared login node and the filesystem is shared;
otherwise the individual nodes that I am launching ignite.sh on are
exclusively mine, their own physical entities, and not being used for
anything else. I'm not taking all the cluster nodes, so there are
other people running on other nodes accessing both the IB network and
the shared filesystem (but not my Ignite installation directory, so not
the same files).
*Ivan*, don't we have any known IGFS-related issues when a shared
filesystem is used by the nodes?
2) lol, yeah, that is what I was trying to do when I started the
thread. I'll go back and start that process again.
Before trying to play with every parameter, try increasing just that one:
TcpCommunicationSpi.socketWriteTimeout. In your case it was initialized
with the default value (5 seconds).
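For instance, if you configure nodes programmatically, a minimal sketch
would look like this (the 15-second value simply mirrors the
failureDetectionTimeout mentioned below; treat it as a starting point,
not a tested recommendation):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class NodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Raise the socket write timeout from the 5 s default to 15 s.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setSocketWriteTimeout(15_000); // milliseconds

        cfg.setCommunicationSpi(commSpi);
        Ignition.start(cfg);
    }
}

The same property can be set in the Spring XML configuration if that is
what you pass to ignite.sh.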
3) Every now and then I have an Ignite process that doesn't shut down
with my pssh kill command and requires a kill -9. I try to check every
node to make sure all the Java processes have terminated (pssh ps -eaf
| grep java), but I could have missed one. I'll try to keep an eye out
for those messages as well. I've also had issues where I've stopped
and restarted the nodes too quickly and the port wasn't released yet.
I would recommend using the 'jps' tool to get a list of all running
Java processes, because sometimes a process shows up under a name
other than 'java':
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jps.html
4) Over the weekend I had a successful 64-node run, and when it came
up I didn't see any "Retry partition exchange" messages. I let it sit
for a couple of hours and everything stayed up and happy. I then started
running the pi estimator with an increasing number of mappers. I think it
was when I was doing 10000 mappers that it got about 71% through and then
stopped making progress, although I kept seeing the Ignite messages for
inter-node communication. When I noticed it was "stuck", there was
an NIO exception in the logs. I haven't looked at the logs in detail
yet, but the topology seemed intact and everything was up and running
for well over 12 hours.
Could you share the example's source code with us? Perhaps we will notice
something strange.
In addition, next time your nodes get stuck, please take thread
dumps and heap dumps (for instance, with the standard 'jstack' and
'jmap' JDK tools) and share them with us for analysis.
--
Denis
I might need to put this on the back burner for a little bit; we'll see.
Joe
Quoting Denis Magda <[email protected]>:
Joe,
Thanks for the clarifications. Now we're on the same page.
It's great that the cluster is initially assembled without any issues
and you see that all 64 nodes joined the topology.
In regard to the 'rebalancing timeout' warnings, I have the following
thoughts.
First, I've opened a bug that describes your case and similar ones that
happen on big clusters during rebalancing. You may want to track it:
https://issues.apache.org/jira/browse/IGNITE-1837
Second, I'm not sure that this bug is 100% your case, and its fix doesn't
guarantee that the issue on your side will disappear. That's why let's
check the following.
1) As far as I remember, before we decreased the port range used by
discovery it took significant time for you to form the cluster of 64
nodes. What are the settings of your network (throughput, 10Gb or 1Gb)?
How do you use these servers? Are they already under load from some
other apps that reduce network throughput? I think you should find out
whether everything is OK in this area or not. IMHO, at the least, the
situation is not ideal.
2) Please increase TcpCommunicationSpi.socketWriteTimeout to 15 secs (the
same value that failureDetectionTimeout has).
Actually, you may want to try configuring the network-related parameters
directly instead of relying on failureDetectionTimeout (a rough sketch
follows this list):
- TcpCommunicationSpi.socketWriteTimeout
- TcpCommunicationSpi.connectTimeout
- TcpDiscoverySpi.socketTimeout
- TcpDiscoverySpi.ackTimeout
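Something along these lines in Java (the values below are just
placeholders, not recommendations; adjust them to your network):

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class TimeoutSketch {
    // Builds a configuration with the four timeouts set explicitly (all values in ms).
    static IgniteConfiguration timeoutsConfig() {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setSocketWriteTimeout(15_000);
        commSpi.setConnectTimeout(10_000);

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setSocketTimeout(10_000);
        discoSpi.setAckTimeout(10_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);
        cfg.setDiscoverySpi(discoSpi);
        return cfg;
    }
}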
3) In some logs I see that the IGFS endpoint failed to start. Please
check what occupies that port number.
[07:33:41,736][WARN ][main][IgfsServerManager] Failed to start IGFS endpoint (will retry every 3s). Failed to bind to port (is port already in use?): 10500
4) Please turn off IGFS/HDFS/Hadoop entirely and start the cluster. Let's
check how long it stays alive in the idle state. But please look into
point 1) first.
Regards,
Denis
--
View this message in context:
http://apache-ignite-users.70518.x6.nabble.com/Help-with-tuning-for-larger-clusters-tp1692p1814.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.