Hello! I can already see that you are not closing the IgniteClient when you reconnect, which means its resources are never freed.
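
For example, something along these lines in IgniteCacheManager (just a sketch based on the code you quoted below; the synchronized modifier and the null check are my additions, not part of your code):

    public synchronized void reconnect() {
        try {
            if (igniteClient != null)
                igniteClient.close(); // release the sockets held by the old thin-client connection
        } catch (Exception e) {
            // the old client is being discarded anyway; you may want to log this
        }
        igniteClient = Ignition.startClient(cfg);
    }

Without the close() call, every reconnect leaves the previous connection (and its file descriptors) behind on both the client and the server.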
Have you considered using Ignite (with Ignition.setClientMode(true)) instead of IgniteClient? A short sketch of that approach is at the bottom of this message, below the quoted thread.

Regards,
--
Ilya Kasnacheev


Wed, Mar 27, 2019 at 22:03, Brent Williams <[email protected]>:

> Igor,
>
> Thanks for responding.
>
> I have 2 Java singletons that I use. The first is the CacheManager, which
> starts the client instance. The second is another singleton that caches the
> repository name + CacheClient so we can reuse them throughout the process.
>
> /**
>  * This is a singleton for Apache Ignite; it starts the client instance.
>  */
> public class IgniteCacheManager {
>     private static IgniteCacheManager instance;
>     private IgniteClient igniteClient;
>     private ClientConfiguration cfg;
>
>     public static IgniteCacheManager getInstance(YamlConfig cacheConfig) {
>         if (instance == null) instance = new IgniteCacheManager(cacheConfig);
>         return instance;
>     }
>
>     private IgniteCacheManager(YamlConfig cacheConfig) {
>         String hostsMap = cacheConfig.getString("hosts");
>         String[] hosts = null;
>         if (hostsMap != null) {
>             hosts = hostsMap.split(",");
>         } else {
>             hosts = new String[] { "localhost:10800" };
>         }
>         cfg = new ClientConfiguration().setAddresses(hosts)
>                 .setTimeout(cacheConfig.getInteger("cache.timeout"));
>         igniteClient = Ignition.startClient(cfg);
>     }
>
>     public IgniteClient getClient() {
>         return this.igniteClient;
>     }
>
>     public void reconnect() {
>         igniteClient = Ignition.startClient(cfg);
>     }
> }
>
>
> public class CacheFactory {
>     private static CacheFactory instance;
>     private YamlConfig cacheConfig;
>     private Map<String, CacheClient<?>> clientCache = new ConcurrentHashMap<>();
>
>     public static CacheFactory getInstance(YamlConfig cacheConfig) {
>         if (instance == null) instance = new CacheFactory(cacheConfig);
>         return instance;
>     }
>
>     /**
>      * This is the main factory method for pulling a cache instance to begin caching.
>      */
>     public <T> CacheClient<T> getCacheProvider(Class<T> t, String repository) {
>         CacheClient<T> client = null;
>         if (clientCache.containsKey(repository)) {
>             client = (CacheClient<T>) clientCache.get(repository);
>         } else {
>             try {
>                 client = new IgniteCacheClientWrapper<T>(cacheConfig, repository);
>                 clientCache.put(repository, client);
>             } catch (Exception ex) {
>                 /**
>                  * If we encounter any errors, return null and let the
>                  * caller decide how to act on the null response.
>                  */
>                 LOG.error(ex);
>             }
>         }
>         return client;
>     }
> }
>
> To call this we use this method inside all of our request threads:
>
> public <T> CacheClient<T> getCacheClient(Class<T> t, String key) {
>     CacheFactory factory = CacheFactory.getInstance(cacheConfig);
>     return factory.getCacheProvider(t, key);
> }
>
> ...
> getCacheClient(StorageContainer.class, partner).get(id);
> ...
>
>
> The spikes are unpredictable. We see normal load on all 3 nodes, however
> we do see a huge spike in these errors around the time the hosts lock up.
>
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: java.io.IOException: Connection reset by peer
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1104)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2389)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2156)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1797)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: #011at java.lang.Thread.run(Thread.java:748)
> Mar 25 06:25:04 prd-cache001 service.sh[10538]: [06:25:04,742][SEVERE][grid-nio-worker-client-listener-1-#30][ClientListenerProcessor] Failed to process selector key [ses=GridSelectorNioSessionImpl [worker=ByteBufferNioClientWorker [readBuf=java.nio.HeapByteBuffer[pos=0 lim=8192 cap=8192], super=AbstractNioClientWorker [idx=1, bytesRcvd=0, bytesSent=0, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-client-listener-1, igniteInstanceName=null, finished=false, heartbeatTs=1553520191555, hashCode=1720789126, interrupted=false, runner=grid-nio-worker-client-listener-1-#30]]], writeBuf=null, readBuf=null, inRecovery=null, outRecovery=null, super=GridNioSessionImpl [locAddr=/10.132.52.64:10800, rmtAddr=/10.132.52.59:49105, createTime=1553519004533, closeTime=0, bytesSent=5, bytesRcvd=12, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1553519005007, lastSndTime=1553519005525, lastRcvTime=1553519005007, readsPaused=false, filterChain=FilterChain[filters=[GridNioAsyncNotifyFilter, GridNioCodecFilter [parser=ClientListenerBufferedParser, directMode=false]], accepted=true, markedForClose=false]]]
>
> Then we see the TCP connection count go way up, we hit "too many open files", and request times take forever. One thing I did change was to increase the timeout on the client: I had it at 100 ms but increased it to 250 ms. I am not sure whether the load causes connections to time out, so that more connections are spawned, or whether GC causes the connections to hang.
>
> My 3 nodes are 2 CPU, 8 GB RAM. During load peaks we see averages of 6% CPU with 30% memory utilization. Here are the settings I have for the GC.
>
> /usr/bin/java -server -Xms1g -Xmx1g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -DIGNITE_QUIET=true -DIGNITE_SUCCESS_FILE=/usr/share/apache-ignite/work/ignite_success_274df869-bebf-47d0-8c9e-6b2da78f1f09 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=49112 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -DIGNITE_HOME=/usr/share/apache-ignite -DIGNITE_PROG_NAME=/usr/share/apache-ignite/bin/ignite.sh -cp /usr/share/apache-ignite/libs/*:/usr/share/apache-ignite/libs/ignite-indexing/*:/usr/share/apache-ignite/libs/ignite-rest-http/*:/usr/share/apache-ignite/libs/ignite-spring/*:/usr/share/apache-ignite/libs/licenses/* org.apache.ignite.startup.cmdline.CommandLineStartup /etc/apache-ignite/default-config.xml
>
>
> On Wed, Mar 27, 2019 at 4:27 AM Igor Sapego <[email protected]> wrote:
>
>> That's really weird. There should not be so many connections. Normally a thin
>> client will open at most one TCP connection per node. In many cases there
>> is going to be only one connection.
>>
>> Do you create the IgniteClient in your application once, or do you start it
>> several times? Could it be that your code is leaking IgniteClient instances?
>>
>> Can you provide a minimal reproducer so we can debug the issue?
>>
>> Best Regards,
>> Igor
>>
>>
>> On Mon, Mar 25, 2019 at 11:19 PM Brent Williams <[email protected]> wrote:
>>
>>> All,
>>>
>>> I am running Apache Ignite 2.7.0. I have 3 nodes in my cluster; CPU,
>>> memory, and GC are all tuned properly. I have even adjusted the file
>>> descriptor limit to 65k. I have 8 client nodes that connect to the
>>> 3-node cluster, and for the most part it works fine. However, we see
>>> spikes in connections, we start to blow out the file limit, we get
>>> "too many open files", and all client nodes hang.
>>>
>>> When I check the connections per client on one of the server nodes, I am
>>> seeing 5,500+ TCP connections established per host, roughly 44,000+ in
>>> total. My questions are: what should the file limits be? Why are there
>>> so many TCP connections per host? How do we control this, as it is
>>> causing our production cluster to hang?
>>>
>>> --Brent
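
As mentioned at the top of this message, here is a minimal sketch of the thick-client alternative. It assumes the servers are reachable through your default discovery configuration; the cache name "storageContainers" and the String value type are hypothetical, just for illustration:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class ThickClientSketch {
        public static void main(String[] args) {
            // Mark this JVM as a client node before starting it.
            Ignition.setClientMode(true);

            // Uses default discovery settings; pass a path to your own
            // client-side XML config if you need explicit server addresses.
            try (Ignite ignite = Ignition.start()) {
                IgniteCache<String, String> cache = ignite.getOrCreateCache("storageContainers");
                cache.put("someId", "someValue");
                System.out.println(cache.get("someId"));
            }
            // In a real service you would keep the Ignite instance open for the
            // lifetime of the application instead of closing it per request.
        }
    }

A client node joins the cluster topology and manages its own connections, so you typically create it once at startup and close it only on shutdown, rather than reconnecting on every failure.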
