Igor,

Thanks for responding.

I have two Java singletons that I use. The first is the CacheManager, which
starts the client instance. The second is another singleton that caches the
repository name + CacheClient pairs so we can reuse them throughout the
process.

/**
 * This is a singleton for Apache Ignite; it starts the client instance.
 */
public class IgniteCacheManager {
    private static IgniteCacheManager instance;
    private IgniteClient igniteClient;
    private ClientConfiguration cfg;

    public static IgniteCacheManager getInstance(YamlConfig cacheConfig) {
        if (instance == null) instance = new IgniteCacheManager(cacheConfig);
        return instance;
    }

    private IgniteCacheManager(YamlConfig cacheConfig) {
        String hostsMap = cacheConfig.getString("hosts");
        String[] hosts = null;
        if (hostsMap != null) {
            hosts = hostsMap.split(",");
        } else {
            hosts = new String[] { "localhost:10800" };
        }
        cfg = new ClientConfiguration().setAddresses(hosts)
                .setTimeout(cacheConfig.getInteger("cache.timeout"));
        igniteClient = Ignition.startClient(cfg);
    }

    public IgniteClient getClient() {
        return this.igniteClient;
    }

    public void reconnect() {
        igniteClient = Ignition.startClient(cfg);
    }
}
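
One thing I'm starting to wonder about: getInstance() is lazy but not
synchronized, and it is called from every request thread, so under contention
two threads could each construct (and effectively leak) an IgniteClient. A
minimal sketch of a synchronized variant of the same method, in case that is
a factor:

    // Sketch: same lazy init, but synchronized so two request threads
    // cannot both construct (and leak) an IgniteClient.
    public static synchronized IgniteCacheManager getInstance(YamlConfig cacheConfig) {
        if (instance == null) {
            instance = new IgniteCacheManager(cacheConfig);
        }
        return instance;
    }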


public class CacheFactory {
    private static CacheFactory instance;
    private YamlConfig cacheConfig;
    private Map<String, CacheClient<?>> clientCache = new ConcurrentHashMap<>();

    public static CacheFactory getInstance(YamlConfig cacheConfig) {
        if (instance == null) instance = new CacheFactory(cacheConfig);
        return instance;
    }

    /**
     * This is the main factory method for pulling a cache instance to begin caching.
     */
    public <T> CacheClient<T> getCacheProvider(Class<T> t, String repository) {
        CacheClient<T> client = null;
        if (clientCache.containsKey(repository)) {
            client = (CacheClient<T>) clientCache.get(repository);
        } else {
            try {
                client = new IgniteCacheClientWrapper<T>(cacheConfig, repository);
                clientCache.put(repository, client);
            } catch (Exception ex) {
                // If we encounter any errors, return null and let the
                // caller decide how to act on the null response.
                LOG.error(ex);
            }
        }
        return client;
    }
}
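
The same question applies to the factory: containsKey/get/put is not atomic,
even on a ConcurrentHashMap, so two threads resolving the same repository at
the same time could each build an IgniteCacheClientWrapper. A sketch of the
same method collapsed into one atomic step (assuming the same
IgniteCacheClientWrapper and CacheClient types as above):

    // Sketch: computeIfAbsent makes the check-then-create atomic, so at
    // most one wrapper is ever built per repository name.
    public <T> CacheClient<T> getCacheProvider(Class<T> t, String repository) {
        try {
            return (CacheClient<T>) clientCache.computeIfAbsent(repository, repo -> {
                try {
                    return new IgniteCacheClientWrapper<T>(cacheConfig, repo);
                } catch (Exception ex) {
                    // Wrap the checked exception; unwrapped and logged below.
                    throw new IllegalStateException(ex);
                }
            });
        } catch (IllegalStateException ex) {
            // Keep the old contract: log and return null, caller decides.
            LOG.error(ex);
            return null;
        }
    }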

To call this, we use this method inside all of our request threads:

public <T> CacheClient<T> getCacheClient(Class<T> t, String key) {
    CacheFactory factory = CacheFactory.getInstance(cacheConfig);
    return factory.getCacheProvider(t, key);
}

...
getCacheClient(StorageContainer.class, partner).get(id);
...
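
Since getCacheProvider() can return null on a failure (that is the contract in
the catch block above), the call site should really guard for it before
chaining .get(). A sketch of what I mean, using the same StorageContainer,
partner, and id from our code:

    // Sketch: guard the possibly-null client before chaining .get(),
    // because the factory returns null when the wrapper fails to construct.
    CacheClient<StorageContainer> client = getCacheClient(StorageContainer.class, partner);
    StorageContainer container = (client != null) ? client.get(id) : null;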


The spikes are unpredictable: we see normal load on all 3 nodes, but we do
see a huge spike in these errors around the time the hosts lock up.

Mar 25 06:25:04 prd-cache001 service.sh[10538]: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1104)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2389)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2156)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1797)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at java.lang.Thread.run(Thread.java:748)

[06:25:04,742][SEVERE][grid-nio-worker-client-listener-1-#30][ClientListenerProcessor]
Failed to process selector key [ses=GridSelectorNioSessionImpl
[worker=ByteBufferNioClientWorker [readBuf=java.nio.HeapByteBuffer[pos=0
lim=8192 cap=8192], super=AbstractNioClientWorker [idx=1, bytesRcvd=0,
bytesSent=0, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker
[name=grid-nio-worker-client-listener-1, igniteInstanceName=null,
finished=false, heartbeatTs=1553520191555, hashCode=1720789126,
interrupted=false, runner=grid-nio-worker-client-listener-1-#30]]],
writeBuf=null, readBuf=null, inRecovery=null, outRecovery=null,
super=GridNioSessionImpl [locAddr=/10.132.52.64:10800,
rmtAddr=/10.132.52.59:49105, createTime=1553519004533, closeTime=0,
bytesSent=5, bytesRcvd=12, bytesSent0=0, bytesRcvd0=0,
sndSchedTime=1553519005007, lastSndTime=1553519005525,
lastRcvTime=1553519005007, readsPaused=false,
filterChain=FilterChain[filters=[GridNioAsyncNotifyFilter,
GridNioCodecFilter [parser=ClientListenerBufferedParser,
directMode=false]], accepted=true, markedForClose=false]]]

Then we see the TCP connection count go way up, we hit "too many open files",
and request times take forever. One thing I did change was the client
timeout: I had it at 100 ms and increased it to 250 ms. I am not sure if load
causes connections to time out, so the client spawns more connections, or if
GC causes the connections to hang.

My 3 nodes are 2 CPUs and 8 GB RAM each; during load peaks, we see averages
of 6% CPU with 30% memory utilization. Here are the settings I have for my GC:

/usr/bin/java -server -Xms1g -Xmx1g -XX:+AlwaysPreTouch -XX:+UseG1GC
-XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxMetaspaceSize=256m
-Djava.net.preferIPv4Stack=true -DIGNITE_QUIET=true
-DIGNITE_SUCCESS_FILE=/usr/share/apache-ignite/work/ignite_success_274df869-bebf-47d0-8c9e-6b2da78f1f09
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=49112
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-DIGNITE_HOME=/usr/share/apache-ignite
-DIGNITE_PROG_NAME=/usr/share/apache-ignite/bin/ignite.sh -cp
/usr/share/apache-ignite/libs/*:/usr/share/apache-ignite/libs/ignite-indexing/*:/usr/share/apache-ignite/libs/ignite-rest-http/*:/usr/share/apache-ignite/libs/ignite-spring/*:/usr/share/apache-ignite/libs/licenses/*
org.apache.ignite.startup.cmdline.CommandLineStartup
/etc/apache-ignite/default-config.xml
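
To rule GC in or out, I will probably add GC logging to that command line. As
far as I know the standard Java 8 flags are along these lines (the log path
is just an example):

    -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
    -Xloggc:/var/log/apache-ignite/gc.log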



On Wed, Mar 27, 2019 at 4:27 AM Igor Sapego <[email protected]> wrote:

> That's really weird. There should not be so many connections. Normally a
> thin client will open at most one TCP connection per node. In many cases,
> there is going to be only one connection.
>
> Do you create IgniteClient in your application once, or do you start them
> several times? Could it be that your code is leaking IgniteClient instances?
>
> Can you provide some minimal reproducer to us, so we can debug the issue?
>
> Best Regards,
> Igor
>
>
> On Mon, Mar 25, 2019 at 11:19 PM Brent Williams <[email protected]>
> wrote:
>
>> All,
>>
>> I am running Apache Ignite 2.7.0. I have 3 nodes in my cluster; CPU,
>> memory, and GC are all tuned properly. I have even raised the open-file
>> limit to 65k. I have 8 client nodes connecting to the 3-node cluster,
>> and for the most part it works fine. However, we see spikes in
>> connections, we blow past the file limit, we get "too many open files",
>> and all client nodes hang.
>>
>> When I check the connections per client on one of the server nodes, I am
>> seeing 5,500+ TCP connections established per host, roughly 44,000+ in
>> total. My question is: what should the file limits be? Why so many TCP
>> connections per host? And how do we control this? It is causing our
>> production cluster to hang.
>>
>> --Brent
>>
>>
>>
