I have an 8-node cluster with 244 GB/node, and I'm seeing behavior that I have no insight into and that doesn't make sense to me.
I'm using a custom StreamReceiver that starts a transaction and updates 4 partitioned caches, 2 of which should be local updates. Ignite Persistence is on, and there is 1 sync backup per cache. I start out with no caches.

I normally get about 16K transactions/sec. That drops to about 1K/s for about 20 minutes and then recovers. One node starts transmitting/receiving with peaks up to 260 MB/s, vs. the normal peaks of about 60 MB/s. The thread count on that node hits a peak and stays there for the duration of the event. The SSD write times are very low. This is prior to filling up the cache, so there are no reads. The transmit bandwidth drops off.

The logs show nothing interesting, only checkpoints, and their frequency is low. The checkpoint times don't get worse, and their frequency drops off because of the throughput drop.

I have 6 threads feeding the DataStreamer from a client node. When each finishes a batch of 200,000 transactions, it waits for the futures to complete, and will issue a TryFlush if it waits too long. (The DataStreamer API is not ideal for the case where multiple threads use the same stream: the choice is to Flush, which degrades the throughput of the other streams, or to wait, in which case the data is not sent if the buffers aren't filling.) Normally each batch would take 2 minutes or so; in this case the flush did not complete for 20 minutes. At the low point, I was seeing 260 futures completing per second, vs. the normal ~16K.

I've attached the current configuration file. This originally occurred when using 64 DataStreamer threads with no other thread counts changed. It also seemed to cause peer class loading to fail, and I needed to increase the timeout to avoid that.

Thanks,
Dave Harvey
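As a sanity check on the numbers above, here is a minimal back-of-envelope sketch of how long one round of batches should take at each rate. It assumes the quoted rates are aggregate across all 6 feeder threads, which is not stated explicitly; the class and method names are made up for illustration.

```java
// Back-of-envelope check of the batch timings quoted in the message.
// Assumption: the quoted transaction rates are aggregate across all feeder threads.
final class BatchTiming {
    static final int FEEDER_THREADS = 6;      // threads feeding the DataStreamer
    static final long BATCH_SIZE = 200_000L;  // transactions per thread per batch

    /** Seconds for every feeder thread to finish one batch at the given aggregate rate. */
    static double secondsPerRound(double aggregateTxPerSec) {
        return FEEDER_THREADS * BATCH_SIZE / aggregateTxPerSec;
    }

    public static void main(String[] args) {
        // Normal rate, degraded rate, and the low point of completing futures.
        double[] rates = {16_000, 1_000, 260};
        for (double rate : rates) {
            double secs = secondsPerRound(rate);
            System.out.println(rate + " tx/s: " + secs + " s (" + secs / 60 + " min) per round");
        }
    }
}
```

Under that assumption a round of 1,200,000 transactions takes 75 s at 16K/s, roughly in line with the usual "2 minutes or so", and 1,200 s (20 minutes) at 1K/s, which matches the length of the stall.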
<?xml version="1.0" encoding="UTF-8"?>

<!-- This file was generated by Ignite Web Console (10/18/2017, 11:17) -->

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.springframework.org/schema/util
                           http://www.springframework.org/schema/util/spring-util.xsd">
  <bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="igniteInstanceName" value="Trial"/>

    <!-- AWS does not support IP broadcast, so we use the S3 bucket approach for discovery of cluster members -->
    <property name="discoverySpi">
      <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="ipFinder">
          <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.s3.TcpDiscoveryS3IpFinder">
            <property name="bucketName" value="xxxxxxxxxxxxxxxx"/>
            <property name="awsCredentials" ref="aws.creds"/>
            <property name="clientConfiguration">
              <bean class="com.amazonaws.ClientConfiguration">
                <!-- the default is 3, and in test we had issues where this was not enough to reliably start an instance -->
                <!-- The S3 SLA talks about error rate per 5 minute period. Circumstantial evidence points to 20s/try timeout -->
                <property name="maxErrorRetry" value="20"/>
              </bean>
            </property>
          </bean>
        </property>
      </bean>
    </property>

    <property name="communicationSpi">
      <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
        <property name="messageQueueLimit" value="0"/>
      </bean>
    </property>

    <!-- Enable cache events. -->
    <property name="includeEventTypes">
      <list>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CHECKPOINT_SAVED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_STARTED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_STOPPED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_PART_LOADED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_PART_UNLOADED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_TIMEDOUT"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_FAILED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_FAILED_OVER"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_REJECTED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_CANCELLED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_TIMEDOUT"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FAILED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLASS_DEPLOY_FAILED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_DEPLOY_FAILED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_DEPLOYED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_UNDEPLOYED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_STARTED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_STOPPED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_NODE_JOINED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_NODE_LEFT"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_NODE_FAILED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_NODE_SEGMENTED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLIENT_NODE_DISCONNECTED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLIENT_NODE_RECONNECTED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLASS_DEPLOYED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLASS_UNDEPLOYED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLASS_DEPLOY_FAILED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_DEPLOYED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_UNDEPLOYED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_DEPLOY_FAILED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_STARTED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_STOPPED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_NODES_LEFT"/>
      </list>
    </property>

    <property name="dataStreamerThreadPoolSize" value="64"/>
    <property name="peerClassLoadingEnabled" value="true"/>

    <!-- Hypothesis: need more threads in this pool than data streamer -->
    <property name="systemThreadPoolSize" value="70"/>

    <!-- Use different sizes so we can predict which pool is saturated -->
    <property name="publicThreadPoolSize" value="24"/>
    <property name="igfsThreadPoolSize" value="40"/>
    <property name="stripedPoolSize" value="78"/>
    <property name="asyncCallbackPoolSize" value="56"/>

    <!-- Set 4 threads for rebalancing. -->
    <property name="rebalanceThreadPoolSize" value="4"/>

    <!-- Diagnose cause of 5s peer class loading timeout -->
    <property name="networkTimeout" value="60000"/>

    <property name="dataStorageConfiguration">
      <!-- The entire cluster uses Ignite Persistence (cluster-wide setting) -->
      <!-- Use separate directories that can be placed on different EBS volumes -->
      <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <!-- Set the page size to 4 KB -->
        <property name="pageSize" value="4096"/>

        <!-- switched store/wal to understand higher BW behavior for WAL -->
        <property name="storagePath" value="/IgnitePersistenceStorage/wal"/>
        <property name="walPath" value="/IgnitePersistenceStorage/store"/>
        <property name="walArchivePath" value="/IgnitePersistenceStorage/wal/archive"/>

        <!-- Enable write throttling. -->
        <!-- property name="writeThrottlingEnabled" value="false"/ -->

        <property name="defaultDataRegionConfiguration">
          <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
            <!-- Enabling persistence. -->
            <property name="persistenceEnabled" value="true"/>

            <!-- Increasing the buffer size to 1 GB. -->
            <property name="checkpointPageBufferSize" value="#{1024L * 1024 * 1024}"/>

            <property name="name" value="Default_Region"/>

            <!-- Setting the size of the default region to 160GB. -->
            <property name="maxSize" value="#{160L * 1024 * 1024 * 1024}"/>
          </bean>
        </property>
      </bean>
    </property>
  </bean>

  <!-- The S3 buckets used allow access to Ignite nodes themselves, so we don't need additional credentials -->
  <bean id="aws.creds" class="com.amazonaws.auth.AnonymousAWSCredentials"/>
</beans>
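For reference, the sizes set in the configuration above work out as follows. This is only a rough sketch: it evaluates the two size expressions from the file and divides them out, assuming all 8 nodes run this same configuration with 1 backup per cache, and it ignores per-page overhead, indexes, and metadata. The class and method names are made up for illustration.

```java
// Rough sizing arithmetic for the data region settings in the attached configuration.
// Assumptions: all 8 nodes use this configuration; 1 backup copy per cache;
// page headers, indexes, and other overhead are ignored.
final class RegionSizing {
    static final long PAGE_SIZE_BYTES = 4096L;                          // pageSize
    static final long REGION_MAX_BYTES = 160L * 1024 * 1024 * 1024;     // maxSize
    static final long CHECKPOINT_BUFFER_BYTES = 1024L * 1024 * 1024;    // checkpointPageBufferSize
    static final int NODES = 8;
    static final int BACKUPS = 1;

    /** Number of 4 KB pages that fit in one node's default data region. */
    static long pagesPerRegion() {
        return REGION_MAX_BYTES / PAGE_SIZE_BYTES;
    }

    /** Number of pages that fit in the checkpoint page buffer. */
    static long pagesInCheckpointBuffer() {
        return CHECKPOINT_BUFFER_BYTES / PAGE_SIZE_BYTES;
    }

    /** In-memory capacity for unique (primary) data across the cluster, in GiB. */
    static long uniqueDataGiB() {
        long regionGiB = REGION_MAX_BYTES / (1024L * 1024 * 1024);
        return NODES * regionGiB / (1 + BACKUPS);
    }

    public static void main(String[] args) {
        System.out.println("Pages per data region:      " + pagesPerRegion());
        System.out.println("Pages in checkpoint buffer: " + pagesInCheckpointBuffer());
        System.out.println("Unique data in memory, GiB: " + uniqueDataGiB());
    }
}
```

With these settings each node's region holds about 42 million pages, the checkpoint buffer covers 1/160 of the region, and the cluster can keep roughly 640 GiB of unique data in memory before it has to read pages back from disk.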