20 minute 12x throughput drop using data streamer and Ignite persistence

David Harvey Mon, 12 Feb 2018 06:56:08 -0800

I have an 8-node cluster with 244 GB per node, and I'm seeing behavior that
I have no insight into and that doesn't make sense.


I'm using a custom StreamReceiver which starts a transaction and updates
4 partitioned caches, 2 of which should be local updates.  Ignite
Persistence is on, and there is 1 sync backup per cache.
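
For readers who want to picture the cache setup, a cache of the kind described above might be declared roughly as follows in Spring XML. This is only a sketch under stated assumptions: the attached file does not define any caches, the cache name is hypothetical, and reading "sync backup" as FULL_SYNC write synchronization is an assumption rather than something stated here.

```xml
<!-- Hypothetical cache definition; the name "exampleCache" is illustrative only. -->
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="exampleCache"/>
    <property name="cacheMode" value="PARTITIONED"/>
    <!-- Updating a cache inside a transaction requires the TRANSACTIONAL atomicity mode. -->
    <property name="atomicityMode" value="TRANSACTIONAL"/>
    <!-- One backup copy of each partition. -->
    <property name="backups" value="1"/>
    <!-- Assumption: "sync backup" means the write waits for the backup as well as the primary. -->
    <property name="writeSynchronizationMode" value="FULL_SYNC"/>
</bean>
```

With a setup like this, every update also involves the node holding the backup copy, so even an update that is local on the primary can generate network traffic.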


I start out with no caches.   I'm normally getting about 16K
transactions/sec, and that drops to about 1K/s for about 20 minutes, and
then recovers.

One node starts transmitting/receiving with peaks up to 260 MB/s vs. the
normal peaks which are about 60MB/s.  The thread count on that node hits a
peak and stays there for the duration of the event.   The SSD write times
are very low.  This is prior to filling up the cache, so there are no
reads.   The transmit BW drops off


The logs show nothing interesting, only checkpoints, and their frequency is
low.  The checkpoint times don't get worse, and their frequency drops off,
due to throughput drop.

I have 6 threads feeding the DataStreamer from a client node.  When each
finishes a batch of 200,000 transactions, it waits for the Futures to
complete, and will issue a TryFlush if it waits too long.  (The
DataStreamer API is not ideal for the case where there are multiple
threads using the same stream: when there are multiple streams, the choice
is to Flush, which degrades the throughput of the other streams, or to
wait, where the data is not sent if the buffers aren't filling.)
Normally each batch would take 2 minutes or so; in this case the flush did
not complete for 20 minutes.  At the low point, I was seeing 260 futures
completing per second vs. the normal ~16K.

I've attached the current configuration file.  This originally occurred
when using 64 DataStreamer threads with no other thread counts changed.  It
also seemed to cause peer class loading to fail and I needed to increase
the timeout to avoid that.

Thanks,
Dave Harvey


<?xml version="1.0" encoding="UTF-8"?>

<!-- This file was generated by Ignite Web Console (10/18/2017, 11:17) -->

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.springframework.org/schema/util
                           http://www.springframework.org/schema/util/spring-util.xsd">
    <bean class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="igniteInstanceName" value="Trial"/>

        <!-- AWS does not support IP broadcast, so we use the S3 bucket approach for discovery of cluster members -->
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.s3.TcpDiscoveryS3IpFinder">
                        <property name="bucketName" value="xxxxxxxxxxxxxxxx"/>
                        <property name="awsCredentials" ref="aws.creds"/>
                        <property name="clientConfiguration">
                            <bean class="com.amazonaws.ClientConfiguration">
                                <!-- the default is 3, and in test we had issues where this was not enough to reliably start an instance -->
                                <!-- The S3 SLA talks about error rate per 5 minute period.  Circumstantial evidence points to 20s/try timeout -->
                                <property name="maxErrorRetry" value="20"/>
                            </bean>
                        </property>
                    </bean>
                </property>
            </bean>
          </property>

          <property name="communicationSpi">
             <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
                  <property name="messageQueueLimit" value="0"/>
             </bean>
          </property>

        <!-- Enable cache events. -->
        <property name="includeEventTypes">
          <list>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_CHECKPOINT_SAVED"/>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_STARTED"/>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_STOPPED"/>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_PART_LOADED"/>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_PART_UNLOADED"/>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_TIMEDOUT"/>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_FAILED"/>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_FAILED_OVER"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_REJECTED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_CANCELLED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_TIMEDOUT"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FAILED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLASS_DEPLOY_FAILED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_DEPLOY_FAILED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_DEPLOYED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_UNDEPLOYED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_STARTED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_STOPPED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_NODE_JOINED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_NODE_LEFT"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_NODE_FAILED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_NODE_SEGMENTED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLIENT_NODE_DISCONNECTED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLIENT_NODE_RECONNECTED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLASS_DEPLOYED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLASS_UNDEPLOYED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CLASS_DEPLOY_FAILED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_DEPLOYED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_UNDEPLOYED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_DEPLOY_FAILED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_STARTED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_STOPPED"/>
             <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_NODES_LEFT"/>
          </list>
        </property>

        <property name="dataStreamerThreadPoolSize" value="64"/>

        <property name="peerClassLoadingEnabled" value="true"/>

        <!-- Hypothesis:  need more threads in this pool than data streamer -->
        <property name="systemThreadPoolSize" value="70"/>
        <!-- Use different sizes so we can predict which pool is saturated -->
        <property name="publicThreadPoolSize" value="24"/>
        <property name="igfsThreadPoolSize" value="40"/>
        <property name="stripedPoolSize" value="78"/>
        <property name="asyncCallbackPoolSize" value="56"/>
        <!-- Set 4 threads for rebalancing. -->
        <property name="rebalanceThreadPoolSize" value="4"/>

        <!-- Diagnose cause of 5s peer class loading timeout -->
        <property name="networkTimeout" value="60000"/>

        <property name="dataStorageConfiguration">
        <!-- The entire cluster uses Ignite Persistence (cluster-wide setting) -->
        <!-- Use separate directories that can be placed on different EBS volumes -->
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <!-- Set the page size to 4 KB -->
                <property name="pageSize" value="4096"/>

                <!--  switched store/wal to understand higher BW behavior for WAL -->
                <property name="storagePath" value="/IgnitePersistenceStorage/wal"/>
                <property name="walPath" value="/IgnitePersistenceStorage/store"/>
                <property name="walArchivePath" value="/IgnitePersistenceStorage/wal/archive"/>

                <!-- Enable write throttling. -->
                <!-- property name="writeThrottlingEnabled" value="false"/ -->

                <property name="defaultDataRegionConfiguration">
                   <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                   <!-- Enabling persistence. -->
                   <property name="persistenceEnabled" value="true"/>

                   <!-- Increasing the buffer size to 1 GB. -->
                   <property name="checkpointPageBufferSize" value="#{1024L * 1024 * 1024}"/>
                   <property name="name" value="Default_Region"/>
                   <!-- Setting the size of the default region to 160GB. -->
                   <property name="maxSize" value="#{160L * 1024 * 1024 * 1024}"/>
                   </bean>
                 </property>
             </bean>
         </property>

    </bean>
    <!--  The S3 buckets used allow access to Ignite nodes themselves, so we don't need additional credentials -->
    <bean id="aws.creds" class="com.amazonaws.auth.AnonymousAWSCredentials"/>
  </beans>
