Hi Jens. I am using GCP to fire up 3 servers. The import is quick enough, and the cluster and network look OK afterwards. Speed between the 3 nodes also looks fine.
I have these properties enabled when I start the server:

java -server -agentpath:/home/r2d2/yourkit/bin/linux-x86-64/libyjpagent.so \
     -javaagent:lib/aspectj/lib/aspectjweaver.jar \
     -Dgemfire.EXPIRY_THREADS=20 \
     -Dgemfire.PREFER_SERIALIZED=false \
     -Dgemfire.enable.network.partition.detection=false \
     -Dgemfire.autopdx.ignoreConstructor=true \
     -Dgemfire.ALLOW_PERSISTENT_TRANSACTIONS=true \
     -Dgemfire.member-timeout=600000 \
     -Xmx90G -Xms90G -Xmn30G \
     -XX:SurvivorRatio=1 -XX:MaxTenuringThreshold=15 \
     -XX:CMSInitiatingOccupancyFraction=78 \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+DisableExplicitGC \
     -XX:+PrintGCDetails -XX:+PrintTenuringDistribution \
     -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
     -XX:+PrintGCApplicationStoppedTime -verbose:gc \
     -Xloggc:/home/r2d2/rdb-geode-server/gc/gc-server.log \
     -Djava.rmi.server.hostname='localhost' \
     -Dcom.sun.management.jmxremote.port=9010 \
     -Dcom.sun.management.jmxremote.rmi.port=9010 \
     -Dcom.sun.management.jmxremote.local.only=false \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     ..... org.rdb.geode.server.GeodeServer

Could this setting influence the cluster: -Dgemfire.enable.network.partition.detection=false

I am seeing a lot of recovery messages:

[info 2018/10/16 15:32:26.867 UTC <Recovery thread for bucket
_B__net.lautus.gls.domain.life.instruction.instruction.rebalance.AggregatePortfolioRebalanceChoice_92>
tid=0x42c9] Initialization of region
_B__net.lautus.gls.domain.life.instruction.instruction.rebalance.AggregatePortfolioRebalanceChoice_92
completed

[info 2018/10/14 11:19:17.329 SAST <RedundancyLogger for region
net.lautus.gls.domain.life.additionalfields.AdditionalFieldConfiguration>
tid=0x1858] Region
/net.lautus.gls.domain.life.additionalfields.AdditionalFieldConfiguration
(and any colocated sub-regions) has potentially stale data. Buckets [3]
are waiting for another offline member to recover the latest data.
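For context, the -Dgemfire.* flags above correspond to standard Geode configuration properties that can also be set in gemfire.properties. A minimal sketch with the same values (property names per the Geode configuration reference; whether this file or the system properties is preferable here is not settled in this thread):

```properties
# gemfire.properties - equivalent cluster-level settings (sketch)
enable-network-partition-detection=false
# member-timeout is in milliseconds; 600000 = 10 minutes, far above the 5000 ms default
member-timeout=600000
```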
My persistent id is:

DiskStore ID: 932530bc-4c45-4926-b4a1-6fe5fe1f0493
Name:
Location: /10.154.0.2:/home/r2d2/rdb-geode-server/geode/tauDiskStore

Offline members with potentially new data:
[
  DiskStore ID: c09e4cce-51e9-4111-8643-fe582677f49f
  Location: /10.154.0.4:/home/r2d2/rdb-geode-server/geode/tauDiskStore
  Buckets: [3]
]
Use the "gfsh show missing-disk-stores" command to see all disk stores
that are being waited on by other members.

[info 2018/10/14 11:19:35.250 SAST <Pooled Waiting Message Processor 7>
tid=0x1318] Configured redundancy of 1 copies has been restored to
/net.lautus.gls.domain.life.additionalfields.AdditionalFieldConfiguration

Btw, we are using Apache Geode 1.7.0.

Kindly
Pieter

On Wed, Oct 17, 2018 at 3:56 PM Jens Deppe <[email protected]> wrote:

> Hi Pieter,
>
> Your startup times are definitely too long - probably by at least an order
> of magnitude. My first guess is that this is network related. This may
> either be a DNS lookup issue or, if the cluster is isolated from the
> internet, it may be some problem with XSD validation needing internet
> access (even though we do bundle the XSD files with Geode - it should be
> the same for Spring too). I will see if I can find any potential XSD issue.
>
> --Jens
>
> On Wed, Oct 17, 2018 at 3:22 AM Pieter van Zyl <[email protected]>
> wrote:
>
>> Good day.
>>
>> We are currently running a 3-node Geode cluster.
>>
>> We are running the locator from gfsh and then starting up 3 servers with
>> Spring that connect to the central locator.
>>
>> We are using persistence on all the regions and have basically one data
>> store and one pdx store per node.
>>
>> The problem we are experiencing is that with no data, aka a clean
>> cluster, it takes 75 minutes to start up.
>>
>> Once data has been imported into the cluster and we shut down all
>> nodes/servers and start up again, it takes 128 to 160 minutes.
>> This is very slow.
>>
>> The question is: is there any way to improve the startup speed?
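For readers following along, the gfsh workflow that the log points at might look like the session below. The locator address is illustrative, and the disk-store ID is copied from the log above; note that revoke missing-disk-store tells the cluster to stop waiting and discards that member's potentially newer data, so it should only be used when the offline member is truly gone:

```
gfsh> connect --locator=localhost[10334]
gfsh> show missing-disk-stores
gfsh> revoke missing-disk-store --id=c09e4cce-51e9-4111-8643-fe582677f49f
```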
>> Is this normal and expected speed?
>>
>> We have a 100 gig database distributed across the 3 nodes.
>> Server 1: 100 gig memory, 90 gig assigned heap, db size of 49 gig, 32 cores.
>> Server 2: 64 gig memory, 60 gig assigned heap, db size of 34 gig, 16 cores.
>> Server 3: 64 gig memory, 60 gig assigned heap, db size of 34 gig, 16 cores.
>>
>> Should we have more data stores? Maybe separate stores for the
>> partitioned vs replicated regions?
>>
>> <gfe:disk-store id="pdx-disk-store" allow-force-compaction="true"
>>                 auto-compact="true" max-oplog-size="1024">
>>     <gfe:disk-dir location="geode/pdx"/>
>> </gfe:disk-store>
>>
>> <gfe:disk-store id="tauDiskStore" allow-force-compaction="true"
>>                 auto-compact="true" max-oplog-size="5120"
>>                 compaction-threshold="90">
>>     <gfe:disk-dir location="geode/tauDiskStore"/>
>> </gfe:disk-store>
>>
>> We have a mix of regions:
>>
>> Example partitioned region:
>>
>> <gfe:replicated-region id="net.lautus.gls.domain.life.accounting.Account"
>>                        disk-store-ref="tauDiskStore"
>>                        statistics="true"
>>                        persistent="true">
>>     <!--<gfe:cache-listener ref="cacheListener"/>-->
>>     <gfe:eviction type="HEAP_PERCENTAGE" action="OVERFLOW_TO_DISK"/>
>> </gfe:replicated-region>
>>
>> Example replicated region:
>>
>> <gfe:replicated-region id="org.rdb.internal.session.rootmap.RootMapHolder"
>>                        disk-store-ref="tauDiskStore"
>>                        statistics="true" persistent="true">
>>     <!--<gfe:cache-listener ref="cacheListener"/>-->
>>     <gfe:eviction type="ENTRY_COUNT" action="OVERFLOW_TO_DISK"
>>                   threshold="100">
>>         <gfe:object-sizer ref="objectSizer"/>
>>     </gfe:eviction>
>> </gfe:replicated-region>
>>
>> Any advice would be appreciated.
>>
>> Kindly
>> Pieter
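The disk-store split Pieter asks about could be sketched as follows, using the same Spring Data GemFire namespace as in the mail. The new store id and directory are illustrative assumptions, not taken from the thread, and this is only one possible layout:

```xml
<!-- A dedicated store for the replicated regions (illustrative name/location) -->
<gfe:disk-store id="replicatedDiskStore" auto-compact="true"
                max-oplog-size="1024">
    <gfe:disk-dir location="geode/replicated"/>
</gfe:disk-store>

<!-- Replicated regions point at the dedicated store... -->
<gfe:replicated-region id="org.rdb.internal.session.rootmap.RootMapHolder"
                       disk-store-ref="replicatedDiskStore"
                       statistics="true" persistent="true"/>

<!-- ...while partitioned regions keep tauDiskStore. Note that a truly
     partitioned region is declared with gfe:partitioned-region, whereas
     the "partitioned" example in the mail uses gfe:replicated-region. -->
<gfe:partitioned-region id="net.lautus.gls.domain.life.accounting.Account"
                        disk-store-ref="tauDiskStore"
                        statistics="true" persistent="true"/>
```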
