[jira] [Created] (IGNITE-12325) GridCacheMapEntry reservation mechanism is broken with enabled cache store

2019-10-23 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12325:


 Summary: GridCacheMapEntry reservation mechanism is broken with 
enabled cache store
 Key: IGNITE-12325
 URL: https://issues.apache.org/jira/browse/IGNITE-12325
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
 Fix For: 2.8


Deferred entry deletion was disabled for transactional caches after 
https://issues.apache.org/jira/browse/IGNITE-11704. 
However, if a cache store is enabled, there is a race between cache entry reservation 
after a transactional remove and clearing the reservation after a cache load:

{noformat}
java.lang.AssertionError: GridDhtCacheEntry [rdrs=ReaderId[] [ReaderId 
[nodeId=96c87c98-2524-4f9e-8a2f-6cfceda5, msgId=22663371, txFut=null], 
ReaderId [nodeId=68130805-0dc8-4ef4-abf7-7e7cde86, msgId=22663375, 
txFut=null], ReaderId [nodeId=b4a8abce-8d0e-4459-b93a-a734ad64, 
msgId=22663370, txFut=null]], part=8, super=GridDistributedCacheEntry 
[super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8, 
hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, 
nodeOrder=0], hash=8, extras=null, flags=2]]]
at 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry.clearReserveForLoad(GridCacheMapEntry.java:3616)
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter.clearReservationsIfNeeded(GridCacheAdapter.java:2429)
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter.access$400(GridCacheAdapter.java:179)
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2309)
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2217)
at 
org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6963)
at 
org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:844)

{noformat}

The issue can be resolved by re-enabling deferred delete when a cache store is 
configured.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12299) Store tombstone links into separate BPlus tree to avoid partition full-scan during tombstones remove

2019-10-17 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12299:


 Summary: Store tombstone links into separate BPlus tree to avoid 
partition full-scan during tombstones remove
 Key: IGNITE-12299
 URL: https://issues.apache.org/jira/browse/IGNITE-12299
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
 Fix For: 2.9


Currently, we can't quickly identify which keys in a partition are tombstones. 
Collecting tombstones requires a full scan of the partition's BPlus tree, which can 
slow down the node once rebalance finishes and tombstone cleanup is required. We can 
introduce a separate BPlus tree inside the partition (similar to the TTL tree) that 
stores links to the tombstone keys. When tombstone cleanup is needed, we can then scan 
only the subset of keys stored in this tree.
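
A conceptual sketch of the idea follows (illustrative only: a JDK sorted set stands in 
for Ignite's internal BPlusTree, and all names are hypothetical):

{code:java}
import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListSet;

// Conceptual sketch only: tombstone row links are tracked in a dedicated sorted
// structure, so cleanup visits only tombstones instead of scanning the whole
// partition tree.
public class TombstoneIndexSketch {
    /** Links of rows that are tombstones (in Ignite a link packs page id and offset into a long). */
    private final ConcurrentSkipListSet<Long> tombstoneLinks = new ConcurrentSkipListSet<>();

    /** Called when a remove writes a tombstone row. */
    public void onTombstoneWritten(long link) {
        tombstoneLinks.add(link);
    }

    /** Called by cleanup after rebalance: iterates only over tombstone links. */
    public void cleanup(TombstoneRemover remover) {
        for (Iterator<Long> it = tombstoneLinks.iterator(); it.hasNext(); ) {
            remover.remove(it.next()); // remove the data row referenced by the link

            it.remove();
        }
    }

    /** Hypothetical callback that removes the actual data row by its link. */
    public interface TombstoneRemover {
        void remove(long link);
    }
}
{code}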



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12298) Write tombstones on incomplete baseline to get rid of partition cleanup

2019-10-17 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12298:


 Summary: Write tombstones on incomplete baseline to get rid of 
partition cleanup
 Key: IGNITE-12298
 URL: https://issues.apache.org/jira/browse/IGNITE-12298
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
 Fix For: 2.9


After tombstone objects were introduced in 
https://issues.apache.org/jira/browse/IGNITE-11704, 
we can write tombstones on OWNING nodes if the baseline is incomplete (some of the 
backup nodes have left). When the baseline becomes complete and the old nodes return, 
we can avoid partition cleanup on those nodes before rebalance. Instead, we can 
transfer the whole OWNING partition state, including tombstones, which will clear the 
data that was removed while the node was offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12297) Detect lost partitions is not happened during cluster activation

2019-10-16 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12297:


 Summary: Detect lost partitions is not happened during cluster 
activation
 Key: IGNITE-12297
 URL: https://issues.apache.org/jira/browse/IGNITE-12297
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.8


We invoke `detectLostPartitions` during PME only when a server node joins or leaves.
However, we can activate a persistent cluster in which a partition has MOVING status 
on all nodes. In this case, the partition may stay in the MOVING state forever until 
some other topology event occurs.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12255) Cache affinity fetching and calculation on client nodes may be broken in some cases

2019-10-03 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12255:


 Summary: Cache affinity fetching and calculation on client nodes 
may be broken in some cases
 Key: IGNITE-12255
 URL: https://issues.apache.org/jira/browse/IGNITE-12255
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.7, 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


We have a cluster with server and client nodes.
We dynamically start several caches in the cluster.
Periodically we create and destroy a temporary cache to bump the cluster topology 
version.
At the same time, a random client node chooses a random existing cache and performs 
operations on it.
This leads to an exception on the client node saying that affinity is not initialized 
for the cache during a cache operation, for example:
Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]

This exception means that the last affinity for the cache was calculated on version 
[8,2], which is the cache start version. It happens because, while creating/destroying 
a temporary cache, we don't re-calculate affinity on client nodes for caches that 
exist but have not yet been accessed. Re-calculation in this case is cheap: we just 
copy the affinity assignment and increment the topology version (see the sketch below).
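
A minimal sketch of this cheap re-calculation on the client side (hypothetical types 
and names, not the actual Ignite code):

{code:java}
import java.util.List;
import java.util.NavigableMap;
import java.util.UUID;
import java.util.concurrent.ConcurrentSkipListMap;

// Minimal sketch: when a topology event does not change the distribution for a cache,
// the previous assignment is simply re-registered under the new topology version.
public class ClientAffinityCacheSketch {
    /** Cached assignments per topology version; the value is partition -> owner nodes. */
    private final NavigableMap<Long, List<List<UUID>>> assignments = new ConcurrentSkipListMap<>();

    /** Reuses the latest known assignment for a topology event that did not affect this cache. */
    public void onTopologyChangedWithoutRebalance(long newTopVer) {
        List<List<UUID>> last = assignments.lastEntry().getValue();

        assignments.put(newTopVer, last); // same distribution, just a new version
    }

    /** Fails like "Affinity for topology version is not initialized" if the version is missing. */
    public List<List<UUID>> assignment(long topVer) {
        List<List<UUID>> assignment = assignments.get(topVer);

        if (assignment == null)
            throw new IllegalStateException("Affinity for topology version is not initialized: " + topVer);

        return assignment;
    }
}
{code}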

As a solution, we need to fetch affinity for all caches when a client node joins. 
We also need to re-calculate affinity for all affinity holders (not only for started 
or configured caches) on the client node for every topology event that happens in the 
cluster.

Implementing this solution exposed an existing race between client node join and 
concurrent cache destroy.

The race is the following:

A client node (with some configured caches) joins the cluster and sends a 
SingleMessage to the coordinator during the client PME. This SingleMessage contains 
affinity fetch requests for all cluster caches. While the SingleMessage is in flight, 
the server nodes finish the client PME and also process and finish a cache destroy 
PME. When a cache is destroyed, the affinity for that cache is cleared. When the 
SingleMessage is delivered to the coordinator, the coordinator no longer has affinity 
for the requested cache because the cache is already destroyed. This leads to an 
assertion error on the coordinator and unpredictable behavior on the client node.

The race may be fixed with the following change:

If the coordinator doesn't have affinity for a cache requested by the client node, it 
doesn't break PME with an assertion error; it simply doesn't send affinity for that 
cache to the client node. When the client node receives the FullMessage and sees that 
affinity for some requested cache doesn't exist, it closes the cache proxy for user 
interactions, so every attempt to use that cache throws a CacheStopped exception. This 
behavior is safe because the cache destroy event should arrive on the client node soon 
and destroy that cache completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12088) Cache or template name should be validated before attempt to start

2019-08-20 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12088:


 Summary: Cache or template name should be validated before attempt 
to start
 Key: IGNITE-12088
 URL: https://issues.apache.org/jira/browse/IGNITE-12088
 Project: Ignite
  Issue Type: Bug
  Components: cache
Reporter: Pavel Kovalenko
 Fix For: 2.8


Setting a cache name that is too long can make it impossible to create the work 
directory for that cache (see the failure log below). A minimal pre-start validation 
sketch (hypothetical helper; assumes a typical 255-character file name limit):
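
{code:java}
// Hypothetical pre-start validation sketch (not the actual Ignite API): reject cache
// names that cannot be used to build a work directory name, instead of failing later
// with a critical error in FilePageStoreManager.
public class CacheNameValidator {
    /** Assumption: typical file-name length limit on ext4/NTFS. */
    private static final int MAX_FS_NAME_LEN = 255;

    /** Assumption: cache work directories are named "cache-<cacheName>". */
    private static final String CACHE_DIR_PREFIX = "cache-";

    public static void validate(String cacheName) {
        if (cacheName == null || cacheName.isEmpty())
            throw new IllegalArgumentException("Cache name must not be null or empty.");

        if (CACHE_DIR_PREFIX.length() + cacheName.length() > MAX_FS_NAME_LEN)
            throw new IllegalArgumentException("Cache name is too long to create a work directory " +
                "(max " + (MAX_FS_NAME_LEN - CACHE_DIR_PREFIX.length()) + " characters): " + cacheName);
    }
}
{code}

The failure currently observed without such validation: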

{noformat}
[2019-08-20 
19:35:42,139][ERROR][exchange-worker-#172%node1%][IgniteTestResources] Critical 
system error detected. Will be handled accordingly to configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=CRITICAL_ERROR, err=class o.a.i.IgniteCheckedException: Failed to 
initialize cache working directory (failed to create, make sure the work folder 
has correct permissions): 
/home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration 
[name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, rebalanceTimeout=1, 
evictPlc=null, evictPlcFactory=null, onheapCache=false, sqlOnheapCache=false, 
sqlOnheapCacheMaxSize=0, evictFilter=null, eagerTtl=true, dfltLockTimeout=0, 
nearCfg=null, writeSync=null, storeFactory=null, storeKeepBinary=false, 
loadPrevVal=false, aff=null, cacheMode=PARTITIONED, atomicityMode=null, 
backups=6, invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
encryptionEnabled=false, diskPageCompression=null, 
diskPageCompressionLevel=null]0]]
class org.apache.ignite.IgniteCheckedException: Failed to initialize cache 
working directory (failed to create, make sure the work folder has correct 
permissions): 
/home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration 
[name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, rebalanceTimeout=1, 
evictPlc=null, evictPlcFactory=null, onheapCache=false, sqlOnheapCache=false, 
sqlOnheapCacheMaxSize=0, evictFilter=null, eagerTtl=true, dfltLockTimeout=0, 
nearCfg=null, writeSync=null, storeFactory=null, storeKeepBinary=false, 
loadPrevVal=false, aff=null, cacheMode=PARTITIONED, atomicityMode=null, 
backups=6, invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
encryptionEnabled=false, diskPageCompression=null, 
diskPageCompressionLevel=null]0
at 
org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:769)
at 
org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:748)
at 
org.apache.ignite.internal.processors.cache.CachesRegistry.persistCacheConfigurations(CachesRegistry.java:289)
at 
org.apache.ignite.internal.processors.cache.CachesRegistry.registerAllCachesAndGroups(CachesRegistry.java:264)
at 
org.apache.ignite.internal.processors.cache.CachesRegistry.update(CachesRegistry.java:202)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:850)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:1306)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:846)
at 

[jira] [Created] (IGNITE-11852) Assertion errors when changing PME coordinator to locally joining node

2019-05-14 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-11852:


 Summary: Assertion errors when changing PME coordinator to locally 
joining node
 Key: IGNITE-11852
 URL: https://issues.apache.org/jira/browse/IGNITE-11852
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.7, 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


When the PME coordinator changes to a locally joining node, several assertion errors 
may occur:
1. When some other joining nodes have finished PME:

{noformat}
[13:49:58] (err) Failed to notify listener: 
o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1...@27296181java.lang.AssertionError:
 AffinityTopologyVersion [topVer=2, minorTopVer=0]
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1546)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1535)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.lambda$forAllRegisteredCacheGroups$e0a6939d$1(CacheAffinitySharedManager.java:1281)
at 
org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10929)
at 
org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10831)
at 
org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10811)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1280)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onLocalJoin(CacheAffinitySharedManager.java:1535)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processFullMessage(GridDhtPartitionsExchangeFuture.java:4189)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4731)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3400(GridDhtPartitionsExchangeFuture.java:145)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4622)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4611)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:466)
at 
org.apache.ignite.internal.util.future.GridCompoundFuture.checkComplete(GridCompoundFuture.java:281)
at 
org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:143)
at 
org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:44)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:455)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.InitNewCoordinatorFuture.onMessage(InitNewCoordinatorFuture.java:253)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2731)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processSinglePartitionUpdate(GridCachePartitionExchangeManager.java:1917)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.access$1300(GridCachePartitionExchangeManager.java:162)
at 

[jira] [Created] (IGNITE-11773) JDBC suite hangs due to cleared non-serializable proxy objects

2019-04-18 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-11773:


 Summary: JDBC suite hangs due to cleared non-serializable proxy 
objects
 Key: IGNITE-11773
 URL: https://issues.apache.org/jira/browse/IGNITE-11773
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8



{noformat}
[01:53:02]W: [org.apache.ignite:ignite-clients] 
java.lang.AssertionError
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.testframework.junits.GridAbstractTest$SerializableProxy.readResolve(GridAbstractTest.java:2419)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.lang.reflect.Method.invoke(Method.java:498)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2078)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:141)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:93)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:163)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:81)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10039)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.cache.CacheConfigurationEnricher.deserialize(CacheConfigurationEnricher.java:151)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.cache.CacheConfigurationEnricher.enrich(CacheConfigurationEnricher.java:122)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.cache.CacheConfigurationEnricher.enrichFully(CacheConfigurationEnricher.java:143)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.getConfigFromTemplate(GridCacheProcessor.java:3776)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.GridQueryProcessor.dynamicTableCreate(GridQueryProcessor.java:1549)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.h2.CommandProcessor.runCommandH2(CommandProcessor.java:437)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.h2.CommandProcessor.runCommand(CommandProcessor.java:195)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.executeCommand(IgniteH2Indexing.java:954)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.querySqlFields(IgniteH2Indexing.java:1038)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2292)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2288)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:2804)
[01:53:02]W: 

[jira] [Created] (IGNITE-11455) Introduce free lists rebuild mechanism

2019-02-28 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-11455:


 Summary: Introduce free lists rebuild mechanism
 Key: IGNITE-11455
 URL: https://issues.apache.org/jira/browse/IGNITE-11455
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
 Fix For: 2.8


Sometimes the state of free lists becomes invalid, as in 
https://issues.apache.org/jira/browse/IGNITE-10669, which leaves the node in an 
unrecoverable state. At the same time, free lists don't hold any critical data and can 
be rebuilt from scratch using the existing data pages. It may be useful to introduce a 
mechanism that rebuilds free lists using an efficient scan of the partition's data 
pages.
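
A conceptual sketch of the rebuild idea (hypothetical types, not Ignite's actual 
FreeList API): scan the partition's data pages, compute the free space of each page, 
and put the page back into the bucket that matches its free space.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch: free lists are rebuilt purely from the data pages themselves.
public class FreeListRebuildSketch {
    /** bucket (free-space range index) -> page ids, rebuilt from scratch. */
    private final Map<Integer, List<Long>> buckets = new HashMap<>();

    public void rebuild(Iterable<DataPage> partitionDataPages, int bucketSize) {
        buckets.clear();

        for (DataPage page : partitionDataPages) {
            int bucket = page.freeSpace() / bucketSize; // classify the page by its free space

            buckets.computeIfAbsent(bucket, b -> new ArrayList<>()).add(page.pageId());
        }
    }

    /** Hypothetical view of a scanned data page. */
    public interface DataPage {
        long pageId();

        int freeSpace();
    }
}
{code}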



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10821) Caching affinity with affinity similarity key is broken

2018-12-26 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10821:


 Summary: Caching affinity with affinity similarity key is broken
 Key: IGNITE-10821
 URL: https://issues.apache.org/jira/browse/IGNITE-10821
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


When cache groups have the same affinity function, number of partitions, number of 
backups, and the same node filter, they can share the same affinity distribution 
without explicit recalculation. These parameters are called the "affinity similarity 
key".

During affinity recalculation, caching affinity by this key can speed up the process.

However, after the merge of https://issues.apache.org/jira/browse/IGNITE-9561 this 
mechanism became broken, because parallel execution of affinity recalculation for 
similar affinity groups leads to misses on the cached affinity.

To fix it, we should group similar affinity groups together and run affinity 
recalculation for each such group in a single thread, caching the previous results.
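
An illustrative sketch of such a key (hypothetical class; the real set of parameters 
could be richer):

{code:java}
import java.util.Objects;

// Illustrative sketch of the "affinity similarity key": cache groups with equal keys can
// share one calculated affinity distribution, so recalculation for such groups should run
// in a single thread and cache its result.
public final class AffinitySimilarityKey {
    private final String affinityFunctionClass;
    private final int partitions;
    private final int backups;
    private final String nodeFilterClass;

    public AffinitySimilarityKey(String affinityFunctionClass, int partitions, int backups,
        String nodeFilterClass) {
        this.affinityFunctionClass = affinityFunctionClass;
        this.partitions = partitions;
        this.backups = backups;
        this.nodeFilterClass = nodeFilterClass;
    }

    @Override public boolean equals(Object o) {
        if (this == o)
            return true;

        if (!(o instanceof AffinitySimilarityKey))
            return false;

        AffinitySimilarityKey k = (AffinitySimilarityKey)o;

        return partitions == k.partitions && backups == k.backups
            && Objects.equals(affinityFunctionClass, k.affinityFunctionClass)
            && Objects.equals(nodeFilterClass, k.nodeFilterClass);
    }

    @Override public int hashCode() {
        return Objects.hash(affinityFunctionClass, partitions, backups, nodeFilterClass);
    }
}
{code}

Cache groups that map to the same key could then be grouped together and their 
affinity recalculated once per key in a single thread, with the result reused by every 
group in that bucket.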




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10799) Optimize affinity initialization/re-calculation

2018-12-24 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10799:


 Summary: Optimize affinity initialization/re-calculation
 Key: IGNITE-10799
 URL: https://issues.apache.org/jira/browse/IGNITE-10799
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.1
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


When persistence is enabled and a baseline is set, we have two main entry points for 
recalculating affinity:

{noformat}
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerJoinWithExchangeMergeProtocol
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerLeftWithExchangeMergeProtocol
{noformat}

Both of them follow the same recalculation approach (see the sketch after the list):
1) Take the current baseline (ideal assignment).
2) Filter out offline nodes from it.
3) Choose new primary nodes if the previous ones went away.
4) Place the temporary primary nodes into the late affinity assignment set.
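
A minimal sketch of these four steps (hypothetical types and simplifications, not the 
actual CacheAffinitySharedManager code):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Minimal sketch: for each partition, start from the ideal (baseline) owners, drop
// offline nodes, and if the ideal primary is offline the first alive backup becomes a
// temporary primary and the partition is remembered for late affinity assignment.
public class AffinityRecalcSketch {
    public static List<List<UUID>> recalculate(List<List<UUID>> idealAssignment,
        Set<UUID> aliveNodes, Set<Integer> lateAffinityParts) {
        List<List<UUID>> result = new ArrayList<>(idealAssignment.size());

        for (int p = 0; p < idealAssignment.size(); p++) {
            List<UUID> ideal = idealAssignment.get(p);

            // Steps 1-2: take the baseline owners of the partition and filter out offline nodes.
            List<UUID> owners = new ArrayList<>(ideal.size());

            for (UUID node : ideal) {
                if (aliveNodes.contains(node))
                    owners.add(node);
            }

            // Steps 3-4: if the ideal primary went away, the first alive backup is now a
            // temporary primary and the partition is scheduled for late affinity assignment.
            if (!owners.isEmpty() && !owners.get(0).equals(ideal.get(0)))
                lateAffinityParts.add(p);

            result.add(owners);
        }

        return result;
    }
}
{code}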

Looking at the implementation details, we may notice that we do a lot of unnecessary 
lookups in the online-nodes cache and array list copies. Performance becomes 
especially poor when we recalculate affinity for replicated caches (it takes O(P * N) 
on each node, where P is the partition count and N is the number of nodes in the 
cluster). With a large partition count or a large cluster, it may take a few seconds, 
which is unacceptable because this process happens during PME and freezes ongoing 
cluster operations.

We should investigate possible bottlenecks and improve the performance of 
affinity recalculation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10771) Print troubleshooting hint when exchange latch got stucked

2018-12-20 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10771:


 Summary: Print troubleshooting hint when exchange latch got stucked
 Key: IGNITE-10771
 URL: https://issues.apache.org/jira/browse/IGNITE-10771
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.8


Sometimes users face a problem where the exchange latch can't be completed:
{noformat}
2018-12-12 07:07:57:563 [exchange-worker-#42] WARN 
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture:488 - Unable to await 
partitions release latch within timeout: ClientLatch 
[coordinator=ZookeeperClusterNode [id=6b9fc6e4-5b6a-4a98-be4d-6bc1aa5c014c, 
addrs=[172.17.0.1, 10.0.230.117, 0:0:0:0:0:0:0:1%lo, 127.0.0.1], order=3, 
loc=false, client=false], ackSent=true, super=CompletableLatch [id=exchange, 
topVer=AffinityTopologyVersion [topVer=45, minorTopVer=1]]] 
{noformat}
It may indicate that some node in the cluster can't finish partitions release (i.e. 
finish all ongoing operations on the previous topology version), or it can be a silent 
network problem.
We should print a troubleshooting hint to the log to reduce the number of questions 
about this problem.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10749) Improve speed of checkpoint finalization on binary memory recovery

2018-12-20 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10749:


 Summary: Improve speed of checkpoint finalization on binary memory 
recovery
 Key: IGNITE-10749
 URL: https://issues.apache.org/jira/browse/IGNITE-10749
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
 Fix For: 2.8


Stopping a node during a checkpoint leads to binary memory recovery after the node 
starts.
When the binary memory is restored, the node performs a checkpoint that fixes a 
consistent state of the page memory.
This happens in

{noformat}
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#finalizeCheckpointOnRecovery
{noformat}

Looking at the implementation of this method, we can notice that it performs 
finalization in a single thread, which is not optimal. This process can be sped up by 
collecting checkpoint pages in parallel, as is done in regular checkpoints.
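
A hedged sketch of the parallelization idea (hypothetical helper, not the actual 
GridCacheDatabaseSharedManager code): split page collection into per-segment tasks, 
run them on a pool, and merge the results.

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Sketch: each page-memory segment is scanned by its own task; the per-segment
// dirty-page sets are merged afterwards.
public class ParallelCheckpointCollectSketch {
    public static <P> Collection<P> collectDirtyPages(List<Callable<Collection<P>>> segmentTasks,
        ExecutorService pool) throws Exception {
        List<Future<Collection<P>>> futs = new ArrayList<>();

        for (Callable<Collection<P>> task : segmentTasks)
            futs.add(pool.submit(task));          // scan segments concurrently

        Collection<P> all = new ArrayList<>();

        for (Future<Collection<P>> fut : futs)
            all.addAll(fut.get());                // merge per-segment results

        return all;
    }
}
{code}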



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10625) Do first checkpoint on node start before join to topology

2018-12-10 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10625:


 Summary: Do first checkpoint on node start before join to topology
 Key: IGNITE-10625
 URL: https://issues.apache.org/jira/browse/IGNITE-10625
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.8


If a node joins an active cluster, we do the first checkpoint during PME when the 
partition states have been restored, here:
{code:java}
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopology#afterStateRestored
 
{code}
In IGNITE-9420 we moved the logical recovery phase to before joining the topology, so 
currently, when a node joins an active cluster, it already has all partitions 
recovered. This means we can safely do the first checkpoint after all logical updates 
are applied. This change will speed up the PME process when many updates were applied 
during recovery.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10624) Cache deployment id may be different that cluster-wide after recovery

2018-12-10 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10624:


 Summary: Cache deployment id may be different that cluster-wide 
after recovery
 Key: IGNITE-10624
 URL: https://issues.apache.org/jira/browse/IGNITE-10624
 Project: Ignite
  Issue Type: Bug
  Components: cache, sql
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


When the schema for a cache is being changed 
(GridQueryProcessor#processSchemaOperationLocal),
it may produce a false-negative "CACHE_NOT_FOUND" error if the cache was started 
during recovery while the cluster-wide descriptor was changed.

{noformat}
if (cacheInfo == null || !F.eq(depId, cacheInfo.dynamicDeploymentId()))
throw new 
SchemaOperationException(SchemaOperationException.CODE_CACHE_NOT_FOUND, 
cacheName); 
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10556) Attempt to decrypt data records during read-only metastorage recovery leads to NPE

2018-12-05 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10556:


 Summary: Attempt to decrypt data records during read-only 
metastorage recovery leads to NPE
 Key: IGNITE-10556
 URL: https://issues.apache.org/jira/browse/IGNITE-10556
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
 Fix For: 2.8


Stacktrace:
{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.lambda$next$0(GridCacheDatabaseSharedManager.java:4795)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.next(GridCacheDatabaseSharedManager.java:4799)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreLogicalState.next(GridCacheDatabaseSharedManager.java:4926)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2370)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:733)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4493)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
... 20 more
{noformat}

It happens because there is no encryption key for that cache group yet: encryption 
keys are initialized only after the read-only metastorage is ready. There is a bug in 
RestoreStateContext, which tries to filter DataEntries in a DataRecord by group id 
during read-only metastorage recovery. We should explicitly skip such records before 
filtering. As a possible solution, we could provide a more flexible record filter to 
RestoreStateContext when recovering the read-only metastorage.

We should also return something more meaningful than null when no encryption key is 
found for a DataRecord, as this can be a silent problem for components iterating over 
the WAL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10493) Refactor exchange stages time measurements

2018-11-30 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10493:


 Summary: Refactor exchange stages time measurements
 Key: IGNITE-10493
 URL: https://issues.apache.org/jira/browse/IGNITE-10493
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.7
Reporter: Pavel Kovalenko
 Fix For: 2.8


In the current implementation, we don't cover and measure all code paths that 
influence PME time. Instead, we just measure the hottest individual parts with the 
following hardcoded pattern:
{noformat}
long time = currentTime();
... // some code block
print ("Stage name performed in " + (currentTime() - time));
{noformat}

This approach can be improved. Instead of declaring a time variable and printing the 
message to the log immediately, we can introduce a utility class (TimesBag) that holds 
all stages and their timings. The content of the TimesBag can be printed when the 
exchange future is done.
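
A minimal sketch of the TimesBag idea (assumed API shape, not the final 
implementation):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch: each stage records its finish timestamp; a single report with per-stage
// durations is printed once the exchange future completes.
public class TimesBag {
    private static class Stage {
        final String name;
        final long tsNanos;

        Stage(String name, long tsNanos) {
            this.name = name;
            this.tsNanos = tsNanos;
        }
    }

    private final long startNanos = System.nanoTime();
    private final List<Stage> stages = new ArrayList<>();

    /** Records that the named stage has just finished. */
    public synchronized void finishStage(String name) {
        stages.add(new Stage(name, System.nanoTime()));
    }

    /** Builds the report; called when the exchange future is done. */
    public synchronized String report() {
        StringBuilder sb = new StringBuilder("Exchange stages timings:");
        long prev = startNanos;

        for (Stage s : stages) {
            sb.append("\n  ").append(s.name).append(": ")
              .append((s.tsNanos - prev) / 1_000_000).append(" ms");

            prev = s.tsNanos;
        }

        return sb.toString();
    }
}
{code}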

Since exchange is a linear process whose init stage is executed by the exchange worker 
and whose finish stage is executed by one of the system threads, we can easily cover 
the whole exchange code base with such time cutoffs.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10485) Ability to get know more about cluster state before NODE_JOINED event is fired cluster-wide

2018-11-29 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10485:


 Summary: Ability to get know more about cluster state before 
NODE_JOINED event is fired cluster-wide
 Key: IGNITE-10485
 URL: https://issues.apache.org/jira/browse/IGNITE-10485
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Reporter: Pavel Kovalenko
 Fix For: 2.8


Currently there is no good way to learn more about the cluster before PME starts on 
node join.

It might be useful to do some pre-work (activate components if the cluster is active, 
calculate baseline affinity, clean up the PDS if the baseline changed, etc.) before 
the actual NODE_JOINED event is triggered cluster-wide and PME is started.
Such pre-work would significantly speed up PME on node join.
Currently the only place where it could be done is while processing the NodeAdded 
message on the locally joining node, but that is not a good idea because it would 
freeze processing of new discovery messages cluster-wide.

I see two ways to implement it:

1) Introduce a new intermediate node state in which the node is discovered but the 
node-join discovery event is not yet triggered. This is the right approach, but a 
complicated change, because it requires revisiting the joining process in both the TCP 
and ZooKeeper discovery protocols, with extra failover scenarios.

2) Try to get this information and do the pre-work before the discovery manager 
starts, using e.g. GridRestProcessor. This looks much simpler, but there can be races: 
the cluster state may change during the pre-work (deactivation, baseline change). In 
that case we should roll the pre-work back or just stop/restart the node to avoid 
cluster instability. However, these are rare scenarios in the real world (e.g. 
starting a baseline node and starting deactivation right after node recovery finishes).

For starters, we can expose the baseline and cluster state in our REST endpoint and 
try to move the pre-work mentioned above out of PME.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10397) SQL Schema may be lost after cluster activation and simple query run

2018-11-23 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10397:


 Summary: SQL Schema may be lost after cluster activation and 
simple query run
 Key: IGNITE-10397
 URL: https://issues.apache.org/jira/browse/IGNITE-10397
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


Scenario:

1) Start 3 grids in multithreaded mode with auto-activation.
2) Start the client.
3) Run a simple query like this:
{noformat}
cache(DEFAULT_CACHE_NAME + 0).query(new SqlQuery<>(Integer.class, 
"1=1")).getAll();
{noformat}

An exception saying that the schema was not found is thrown:

{noformat}
[2018-11-23 
19:56:57,284][ERROR][query-#223%distributed.CacheMessageStatsIndexingTest2%][GridMapQueryExecutor]
 Failed to execute local query.
class org.apache.ignite.internal.processors.query.IgniteSQLException: Failed to 
set schema for DB connection for thread [schema=default0]
at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.connectionForThread(IgniteH2Indexing.java:549)
at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.connectionForSchema(IgniteH2Indexing.java:392)
at 
org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onQueryRequest0(GridMapQueryExecutor.java:767)
at 
org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onQueryRequest(GridMapQueryExecutor.java:637)
at 
org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onMessage(GridMapQueryExecutor.java:224)
at 
org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor$2.onMessage(GridMapQueryExecutor.java:184)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$ArrayListener.onMessage(GridIoManager.java:2333)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:125)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1091)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.h2.jdbc.JdbcSQLException: Schema "default0" not found; SQL 
statement:
SET SCHEMA "default0" [90079-195]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
at org.h2.message.DbException.get(DbException.java:179)
at org.h2.message.DbException.get(DbException.java:155)
at org.h2.engine.Database.getSchema(Database.java:1755)
at org.h2.command.dml.Set.update(Set.java:408)
at org.h2.command.CommandContainer.update(CommandContainer.java:101)
at org.h2.command.Command.executeUpdate(Command.java:260)
at org.h2.jdbc.JdbcStatement.executeUpdateInternal(JdbcStatement.java:137)
at org.h2.jdbc.JdbcStatement.executeUpdate(JdbcStatement.java:122)
at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.connectionForThread(IgniteH2Indexing.java:541)
... 13 more
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10298) Possible deadlock between restore partition states and checkpoint begin

2018-11-16 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10298:


 Summary: Possible deadlock between restore partition states and 
checkpoint begin
 Key: IGNITE-10298
 URL: https://issues.apache.org/jira/browse/IGNITE-10298
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.8


There is a possible deadlock between the "restorePartitionStates" phase during cache 
start and a concurrently running checkpointer:

{noformat}
"db-checkpoint-thread-#50" #89 prio=5 os_prio=0 tid=0x1ad57800 
nid=0x2b58 waiting on condition [0x7e42e000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xddabfcc8> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7515)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1331)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.fullSize(GridCacheOffheapManager.java:1459)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:3397)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:3009)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:2934)
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)

"exchange-worker-#42" #69 prio=5 os_prio=0 tid=0x1c1cd800 nid=0x259c 
waiting on condition [0x249ae000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x80b634a0> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1328)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1212)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.initialUpdateCounter(GridCacheOffheapManager.java:1537)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.onPartitionInitialCounterUpdated(GridCacheOffheapManager.java:633)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2365)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1174)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1119)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:703)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2364)
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
{noformat}


Possible solution is performing 

[jira] [Created] (IGNITE-10235) Cache registered in QueryManager twice if parallel caches start is disabled

2018-11-13 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10235:


 Summary: Cache registered in QueryManager twice if parallel caches 
start is disabled
 Key: IGNITE-10235
 URL: https://issues.apache.org/jira/browse/IGNITE-10235
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
 Fix For: 2.8


When the IGNITE_ALLOW_START_CACHES_IN_PARALLEL property is disabled, the callback that 
registers a cache in the QueryManager is invoked twice, which makes it impossible to 
start a cache that was recovered before joining the topology.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10226) Partition may restore wrong MOVING state during crash recovery

2018-11-12 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10226:


 Summary: Partition may restore wrong MOVING state during crash 
recovery
 Key: IGNITE-10226
 URL: https://issues.apache.org/jira/browse/IGNITE-10226
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.8


This can be reproduced only in versions that don't have IGNITE-9420:

1) Start a cache, upload some data to partitions, force a checkpoint.
2) Start uploading additional data, then kill the node. The node should be killed so 
that the last checkpoint is skipped, or during the checkpoint mark phase.
3) Restart the node. The crash recovery process for partitions starts. When we create 
a partition during crash recovery (topology().forceCreatePartition()), we log its 
initial state to the WAL. If there is any logical update related to the partition, we 
log a wrong MOVING state at the end of the current WAL. This state is considered the 
last valid one when we process PartitionMetaStateRecord records during logical 
recovery. In the "restorePartitionsState" phase this state is chosen as final, and the 
partition changes to MOVING even though in page memory it is OWNING or something else.

To fix this problem in versions 2.4 - 2.7, the additional logging of the partition 
state change to the WAL during crash recovery (logical recovery) should be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10035) Fix tests IgniteWalFormatFileFailoverTest

2018-10-28 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10035:


 Summary: Fix tests IgniteWalFormatFileFailoverTest
 Key: IGNITE-10035
 URL: https://issues.apache.org/jira/browse/IGNITE-10035
 Project: Ignite
  Issue Type: New Feature
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


After IGNITE-9420 was introduced, the WAL archiver component is started together with 
the WAL manager. The tests assume that the WAL archiver is started after the first 
activation, with the proper file I/O factory injected into it. We need to find out how 
to inject the file I/O factory before the file archiver is started.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9725) Introduce affinity distribution prototype for equal cache group configurations

2018-09-27 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9725:
---

 Summary: Introduce affinity distribution prototype for equal cache 
group configurations
 Key: IGNITE-9725
 URL: https://issues.apache.org/jira/browse/IGNITE-9725
 Project: Ignite
  Issue Type: New Feature
  Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
 Fix For: 2.8


Currently, we perform affinity re-calculation for each cache group, even if their 
configurations (cache mode, number of backups, affinity function) are equal.

If two cache groups have the same affinity function and number of backups, we can 
calculate an affinity prototype for such groups once and reuse it for every cache 
group.

This change will save time on affinity re-calculation when a cluster has a lot of 
cache groups with the same affinity function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9683) Create manual pinger for ZK client

2018-09-25 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9683:
---

 Summary: Create manual pinger for ZK client
 Key: IGNITE-9683
 URL: https://issues.apache.org/jira/browse/IGNITE-9683
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.8


Losing the connection to ZooKeeper for longer than the ZK session timeout is 
unacceptable for server nodes. To improve connection durability, we need to keep the 
session with ZK alive for as long as possible. We should introduce a manual pinger, in 
addition to the ZK client, that pings the ZK server with a simple request every tick 
time.
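
A minimal sketch of such a pinger (assumed approach, not the final design; uses the 
standard ZooKeeper client API):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.ZooKeeper;

// Sketch: periodically issue a cheap read request so the ZK session stays active even if
// the regular discovery traffic is temporarily stalled.
public class ZkManualPinger {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(ZooKeeper zk, long tickTimeMs) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                zk.exists("/", false); // cheap request that refreshes the session
            }
            catch (Exception e) {
                // Connection problems are handled by the main ZK client logic; just log here.
                System.err.println("ZK ping failed: " + e.getMessage());
            }
        }, tickTimeMs, tickTimeMs, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
{code}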



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9661) Improve partition states validation

2018-09-21 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9661:
---

 Summary: Improve partition states validation
 Key: IGNITE-9661
 URL: https://issues.apache.org/jira/browse/IGNITE-9661
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


Currently, we validate partition states one by one, and the whole algorithm has 
complexity O(G * P * N * log P), where G is the number of cache groups, P is the 
number of partitions in each cache group, and N is the number of nodes. The overall 
complexity can be optimized (the log P factor can be removed). We should also consider 
parallelizing the algorithm.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9649) Rework logging in important places

2018-09-19 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9649:
---

 Summary: Rework logging in important places
 Key: IGNITE-9649
 URL: https://issues.apache.org/jira/browse/IGNITE-9649
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


Currently, we have insufficient, incomplete, or excessively verbose logs at the DEBUG 
and TRACE levels.

We should revisit and rework logging in important places of product:

1) Partitions Map Exchange

2) Rebalance

3) Partitions workflow

4) Time logging for critical places



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9562) Destroyed cache that resurrected on a old offline node breaks PME

2018-09-12 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9562:
---

 Summary: Destroyed cache that resurrected on a old offline node 
breaks PME
 Key: IGNITE-9562
 URL: https://issues.apache.org/jira/browse/IGNITE-9562
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.8


Given: 2 nodes, persistence enabled.
1) Stop 1 node.
2) Destroy a cache through the client.
3) Start the stopped node.

When the stopped node joins the cluster, it starts all caches that it had seen before 
stopping.
If such a cache was destroyed cluster-wide, this breaks the crash recovery process or 
PME.

Root cause: we don't start/collect caches from the stopped node on the rest of the 
cluster.

In PARTITIONED cache mode this scenario breaks crash recovery:
{noformat}
java.lang.AssertionError: AffinityTopologyVersion [topVer=-1, minorTopVer=0]

at 
org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:696)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.updateLocal(GridDhtPartitionTopologyImpl.java:2449)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.afterStateRestored(GridDhtPartitionTopologyImpl.java:679)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2445)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLastUpdates(GridCacheDatabaseSharedManager.java:2321)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreState(GridCacheDatabaseSharedManager.java:1568)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1308)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1255)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:766)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2577)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2457)
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
{noformat}

In REPLICATED cache mode this scenario breaks the PME process on the coordinator:
{noformat}
[2018-09-12 
18:50:36,407][ERROR][sys-#148%distributed.CacheStopAndRessurectOnOldNodeTest0%][GridCacheIoManager]
 Failed to process message [senderId=4b6fd0d4-b756-4a9f-90ca-f0ee2511, 
messageType=class 
o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsSingleMessage]
java.lang.AssertionError: 3080586
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.clientTopology(GridCachePartitionExchangeManager.java:815)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.updatePartitionSingleMap(GridDhtPartitionsExchangeFuture.java:3621)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processSingleMessage(GridDhtPartitionsExchangeFuture.java:2439)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$100(GridDhtPartitionsExchangeFuture.java:137)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2261)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2249)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:383)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:353)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2249)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processSinglePartitionUpdate(GridCachePartitionExchangeManager.java:1628)
at 

[jira] [Created] (IGNITE-9561) Optimize affinity initialization for started cache groups

2018-09-12 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9561:
---

 Summary: Optimize affinity initialization for started cache groups
 Key: IGNITE-9561
 URL: https://issues.apache.org/jira/browse/IGNITE-9561
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


At the end of the
{noformat}
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#processCacheStartRequests
{noformat}
method we initialize affinity for the cache groups being started on the current 
exchange.
We do it one by one and synchronously wait for an AffinityFetchResponse for each 
of the starting groups. This is inefficient. We may parallelize this process 
and speed up cache startup, as sketched below.
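
A minimal sketch of the parallelized variant. The fetchAffinity() and applyAffinity() helpers, the executor and the way CacheGroupDescriptor/AffinityAssignment are used here are assumptions of this example, not existing Ignite internals:
{code:java}
// Sketch only: fetchAffinity() encapsulates the current synchronous
// request/response pair for a single group; applyAffinity() stores the result.
void initAffinityInParallel(Collection<CacheGroupDescriptor> startingGroups) throws Exception {
    ExecutorService exec = Executors.newFixedThreadPool(Math.min(startingGroups.size(), 8));

    try {
        List<Future<AffinityAssignment>> futs = new ArrayList<>();

        for (CacheGroupDescriptor grpDesc : startingGroups)
            futs.add(exec.submit(() -> fetchAffinity(grpDesc))); // each group is fetched concurrently

        for (Future<AffinityAssignment> fut : futs)
            applyAffinity(fut.get()); // wait for all responses instead of one-by-one
    }
    finally {
        exec.shutdown();
    }
}
{code}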



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9501) Exclude newly joining nodes from exchange latch

2018-09-07 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9501:
---

 Summary: Exclude newly joining nodes from exchange latch 
 Key: IGNITE-9501
 URL: https://issues.apache.org/jira/browse/IGNITE-9501
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


Currently, we wait for latch completion from newly joining nodes. However, such 
nodes don't have any updates that need to be synced during partitions release 
waiting. Newly joining nodes may start their caches before the exchange latch is 
created, and this can delay the exchange process.

We should explicitly ignore such nodes and not include them in the latch 
participants, as sketched below.
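
A minimal sketch of the filtering, assuming the latch participants are computed from the alive server nodes and that a node joined on the current exchange has an order() greater than or equal to the exchange topology version (both assumptions of this example):
{code:java}
// Sketch: exclude nodes that joined on this exchange from the latch participants.
// aliveServerNodes and exchTopVer are assumed to be available in the latch manager.
List<ClusterNode> participants = aliveServerNodes.stream()
    .filter(n -> n.order() < exchTopVer.topologyVersion()) // joined strictly before this exchange
    .collect(Collectors.toList());
{code}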



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9496) Add listenAsync method to GridFutureAdapter

2018-09-07 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9496:
---

 Summary: Add listenAsync method to GridFutureAdapter
 Key: IGNITE-9496
 URL: https://issues.apache.org/jira/browse/IGNITE-9496
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.7


Currently, there is no way to add an asynchronous listener to an internal 
future and choose an appropriate executor for it.

This would be useful for changing the thread that executes a future listener.

We should add a listenAsync method to GridFutureAdapter with the ability to 
pass an arbitrary submitter/executor for such listeners.
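
A minimal sketch of the proposed method inside GridFutureAdapter<R>; the name and exact semantics are an assumption of this ticket, only the existing listen() call is reused:
{code:java}
/**
 * Sketch: registers a listener that is invoked on the given executor instead of
 * the thread that completes the future.
 */
public void listenAsync(IgniteInClosure<? super IgniteInternalFuture<R>> lsnr, Executor exec) {
    // Reuse the existing synchronous listen(): the wrapper only hands the
    // user callback over to the supplied executor.
    listen(fut -> exec.execute(() -> lsnr.apply(fut)));
}
{code}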



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9494) Communication error resolver may be invoked when topology is under construction

2018-09-07 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9494:
---

 Summary: Communication error resolver may be invoked when topology 
is under construction
 Key: IGNITE-9494
 URL: https://issues.apache.org/jira/browse/IGNITE-9494
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.7


Zookeeper Discovery.
During a massive node start and join to the topology, communication errors can 
occur and lead to invoking the communication error resolver.
The communication error resolver initiates a peer-to-peer ping process on all alive 
nodes. The youngest nodes in a cluster may have an incomplete picture of the 
alive nodes. This can lead to a situation where the youngest node does not ping 
all available nodes, and the coordinator may decide that those nodes have an 
unstable network and unexpectedly kill them.
We should throttle the communication error resolver in case of a massive node join 
and give the joining nodes time to get a complete picture of the topology.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9493) Communication error resolver shouldn't be invoked if connection with client breaks unexpectedly

2018-09-07 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9493:
---

 Summary: Communication error resolver shouldn't be invoked if 
connection with client breaks unexpectedly
 Key: IGNITE-9493
 URL: https://issues.apache.org/jira/browse/IGNITE-9493
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.7


Currently, we initiate the communication error resolving process even if a 
connection between a server and a client breaks unexpectedly.

This is unnecessary because client nodes are not important for cluster 
stability. We should ignore communication errors for client and daemon nodes, 
e.g. with a guard like the one below.
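
A minimal sketch of such a guard; the placement and the startCommunicationErrorResolve() entry point are assumptions of this example, only ClusterNode.isClient()/isDaemon() are existing API:
{code:java}
// Sketch: skip the resolve process entirely for client and daemon nodes.
if (failedNode.isClient() || failedNode.isDaemon())
    return; // connectivity of clients/daemons does not affect cluster stability

startCommunicationErrorResolve(failedNode); // assumed existing entry point
{code}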



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9492) Limit number of threads which process SingleMessage with exchangeId==null

2018-09-07 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9492:
---

 Summary: Limit number of threads which process SingleMessage with 
exchangeId==null
 Key: IGNITE-9492
 URL: https://issues.apache.org/jira/browse/IGNITE-9492
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


Currently, after each PME the coordinator spends a lot of time processing 
corrective Single messages (with exchange id == null). This leads to growing 
inbound/outbound message queues and delays other coordinator-aware processes.

Processing single messages with exchange id == null is not important enough to 
give it all available resources. We should limit the number of sys-threads 
that are able to process such messages, for example with a semaphore as sketched below.
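
A minimal sketch of such a limit using a semaphore; the limit value, the field and the doProcessSingleMessage() helper are assumptions of this example:
{code:java}
// Sketch: at most two sys threads may process single messages with a null exchange id.
private final Semaphore nullExchIdPermits = new Semaphore(2);

void processSingleMessage(UUID senderId, GridDhtPartitionsSingleMessage msg) {
    boolean limited = msg.exchangeId() == null;

    if (limited)
        nullExchIdPermits.acquireUninterruptibly();

    try {
        doProcessSingleMessage(senderId, msg); // assumed existing processing logic
    }
    finally {
        if (limited)
            nullExchIdPermits.release();
    }
}
{code}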



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9491) Exchange latch coordinator shouldn't be oldest node in a cluster

2018-09-07 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9491:
---

 Summary: Exchange latch coordinator shouldn't be oldest node in a 
cluster
 Key: IGNITE-9491
 URL: https://issues.apache.org/jira/browse/IGNITE-9491
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


Currently, we have a lot of components with coordinator election ability. 
Each of these components elects the oldest node as coordinator. This overloads 
the oldest node and may delay some processes.

In large topologies the oldest node can have large inbound/outbound message 
queues, which delays processing of Exchange Latch Ack messages. 
We should choose the second-oldest node as the latch coordinator to unload the 
oldest one. This change will significantly accelerate the exchange latch waiting 
process; a sketch of the selection is shown below.
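
A minimal sketch of the selection, assuming the latch manager already has the collection of alive server nodes:
{code:java}
// Sketch: pick the second-oldest server node as the latch coordinator,
// falling back to the oldest when there is only one server node.
List<ClusterNode> srvs = new ArrayList<>(aliveServerNodes);
srvs.sort(Comparator.comparingLong(ClusterNode::order)); // oldest first

ClusterNode latchCrd = srvs.size() > 1 ? srvs.get(1) : srvs.get(0);
{code}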




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9449) Lazy unmarshalling of discovery events in TcpDiscovery

2018-09-03 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9449:
---

 Summary: Lazy unmarshalling of discovery events in TcpDiscovery
 Key: IGNITE-9449
 URL: https://issues.apache.org/jira/browse/IGNITE-9449
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.6, 2.5, 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.7


Currently, the disco-msg-worker thread spends a major part of its time on discovery 
message unmarshalling before sending the message to the next node. In most cases this is 
unnecessary, and the message can be sent immediately after receiving it and notifying 
the discovery-event-worker.
The responsibility for unmarshalling should be moved to the discovery-event-worker thread; 
this improvement will significantly reduce the latency of sending custom 
messages across the ring. A sketch of the split is shown below.
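
A minimal sketch of the proposed split between the ring worker and the event worker; the queue, sendToNextNode(), unmarshal() and notifyListeners() are assumptions of this example, not existing Ignite internals:
{code:java}
// Ring worker: forward the raw bytes immediately, defer unmarshalling.
void onCustomMessageBytes(byte[] msgBytes) {
    sendToNextNode(msgBytes);         // no unmarshalling on the ring worker path
    eventWorkerQueue.offer(msgBytes); // hand the payload to the event worker
}

// Discovery-event-worker: unmarshal and notify listeners off the ring path.
void eventWorkerLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
        byte[] bytes = eventWorkerQueue.take();

        notifyListeners(unmarshal(bytes)); // unmarshalling happens here
    }
}
{code}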



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9420) Move logical recovery phase outside of PME

2018-08-29 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9420:
---

 Summary: Move logical recovery phase outside of PME
 Key: IGNITE-9420
 URL: https://issues.apache.org/jira/browse/IGNITE-9420
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.7


Currently, we perform logical recovery during PME here:
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#restoreState
We should move logical recovery to before the discovery manager starts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9419) Avoid saving cache configuration synchronously during PME

2018-08-29 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9419:
---

 Summary: Avoid saving cache configuration synchronously during PME
 Key: IGNITE-9419
 URL: https://issues.apache.org/jira/browse/IGNITE-9419
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.7


Currently, we save cache configurations during PME at the activation phase here:
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.CachesInfo#updateCachesInfo
We should avoid this, as it performs disk operations. We should save the 
configurations asynchronously or lazily.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9418) Avoid initialize file page store manager for caches during PME synchronously

2018-08-29 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9418:
---

 Summary: Avoid initialize file page store manager for caches 
during PME synchronously
 Key: IGNITE-9418
 URL: https://issues.apache.org/jira/browse/IGNITE-9418
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


Currently, we create partition and index files for starting caches during PME, 
at the beginning of the 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#readCheckpointAndRestoreMemory
method.
We should avoid this because it can take a long time, as we perform 
writes to disk.
If a cache was registered during PME, we should initialize its page store lazily or 
asynchronously.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9398) Reduce time on processing CustomDiscoveryMessage by discovery worker

2018-08-28 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9398:
---

 Summary: Reduce time on processing CustomDiscoveryMessage by 
discovery worker
 Key: IGNITE-9398
 URL: https://issues.apache.org/jira/browse/IGNITE-9398
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.6, 2.5, 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.7


Processing a discovery CustomMessage may take a significant amount of time (0.5-0.7 
seconds) before the message is sent to the next node in the topology. This significantly 
increases the total PME time if the topology has multiple nodes.
Let X = the time the discovery-msg-worker on each node spends processing a discovery 
message before sending it to the next node.
Let N = the number of nodes in the topology.
Then the minimal total time of the exchange will be:
T = N * X
For example, with X ≈ 0.6 seconds (as in the log below, where processing took 629 ms) 
and N = 110 nodes, ring propagation alone adds roughly 66 seconds to the exchange.

We shouldn't perform heavy actions while processing a discovery message. The best 
solution is a separate thread that does the processing, while the discovery-msg-worker 
just passes the message to that thread and immediately sends it to the next node in 
the topology.

This affects both TcpDiscoverySpi and ZkDiscoverySpi.

{noformat}
[11:59:33,134][INFO][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Enqueued message 
type = TcpDiscoveryCustomEventMessage id = 
e4b542b6561-a38dfe31-dcfd-430b-acb3-5a531db4197e time = 0
[11:59:33,537][INFO][tcp-disco-msg-worker-#2][GridSnapshotAwareClusterStateProcessorImpl]
 Received activate request with BaselineTopology[id=0]
[11:59:33,549][INFO][tcp-disco-msg-worker-#2][GridSnapshotAwareClusterStateProcessorImpl]
 Started state transition: true
[11:59:33,752][INFO][exchange-worker-#62][time] Started exchange init 
[topVer=AffinityTopologyVersion [topVer=110, minorTopVer=1], crd=true, 
evt=DISCOVERY_CUSTOM_EVT, evtNode=a38dfe31-dcfd-430b-acb3-5a531db4197e, 
customEvt=ChangeGlobalStateMessage 
[id=cea542b6561-47395de6-c204-4576-a0a3-99ec53d41ac3, 
reqId=5b651439-7a6a-43fc-9cb0-d646c3380576, 
initiatingNodeId=a38dfe31-dcfd-430b-acb3-5a531db4197e, activate=true, 
baselineTopology=BaselineTopology [id=0, branchingHash=-69412111965, 
branchingType='New BaselineTopology', baselineNodes=[node42, node43, node44, 
node45, node46, node47, node48, node49, node50, node51, node52, node53, node54, 
node55, node56, node57, node58, node59, node1, node4, node5, node2, node3, 
node8, node9, node6, node7, node60, node61, node62, node63, node64, node65, 
node66, node67, node68, node69, node70, node71, node72, node73, node74, node75, 
node76, node77, node78, node79, node80, node81, node82, node83, node84, node85, 
node86, node87, node88, node89, node90, node91, node92, node93, node94, node95, 
node96, node97, node10, node98, node11, node99, node12, node13, node14, node15, 
node16, node100, node17, node18, node19, node108, node107, node106, node105, 
node104, node103, node102, node101, node109, node20, node21, node22, node23, 
node24, node25, node26, node27, node28, node29, node110, node30, node31, 
node32, node33, node34, node35, node36, node37, node38, node39, node40, 
node41]], forceChangeBaselineTopology=false, timestamp=1535101173015], 
allowMerge=false]
[11:59:33,753][INFO][exchange-worker-#62][GridDhtPartitionsExchangeFuture] 
Start activation process [nodeId=1906b9c3-73f4-4c30-85cc-cf6b99c3bab9, 
client=false, topVer=AffinityTopologyVersion [topVer=110, minorTopVer=1]]
[11:59:33,756][INFO][exchange-worker-#62][FilePageStoreManager] Resolved page 
store work directory: 
/storage/ssd/avolkov/tiden/snapshots-180824-114937/test_pitr/ignite.server.1/work/db/node1
[11:59:33,756][INFO][exchange-worker-#62][FileWriteAheadLogManager] Resolved 
write ahead log work directory: 
/storage/ssd/avolkov/tiden/snapshots-180824-114937/test_pitr/ignite.server.1/work/db/wal/node1
[11:59:33,756][INFO][exchange-worker-#62][FileWriteAheadLogManager] Resolved 
write ahead log archive directory: 
/storage/ssd/avolkov/tiden/snapshots-180824-114937/test_pitr/ignite.server.1/work/db/wal/archive/node1
[11:59:33,757][INFO][exchange-worker-#62][FileWriteAheadLogManager] Started 
write-ahead log manager [mode=LOG_ONLY]
[11:59:33,763][INFO][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Processed 
message type = TcpDiscoveryCustomEventMessage id = 
e4b542b6561-a38dfe31-dcfd-430b-acb3-5a531db4197e time = 629
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9271) Implement transaction commit using thread per partition model

2018-08-14 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9271:
---

 Summary: Implement transaction commit using thread per partition 
model
 Key: IGNITE-9271
 URL: https://issues.apache.org/jira/browse/IGNITE-9271
 Project: Ignite
  Issue Type: Sub-task
  Components: cache
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


Currently, we perform the commit of a transaction from a sys thread and do write 
operations on multiple partitions from it.
We should delegate such operations to the appropriate threads and wait for the 
results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9270) Design thread per partition model

2018-08-14 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9270:
---

 Summary: Design thread per partition model
 Key: IGNITE-9270
 URL: https://issues.apache.org/jira/browse/IGNITE-9270
 Project: Ignite
  Issue Type: Sub-task
  Components: cache
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


A new model of executing cache partition operations (READ, CREATE, UPDATE, 
DELETE) should satisfy the following conditions:
1) All modifying operations (CREATE, UPDATE, DELETE) on a given partition must be 
performed by the same thread.
2) Read operations can be executed by any thread.

We should investigate the performance of a dedicated executor service for 
such operations versus a messaging model where network threads 
perform them. A minimal sketch of a partition-striped executor is shown below.
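
A minimal sketch of a partition-striped executor satisfying condition 1; all names here are illustrative, not Ignite internals:
{code:java}
import java.util.concurrent.*;

/** Routes all modifying operations for a partition to the same single thread. */
public class PartitionStripedExecutor {
    private final ExecutorService[] stripes;

    public PartitionStripedExecutor(int stripeCnt) {
        stripes = new ExecutorService[stripeCnt];

        for (int i = 0; i < stripeCnt; i++)
            stripes[i] = Executors.newSingleThreadExecutor();
    }

    /** Operations for the same partition always land on the same stripe thread. */
    public Future<?> submit(int partId, Runnable op) {
        return stripes[partId % stripes.length].submit(op);
    }
}
{code}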



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9206) Node can't join to ring if all existing nodes have stopped and another new node joined ahead

2018-08-07 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9206:
---

 Summary: Node can't join to ring if all existing nodes have 
stopped and another new node joined ahead
 Key: IGNITE-9206
 URL: https://issues.apache.org/jira/browse/IGNITE-9206
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.7


TcpDiscovery SPI problem.
Situation:
An existing cluster with nodes 1 and 2. Nodes 1 and 2 are stopping.
1) Node 3 joins the cluster and sends a JoinMessage to node 2.
2) Node 2 is stopping and unable to handle the JoinMessage from node 3. Node 3 
chooses node 1 as the next node to send the message to.
3) Node 3 sends the JoinMessage to node 1.
4) Node 4 joins the cluster.
5) Node 1 is stopping and unable to handle the JoinMessage from node 3.
6) Node 4 sees that there are no alive nodes in the ring at that time and becomes 
the first node in the topology.
7) Node 3 sends the JoinMessage to node 4, and this process repeats again and again 
without any success.
In node 4's logs we can see that the remote connection from node 3 is established but 
no active actions are performed. Node 3 remains in CONNECTING state forever. At 
the same time node 4 thinks that node 3 is already in the ring.

Failed test:
GridCacheReplicatedDataStructuresFailoverSelfTest#testAtomicSequenceConstantTopologyChange

Link to TC:
https://ci.ignite.apache.org/viewLog.html?buildId=1594376=buildResultsDiv=IgniteTests24Java8_DataStructures

Shrinked log:
{code:java}
[00:09:13] : [Step 3/4] [2018-08-04 21:09:13,733][INFO ][main][root] >>> 
Stopping grid 
[name=replicated.GridCacheReplicatedDataStructuresFailoverSelfTest0, 
id=3e2c94bd-8e98-4dd9-8d1a-befbfe00]
[00:09:13] : [Step 3/4] [2018-08-04 21:09:13,739][INFO 
][thread-replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7][root] 
Start node: replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7
[00:09:13] : [Step 3/4] [2018-08-04 21:09:13,740][INFO 
][tcp-disco-msg-worker-#2146%replicated.GridCacheReplicatedDataStructuresFailoverSelfTest6%][TcpDiscoverySpi]
 New next node [newNext=TcpDiscoveryNode 
[id=3e2c94bd-8e98-4dd9-8d1a-befbfe00, addrs=ArrayList [127.0.0.1], 
sockAddrs=HashSet [/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, 
lastExchangeTime=1533416953738, loc=false, ver=2.7.0#20180803-sha1:3ab8bbad, 
isClient=false]]
[00:09:13] : [Step 3/4] [2018-08-04 21:09:13,741][INFO 
][tcp-disco-srvr-#2100%replicated.GridCacheReplicatedDataStructuresFailoverSelfTest0%][TcpDiscoverySpi]
 TCP discovery accepted incoming connection [rmtAddr=/127.0.0.1, rmtPort=50099]
[00:09:13] : [Step 3/4] [2018-08-04 21:09:13,741][INFO 
][tcp-disco-srvr-#2100%replicated.GridCacheReplicatedDataStructuresFailoverSelfTest0%][TcpDiscoverySpi]
 TCP discovery spawning a new thread for connection [rmtAddr=/127.0.0.1, 
rmtPort=50099]
[00:09:13] : [Step 3/4] [2018-08-04 21:09:13,743][INFO 
][tcp-disco-sock-reader-#2151%replicated.GridCacheReplicatedDataStructuresFailoverSelfTest0%][TcpDiscoverySpi]
 Started serving remote node connection [rmtAddr=/127.0.0.1:50099, 
rmtPort=50099]
[00:09:13] : [Step 3/4] [2018-08-04 21:09:13,746][INFO 
][thread-replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7][GridCacheReplicatedDataStructuresFailoverSelfTest7]
 
[00:09:13] : [Step 3/4] 
[00:09:13] : [Step 3/4] >>>__    
[00:09:13] : [Step 3/4] >>>   /  _/ ___/ |/ /  _/_  __/ __/  
[00:09:13] : [Step 3/4] >>>  _/ // (7 7// /  / / / _/
[00:09:13] : [Step 3/4] >>> /___/\___/_/|_/___/ /_/ /___/   
[00:09:13] : [Step 3/4] >>> 
[00:09:13] : [Step 3/4] >>> ver. 2.7.0-SNAPSHOT#20180803-sha1:3ab8bbad
[00:09:13] : [Step 3/4] >>> 2018 Copyright(C) Apache Software Foundation
[00:09:13] : [Step 3/4] >>> 
[00:09:13] : [Step 3/4] >>> Ignite documentation: http://ignite.apache.org
[00:09:13] : [Step 3/4] 
[00:09:13] : [Step 3/4] [2018-08-04 21:09:13,746][INFO 
][thread-replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7][GridCacheReplicatedDataStructuresFailoverSelfTest7]
 Config URL: n/a
[00:09:13] : [Step 3/4] [2018-08-04 21:09:13,747][INFO 
][thread-replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7][GridCacheReplicatedDataStructuresFailoverSelfTest7]
 IgniteConfiguration 
[igniteInstanceName=replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7,
 pubPoolSize=8, svcPoolSize=8, callbackPoolSize=8, stripedPoolSize=8, 
sysPoolSize=8, mgmtPoolSize=4, igfsPoolSize=5, dataStreamerPoolSize=8, 
utilityCachePoolSize=8, utilityCacheKeepAliveTime=6, p2pPoolSize=2, 
qryPoolSize=8, igniteHome=/data/teamcity/work/9198da4c51c3e112, 
igniteWorkDir=/data/teamcity/work/9198da4c51c3e112/work, 
mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@13fed1ec, 
nodeId=fe9e7ca7-c0fa-4b51-8a87-1255f8c7, marsh=BinaryMarshaller [], 

[jira] [Created] (IGNITE-9185) Collect and check update counters visited during WAL rebalance

2018-08-03 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9185:
---

 Summary: Collect and check update counters visited during WAL 
rebalance
 Key: IGNITE-9185
 URL: https://issues.apache.org/jira/browse/IGNITE-9185
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


Currently, we don't check which update counters we visit during WAL iteration and 
what data we send to a demander node. There can be a situation where we meet the last 
requested update counter in WAL and stop the rebalance process, while due to 
possible DataRecord reordering we miss some updates after it.
If the rebalance process stops because the WAL ends but not all data records have been 
visited, we can easily check which records were missed, cancel the rebalance and print 
useful information to the log for further debugging. A sketch of such a check is shown below.
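
A minimal sketch of tracking and verifying visited counters; the class, names and data structures are assumptions of this example:
{code:java}
import java.util.HashMap;
import java.util.Map;

/** Sketch: verifies that WAL iteration actually reached the requested counters. */
class WalRebalanceCounterCheck {
    /** Highest update counter seen per partition during WAL iteration. */
    private final Map<Integer, Long> lastSeenCntrs = new HashMap<>();

    void onDataRecord(int partId, long updCntr) {
        lastSeenCntrs.merge(partId, updCntr, Math::max);
    }

    /** @return {@code true} if every requested partition reached its requested counter. */
    boolean allCountersReached(Map<Integer, Long> requestedCntrs) {
        return requestedCntrs.entrySet().stream()
            .allMatch(e -> lastSeenCntrs.getOrDefault(e.getKey(), 0L) >= e.getValue());
    }
}
{code}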



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9157) Optimize memory usage of data regions in tests

2018-08-01 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9157:
---

 Summary: Optimize memory usage of data regions in tests
 Key: IGNITE-9157
 URL: https://issues.apache.org/jira/browse/IGNITE-9157
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.6
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


If we use persistence in tests and do not explicitly set the max size of a data 
region, by default it will be 20% of the available RAM on the host. This can lead to 
memory over-usage, and sometimes the JVMs running such tests are 
killed by the Linux OOM killer.
We should find all tests where the data region max size has been forgotten and set this 
value explicitly to the minimal possible value, e.g. as shown below.
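
One way to cap the region in a test configuration; the 64 MB value is only an illustration of "minimal possible", not a recommendation from this ticket:
{code:java}
IgniteConfiguration cfg = new IgniteConfiguration()
    .setDataStorageConfiguration(new DataStorageConfiguration()
        .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
            .setPersistenceEnabled(true)
            .setMaxSize(64L * 1024 * 1024))); // explicit small max size instead of 20% of RAM
{code}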



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9129) P2P class deployment is failed when using ZK discovery

2018-07-30 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9129:
---

 Summary: P2P class deployment is failed when using ZK discovery
 Key: IGNITE-9129
 URL: https://issues.apache.org/jira/browse/IGNITE-9129
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.6, 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


When Zookeeper Discovery is used, a node that joins a cluster 
receives information that some user classes have already been deployed but 
do not exist in its local classpath. In this case, the node tries to request 
these classes from the nodes that contain them, but it does so synchronously during 
Zookeeper Discovery startup and gets a NullPointerException when the first topology 
snapshot has not been initialized yet.
We should request the user classes asynchronously and only after the first 
topology snapshot has been initialized.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9121) Revisit future.get() usages when process message from Communication SPI

2018-07-30 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9121:
---

 Summary: Revisit future.get() usages when process message from 
Communication SPI
 Key: IGNITE-9121
 URL: https://issues.apache.org/jira/browse/IGNITE-9121
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.6, 2.5
Reporter: Pavel Kovalenko


Currently, we use explicit synchronous future.get() calls when processing messages from 
the Communication SPI. This may lead to deadlocks due to thread-pool 
exhaustion, as shown in IGNITE-9111, for example.
To fix the problem we should find all places in the code where we 
synchronously wait for futures and either refactor those places or 
introduce a special exception (carrying such a future) so that the low-level 
Communication SPI processing can retry the runnable once the future 
completes. A sketch of the refactoring is shown below.
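
A minimal sketch of the non-blocking refactoring, assuming an IgniteInternalFuture and illustrative process()/handleError() helpers:
{code:java}
// Blocking variant that occupies a communication thread:
//     process(fut.get());
//
// Non-blocking variant: attach the continuation as a listener instead.
fut.listen(f -> {
    try {
        process(f.get()); // the future is already completed here, so get() does not block
    }
    catch (IgniteCheckedException e) {
        handleError(e);
    }
});
{code}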



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9111) Do not wait for deactivation in GridClusterStateProcessor#publicApiActiveState

2018-07-27 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9111:
---

 Summary: Do not wait for deactivation in 
GridClusterStateProcessor#publicApiActiveState
 Key: IGNITE-9111
 URL: https://issues.apache.org/jira/browse/IGNITE-9111
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5, 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


Currently, we wait for the activation/deactivation future when checking the state of the 
cluster. But when deactivation is in progress it doesn't make sense to wait for 
it, because after a successful wait we will throw an exception that the cluster 
is not active. Synchronous waiting for the deactivation future may also lead to 
deadlocks if the operation obtains some locks before checking the cluster state.

As a solution, we should check and wait only for activation futures. In case 
of an in-progress deactivation, we should fail fast and return "false" from the 
publicApiActiveState method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9088) Add ability to dump persistence after particular test

2018-07-26 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9088:
---

 Summary: Add ability to dump persistence after particular test
 Key: IGNITE-9088
 URL: https://issues.apache.org/jira/browse/IGNITE-9088
 Project: Ignite
  Issue Type: Improvement
  Components: persistence
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


Sometimes it is necessary to analyze persistence after a particular test finishes on 
TeamCity.
We need to add the ability to dump persistence dirs/files to a specified 
directory on the test host for further analysis.
This should be controlled by a property.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9086) Error during commit transaction on primary node may lead to breaking transaction data integrity

2018-07-26 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9086:
---

 Summary: Error during commit transaction on primary node may lead 
to breaking transaction data integrity
 Key: IGNITE-9086
 URL: https://issues.apache.org/jira/browse/IGNITE-9086
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.6, 2.5, 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.7


Transaction properties are PESSIMISTIC, REPEATABLE READ.

If the primary partitions participating in the transaction are spread across 
several nodes and the commit fails on some of the primary nodes while other 
primary nodes have committed the transaction, it may break transaction 
data integrity. The data remains inconsistent even after rebalance, when the node 
with the failed commit returns to the cluster.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9084) Trash in WAL after node stop may affect WAL rebalance

2018-07-25 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9084:
---

 Summary: Trash in WAL after node stop may affect WAL rebalance
 Key: IGNITE-9084
 URL: https://issues.apache.org/jira/browse/IGNITE-9084
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.6
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


During iteration over WAL we can encounter garbage in a WAL segment, which can 
remain after a node restart. We should handle this situation in the WAL rebalance 
iterator and gracefully stop the iteration process.

{noformat}
[2018-07-25 
17:18:21,152][ERROR][sys-#25385%persistence.IgnitePdsTxHistoricalRebalancingTest0%][GridCacheIoManager]
 Failed to process message [senderId=f0d35df7-ff93-4b6c-b699-45f3e7c3, 
messageType=class 
o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionDemandMessage]
class org.apache.ignite.IgniteException: Failed to read WAL record at position: 
19346739 size: 67108864
at 
org.apache.ignite.internal.util.lang.GridIteratorAdapter.next(GridIteratorAdapter.java:38)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$WALHistoricalIterator.advance(GridCacheOffheapManager.java:1033)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$WALHistoricalIterator.next(GridCacheOffheapManager.java:948)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$WALHistoricalIterator.nextX(GridCacheOffheapManager.java:917)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$WALHistoricalIterator.nextX(GridCacheOffheapManager.java:842)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.IgniteRebalanceIteratorImpl.nextX(IgniteRebalanceIteratorImpl.java:130)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.IgniteRebalanceIteratorImpl.next(IgniteRebalanceIteratorImpl.java:185)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.IgniteRebalanceIteratorImpl.next(IgniteRebalanceIteratorImpl.java:37)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:348)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:370)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:380)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:365)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to read WAL 
record at position: 19346739 size: 67108864
at 
org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.handleRecordException(AbstractWalRecordsIterator.java:263)
at 
org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.advanceRecord(AbstractWalRecordsIterator.java:229)
at 
org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.advance(AbstractWalRecordsIterator.java:149)
at 
org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.onNext(AbstractWalRecordsIterator.java:115)

[jira] [Created] (IGNITE-9082) Throwing checked exception during tx commit without node stopping leads to data corruption

2018-07-25 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-9082:
---

 Summary: Throwing checked exception during tx commit without node 
stopping leads to data corruption
 Key: IGNITE-9082
 URL: https://issues.apache.org/jira/browse/IGNITE-9082
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.6, 2.5, 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.7


If we get a checked exception during tx commit on a primary node, and this 
exception is neither handled as NodeStopping nor leads to a node stop via the 
Failure Handler, we may get data loss on a node which is a backup for this tx.

Possible solution:
If we get any checked or unchecked exception during tx commit, we should stop 
the node afterwards to prevent further data loss.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8904) Add rebalanceThreadPoolSize to nodes configuration consistency check

2018-07-02 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8904:
---

 Summary: Add rebalanceThreadPoolSize to nodes configuration 
consistency check
 Key: IGNITE-8904
 URL: https://issues.apache.org/jira/browse/IGNITE-8904
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5, 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.7


If a supplier node has a smaller rebalance thread pool size than a demander node, the 
rebalance process between them will hang indefinitely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8848) Introduce new split-brain tests when topology is under load

2018-06-21 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8848:
---

 Summary: Introduce new split-brain tests when topology is under 
load
 Key: IGNITE-8848
 URL: https://issues.apache.org/jira/browse/IGNITE-8848
 Project: Ignite
  Issue Type: Improvement
  Components: cache, zookeeper
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


We should check the following cases:
1) The primary node of a transaction is located in the part of the cluster that will 
survive, while the backup is not.
2) The backup node of a transaction is located in the part of the cluster that will 
survive, while the primary is not.
3) A client has a connection to both split-brain parts.
4) A client has a connection to only one part of the split cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8844) Provide example how to implement auto-activation policy when cluster is activated first time

2018-06-21 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8844:
---

 Summary: Provide example how to implement auto-activation policy 
when cluster is activated first time
 Key: IGNITE-8844
 URL: https://issues.apache.org/jira/browse/IGNITE-8844
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5, 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.6


Some of our users who embed Ignite face the problem of how to 
activate the cluster for the first time, when no baseline has been established yet.
We should provide an example of such a policy, as we did with BaselineWatcher. 
A hedged sketch is shown below.
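
A minimal sketch of a first-time auto-activation policy. It assumes EVT_NODE_JOINED is enabled via IgniteConfiguration.setIncludeEventTypes and that the expected server count is known to the application; only public Ignite APIs (events(), cluster().active()) are used:
{code:java}
final int expectedServers = 3; // application-specific assumption

final Ignite ignite = Ignition.start(cfg);

ignite.events().localListen(evt -> {
    // Activate once all expected servers have joined and the cluster is still inactive.
    if (!ignite.cluster().active() && ignite.cluster().forServers().nodes().size() >= expectedServers)
        ignite.cluster().active(true);

    return true; // keep listening
}, EventType.EVT_NODE_JOINED);
{code}
In a real policy the activation call would likely be offloaded from the listener thread; this sketch only illustrates the trigger condition.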



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8835) Do not skip distributed phase of 2-phase partition release if there are some caches to stop / modify

2018-06-19 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8835:
---

 Summary: Do not skip distributed phase of 2-phase partition 
release if there are some caches to stop / modify
 Key: IGNITE-8835
 URL: https://issues.apache.org/jira/browse/IGNITE-8835
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


If we don't perform the distributed 2-phase partition release in case of a cache stop, 
we can lose some transactional updates on the way from primary to backup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8793) Introduce metrics for File I/O operations to monitor disk performance

2018-06-14 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8793:
---

 Summary: Introduce metrics for File I/O operations to monitor disk 
performance
 Key: IGNITE-8793
 URL: https://issues.apache.org/jira/browse/IGNITE-8793
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.6


It would be good to introduce some kind of wrapper for File I/O that measures 
read/write times, for a better understanding of what is happening with persistence. 
The measurements should be exposed as JMX metrics. A sketch of such a wrapper 
is shown below.
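
A minimal sketch of a timing decorator over java.nio. This is a generic illustration, not the internal Ignite FileIO interface; exporting the counters as JMX metrics is assumed to happen elsewhere:
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.atomic.LongAdder;

class TimedFileChannel {
    private final FileChannel delegate;

    final LongAdder readNanos = new LongAdder();  // exposed as JMX metrics elsewhere
    final LongAdder writeNanos = new LongAdder();

    TimedFileChannel(FileChannel delegate) {
        this.delegate = delegate;
    }

    int read(ByteBuffer buf) throws IOException {
        long start = System.nanoTime();

        try {
            return delegate.read(buf);
        }
        finally {
            readNanos.add(System.nanoTime() - start);
        }
    }

    int write(ByteBuffer buf) throws IOException {
        long start = System.nanoTime();

        try {
            return delegate.write(buf);
        }
        finally {
            writeNanos.add(System.nanoTime() - start);
        }
    }
}
{code}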



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8791) IgnitePdsTxCacheRebalancingTest.testTopologyChangesWithConstantLoad fails on TC

2018-06-14 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8791:
---

 Summary: 
IgnitePdsTxCacheRebalancingTest.testTopologyChangesWithConstantLoad fails on TC
 Key: IGNITE-8791
 URL: https://issues.apache.org/jira/browse/IGNITE-8791
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


{noformat}
junit.framework.AssertionFailedError: 46 8204 expected: but was:
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8785) Node may hang indefinitely in CONNECTING state during cluster segmentation

2018-06-13 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8785:
---

 Summary: Node may hang indefinitely in CONNECTING state during 
cluster segmentation
 Key: IGNITE-8785
 URL: https://issues.apache.org/jira/browse/IGNITE-8785
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.6


Affected test: 
org.apache.ignite.internal.processors.cache.IgniteTopologyValidatorGridSplitCacheTest#testTopologyValidatorWithCacheGroup

Node hangs with following stacktrace:

{noformat}
"grid-starter-testTopologyValidatorWithCacheGroup-22" #117619 prio=5 os_prio=0 
tid=0x7f17dd19b800 nid=0x304a in Object.wait() [0x7f16b19df000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:931)
- locked <0x000705ee4a60> (a java.lang.Object)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:373)
at 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:1948)
at 
org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:297)
at 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:915)
at 
org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1739)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1046)
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2014)
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1723)
- locked <0x000705995ec0> (a 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance)
at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1151)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:649)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:882)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:845)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:833)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:799)
at 
org.apache.ignite.testframework.junits.GridAbstractTest$3.call(GridAbstractTest.java:742)
at 
org.apache.ignite.testframework.GridTestThread.run(GridTestThread.java:86)
{noformat}

It seems that the node never receives an acknowledgment from the coordinator.

There were some failures before:

{noformat}
[org.apache.ignite:ignite-core] [2018-06-10 04:59:18,876][WARN 
][grid-starter-testTopologyValidatorWithCacheGroup-22][IgniteCacheTopologySplitAbstractTest$SplitTcpDiscoverySpi]
 Node has not been connected to topology and will repeat join process. Check 
remote nodes logs for possible error messages. Note that large topology may 
require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' 
configuration property if getting this message on the starting nodes 
[networkTimeout=5000]
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8784) Deadlock during simultaneous client reconnect and node stop

2018-06-13 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8784:
---

 Summary: Deadlock during simultaneous client reconnect and node 
stop
 Key: IGNITE-8784
 URL: https://issues.apache.org/jira/browse/IGNITE-8784
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.6



{noformat}
[18:48:22,665][ERROR][tcp-client-disco-msg-worker-#467%client%][IgniteKernal%client]
 Failed to reconnect, will stop node
class org.apache.ignite.IgniteException: Failed to wait for local node joined 
event (grid is stopping).
at 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2193)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.onKernalStart(GridCachePartitionExchangeManager.java:583)
at 
org.apache.ignite.internal.processors.cache.GridCacheSharedContext.onReconnected(GridCacheSharedContext.java:396)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.onReconnected(GridCacheProcessor.java:1159)
at 
org.apache.ignite.internal.IgniteKernal.onReconnected(IgniteKernal.java:3915)
at 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery0(GridDiscoveryManager.java:830)
at 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery(GridDiscoveryManager.java:589)
at 
org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2423)
at 
org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2402)
at 
org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processNodeAddFinishedMessage(ClientImpl.java:2047)
at 
org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processDiscoveryMessage(ClientImpl.java:1896)
at 
org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1788)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to wait for 
local node joined event (grid is stopping).
at 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.onKernalStop0(GridDiscoveryManager.java:1657)
at 
org.apache.ignite.internal.managers.GridManagerAdapter.onKernalStop(GridManagerAdapter.java:652)
at org.apache.ignite.internal.IgniteKernal.stop0(IgniteKernal.java:2218)
at org.apache.ignite.internal.IgniteKernal.stop(IgniteKernal.java:2166)
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop0(IgnitionEx.java:2588)
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop(IgnitionEx.java:2551)
at org.apache.ignite.internal.IgnitionEx.stop(IgnitionEx.java:372)
at org.apache.ignite.Ignition.stop(Ignition.java:229)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.stopGrid(GridAbstractTest.java:1088)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.stopAllGrids(GridAbstractTest.java:1128)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.stopAllGrids(GridAbstractTest.java:1109)
at 
org.gridgain.grid.internal.processors.cache.database.IgniteDbSnapshotNotStableTopologiesTest.afterTest(IgniteDbSnapshotNotStableTopologiesTest.java:250)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.tearDown(GridAbstractTest.java:1694)
at 
org.apache.ignite.testframework.junits.common.GridCommonAbstractTest.tearDown(GridCommonAbstractTest.java:492)
at junit.framework.TestCase.runBare(TestCase.java:146)
at junit.framework.TestResult$1.protect(TestResult.java:122)
at junit.framework.TestResult.runProtected(TestResult.java:142)
at junit.framework.TestResult.run(TestResult.java:125)
at junit.framework.TestCase.run(TestCase.java:129)
at junit.framework.TestSuite.runTest(TestSuite.java:255)
at junit.framework.TestSuite.run(TestSuite.java:250)
at junit.framework.TestSuite.runTest(TestSuite.java:255)
at junit.framework.TestSuite.run(TestSuite.java:250)
at 
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:369)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:275)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:239)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 

[jira] [Created] (IGNITE-8780) File I/O operations must be retried if buffer hasn't read/written completely

2018-06-13 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8780:
---

 Summary: File I/O operations must be retried if buffer hasn't 
read/written completely
 Key: IGNITE-8780
 URL: https://issues.apache.org/jira/browse/IGNITE-8780
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.6


Currently, we don't actually ensure that a buffer is written or read completely in:
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#writeCheckpointEntry
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#nodeStart

As a result, we may not write the actual data to disk, and after a node restart we 
can get a BufferUnderflowException, like this:

{noformat}
java.nio.BufferUnderflowException
at java.nio.Buffer.nextGetIndex(Buffer.java:506)
at java.nio.HeapByteBuffer.getLong(HeapByteBuffer.java:412)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readPointer(GridCacheDatabaseSharedManager.java:1915)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointStatus(GridCacheDatabaseSharedManager.java:1892)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:565)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.start0(GridCacheDatabaseSharedManager.java:525)
at 
org.apache.ignite.internal.processors.cache.GridCacheSharedManagerAdapter.start(GridCacheSharedManagerAdapter.java:61)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.start(GridCacheProcessor.java:700)
at 
org.apache.ignite.internal.IgniteKernal.startProcessor(IgniteKernal.java:1738)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:985)
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2014)
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1723)
at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1151)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:671)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:596)
at org.apache.ignite.Ignition.start(Ignition.java:327)
at org.apache.ignite.ci.db.TcHelperDb.start(TcHelperDb.java:67)
at 
org.apache.ignite.ci.web.CtxListener.contextInitialized(CtxListener.java:37)
at 
org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:890)
at 
org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:532)
at 
org.eclipse.jetty.server.handler.ContextHandler.startContext(ContextHandler.java:853)
at 
org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:344)
at 
org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1501)
at 
org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1463)
at 
org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:785)
at 
org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:261)
at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:545)
at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at 
org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
at org.eclipse.jetty.server.Server.start(Server.java:452)
at 
org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105)
at 
org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113)
at org.eclipse.jetty.server.Server.doStart(Server.java:419)
at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.apache.ignite.ci.web.Launcher.runServer(Launcher.java:68)
at 
org.apache.ignite.ci.TcHelperJettyLauncher.main(TcHelperJettyLauncher.java:10)
{noformat}

and the node ends up in an unrecoverable state. The standard fix is sketched below.
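
A minimal sketch of the retry loop using plain java.nio helpers; whether Ignite wraps this in its own File I/O abstraction is outside the scope of this example:
{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class IoUtils {
    /** Writes the whole buffer, retrying partial writes. */
    static void writeFully(FileChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining())
            ch.write(buf);
    }

    /** Reads until the buffer is full or the end of file is reached. */
    static void readFully(FileChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            if (ch.read(buf) < 0)
                throw new EOFException("Unexpected end of file");
        }
    }
}
{code}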



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8750) IgniteWalFlushDefaultSelfTest.testFailAfterStart fails on TC

2018-06-08 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8750:
---

 Summary: IgniteWalFlushDefaultSelfTest.testFailAfterStart fails on 
TC
 Key: IGNITE-8750
 URL: https://issues.apache.org/jira/browse/IGNITE-8750
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


{noformat}
org.apache.ignite.IgniteException: Failed to get object field 
[obj=GridCacheSharedManagerAdapter [starting=true, stop=false], 
fieldNames=[mmap]]
Caused by: java.lang.NoSuchFieldException: mmap
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8691) Get rid of tests jar artifact in ignite-zookeeper module

2018-06-04 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8691:
---

 Summary: Get rid of tests jar artifact in ignite-zookeeper module
 Key: IGNITE-8691
 URL: https://issues.apache.org/jira/browse/IGNITE-8691
 Project: Ignite
  Issue Type: Bug
  Components: zookeeper
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


Currently, the Ignite build process produces the
{noformat}
org/apache/ignite/ignite-zookeeper/2.X.X/ignite-zookeeper-2.X.X-tests.jar
{noformat}
artifact, which seems to be useless and should be excluded from the 
packaging result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8690) Missed package-info for some packages

2018-06-04 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8690:
---

 Summary: Missed package-info for some packages
 Key: IGNITE-8690
 URL: https://issues.apache.org/jira/browse/IGNITE-8690
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


List of affected packages:

{noformat}
org.apache.ignite.spi.communication.tcp.internal
org.apache.ignite.spi.discovery.zk
org.apache.ignite.spi.discovery.zk.internal
org.apache.ignite.ml.structures.partition
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8688) Pending tree is initialized outside of checkpoint lock

2018-06-04 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8688:
---

 Summary: Pending tree is initialized outside of checkpoint lock
 Key: IGNITE-8688
 URL: https://issues.apache.org/jira/browse/IGNITE-8688
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Andrew Mashenkov
 Fix For: 2.6


This may lead to page corruption. A sketch of the expected locking pattern is shown after the stack trace below.

{noformat}
handled accordingly to configured handler [hnd=class 
o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext 
[type=SYSTEM_WORKER_TERMINATION, err=java.lang.AssertionError]]
[00:11:56]W: [org.gridgain:gridgain-compatibility] 
java.lang.AssertionError
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.allocatePage(PageMemoryImpl.java:463)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.allocateForTree(IgniteCacheOffheapManagerImpl.java:818)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.initPendingTree(IgniteCacheOffheapManagerImpl.java:164)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.onCacheStarted(IgniteCacheOffheapManagerImpl.java:151)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.CacheGroupContext.onCacheStarted(CacheGroupContext.java:283)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheStart(GridCacheProcessor.java:1965)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:791)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:946)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:651)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2458)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2338)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
[00:11:56]W: [org.gridgain:gridgain-compatibility]  at 
java.lang.Thread.run(Thread.java:748)
{noformat}
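
A minimal sketch of the expected pattern, assuming the standard checkpoint read-lock methods of the database shared manager; the surrounding context and initPendingTree() call are illustrative:
{code:java}
// Page allocations for the pending tree must happen under the checkpoint read lock,
// so a concurrent checkpoint cannot observe a half-initialized tree.
ctx.database().checkpointReadLock();

try {
    initPendingTree(cctx); // assumed existing initialization routine
}
finally {
    ctx.database().checkpointReadUnlock();
}
{code}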




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8610) Searching checkpoint / WAL history for rebalancing is not properly working in case of local/global WAL disabling

2018-05-24 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8610:
---

 Summary: Searching checkpoint / WAL history for rebalancing is not 
properly working in case of local/global WAL disabling
 Key: IGNITE-8610
 URL: https://issues.apache.org/jira/browse/IGNITE-8610
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


After the implementation of IGNITE-6411 and IGNITE-8087, we can face a situation 
where, after some checkpoint, WAL was temporarily disabled and then enabled again. In 
this case we can't treat such a checkpoint as the starting point for rebalance, because 
the WAL history after it may contain gaps.

We should rework our checkpoint / WAL history searching mechanism to ignore 
such checkpoints.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8544) WAL disabling during rebalance mechanism uses wrong topology version in case of exchanges merge

2018-05-21 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8544:
---

 Summary: WAL disabling during rebalance mechanism uses wrong 
topology version in case of exchanges merge
 Key: IGNITE-8544
 URL: https://issues.apache.org/jira/browse/IGNITE-8544
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


After an exchange is done, we use the initial exchange version to determine the 
topology version on which rebalance should finish, and we save it. After rebalance 
finishes we compare the current topology version with the saved one: if they are 
equal, we enable WAL, own partitions and do a checkpoint; otherwise we do nothing, 
because the topology has changed.
In case of exchanges merge we save the old topology version (before the merge), 
which makes the WAL-enabling logic always skipped, because the topology version 
check always fails.
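
A minimal sketch of the intended fix, assuming hypothetical helper types (not the 
real Ignite classes): the saved version must be the resulting, possibly merged, 
exchange version rather than the initial one.

{code:java}
/** Hypothetical topology version holder. */
class TopVer implements Comparable<TopVer> {
    final long ver;

    TopVer(long ver) { this.ver = ver; }

    @Override public int compareTo(TopVer o) { return Long.compare(ver, o.ver); }
}

class WalRebalanceState {
    private TopVer rebalanceTopVer;

    /** Save the version rebalance is expected to finish on: the resulting (merged) exchange version. */
    void onExchangeDone(TopVer initialVer, TopVer resultVer) {
        rebalanceTopVer = resultVer; // saving initialVer here is the bug described above
    }

    /** Re-enable WAL, own partitions and checkpoint only if the topology hasn't changed since. */
    boolean shouldEnableWal(TopVer currentVer) {
        return rebalanceTopVer != null && rebalanceTopVer.compareTo(currentVer) == 0;
    }
}
{code}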




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8527) Show actual rebalance starting in logs

2018-05-18 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8527:
---

 Summary: Show actual rebalance starting in logs
 Key: IGNITE-8527
 URL: https://issues.apache.org/jira/browse/IGNITE-8527
 Project: Ignite
  Issue Type: Improvement
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


We should increase the logging level from DEBUG to INFO for the following message:

{noformat}
if (log.isDebugEnabled())
log.debug("Requested rebalancing [from node=" + 
node.id() + ", listener index=" + topicId + " " + demandMsg.rebalanceId() + ", 
partitions count=" + stripePartitions.get(topicId).size() + " (" + 
stripePartitions.get(topicId).partitionsList() + ")]");

{noformat}

so that the actual rebalancing start time is visible in the logs.
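
The change itself is small; a sketch using the same variables as in the snippet 
above (surrounding class not shown):

{code:java}
// Log unconditionally at INFO so the actual rebalancing start time is always present in the logs.
if (log.isInfoEnabled())
    log.info("Requested rebalancing [from node=" + node.id() +
        ", listener index=" + topicId + " " + demandMsg.rebalanceId() +
        ", partitions count=" + stripePartitions.get(topicId).size() +
        " (" + stripePartitions.get(topicId).partitionsList() + ")]");
{code}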



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8482) Skip 2-phase partition release wait in case of activation or dynamic caches start

2018-05-14 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8482:
---

 Summary: Skip 2-phase partition release wait in case of activation 
or dynamic caches start
 Key: IGNITE-8482
 URL: https://issues.apache.org/jira/browse/IGNITE-8482
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


Currently we perform the 2-phase partition release wait on any type of 
distributed exchange. We can optimize this behaviour by skipping the wait on 
cluster activation (if we activate the cluster, no caches were running before 
activation, so there are no operations to wait for) and on dynamic cache start.
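
A minimal sketch of the proposed condition, assuming a hypothetical ExchangeActions 
holder with flags for the exchange reason (the real class layout in Ignite differs):

{code:java}
/** Hypothetical flags describing the reason for a distributed exchange. */
class ExchangeActions {
    boolean activate;          // cluster activation request
    boolean dynamicCacheStart; // dynamic cache start request
}

class PartitionReleasePolicy {
    /** The 2-phase partition release wait can be skipped when no prior operations need to complete. */
    static boolean needPartitionReleaseWait(ExchangeActions actions) {
        if (actions.activate)
            return false; // no caches were running before activation, nothing to wait for

        if (actions.dynamicCacheStart)
            return false; // a newly started cache has no in-flight operations yet

        return true;
    }
}
{code}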



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8459) Searching checkpoint history for WAL rebalance is broken

2018-05-08 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8459:
---

 Summary: Searching checkpoint history for WAL rebalance is broken
 Key: IGNITE-8459
 URL: https://issues.apache.org/jira/browse/IGNITE-8459
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


Currently the mechanism that searches for available checkpoint records in WAL to 
build the history for WAL rebalance is broken. It means that WAL (historical) 
rebalance will never find a history to rebalance from, and full rebalance will 
always be used.

This mechanism was broken in 
https://github.com/apache/ignite/commit/ec04cd174ed5476fba83e8682214390736321b37
 for unclear reasons.

If we swap the following two code blocks (database().beforeExchange() and 
exchCtx if block):

{noformat}
/* It is necessary to run database callback before all topology 
callbacks.
   In case of persistent store is enabled we first restore partitions 
presented on disk.
   We need to guarantee that there are no partition state changes 
logged to WAL before this callback
   to make sure that we correctly restored last actual states. */
cctx.database().beforeExchange(this);

if (!exchCtx.mergeExchanges()) {
for (CacheGroupContext grp : cctx.cache().cacheGroups()) {
if (grp.isLocal() || cacheGroupStopping(grp.groupId()))
continue;

// It is possible affinity is not initialized yet if node joins 
to cluster.
if (grp.affinity().lastVersion().topologyVersion() > 0)
grp.topology().beforeExchange(this, !centralizedAff && 
!forceAffReassignment, false);
}
}
{noformat}

the searching mechanism starts to work correctly. It is currently unclear why 
this happens.
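
For clarity, the swapped ordering described above looks like this (the same code 
as in the snippet, only reordered; shown as a sketch, not a verified patch):

{code:java}
// Swapped ordering: run the topology beforeExchange callbacks first...
if (!exchCtx.mergeExchanges()) {
    for (CacheGroupContext grp : cctx.cache().cacheGroups()) {
        if (grp.isLocal() || cacheGroupStopping(grp.groupId()))
            continue;

        // It is possible affinity is not initialized yet if node joins to cluster.
        if (grp.affinity().lastVersion().topologyVersion() > 0)
            grp.topology().beforeExchange(this, !centralizedAff && !forceAffReassignment, false);
    }
}

// ...and only then the database callback that restores partitions from disk.
cctx.database().beforeExchange(this);
{code}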



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8422) Zookeeper discovery split brain detection shouldn't consider client nodes

2018-04-28 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8422:
---

 Summary: Zookeeper discovery split brain detection shouldn't 
consider client nodes
 Key: IGNITE-8422
 URL: https://issues.apache.org/jira/browse/IGNITE-8422
 Project: Ignite
  Issue Type: Bug
  Components: zookeeper
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


Currently Zookeeper discovery checks each split part of the cluster for full 
connectivity taking client nodes into account. This is not correct, because server 
and client nodes may use different networks to connect to each other. It means that 
there can be a client which sees both parts of the split cluster and breaks 
split-brain recovery - a fully connected part of server nodes will never be found.

We should exclude client nodes from the split-brain analysis and improve the 
split-brain tests to make them truly fair.
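
A minimal sketch of the intended filtering using the public ClusterNode API; where 
this filter plugs into the Zookeeper discovery connectivity check is not shown:

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.ignite.cluster.ClusterNode;

public class SplitBrainNodeFilter {
    /** Keeps only server nodes for the split-brain connectivity analysis. */
    static List<ClusterNode> serverNodes(Collection<ClusterNode> allNodes) {
        List<ClusterNode> res = new ArrayList<>();

        for (ClusterNode n : allNodes) {
            // A client may see both parts of the split cluster, so it must not affect the analysis.
            if (!n.isClient())
                res.add(n);
        }

        return res;
    }
}
{code}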



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8415) Manual cache().rebalance() invocation may cancel currently running rebalance

2018-04-27 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8415:
---

 Summary: Manual cache().rebalance() invocation may cancel 
currently running rebalance
 Key: IGNITE-8415
 URL: https://issues.apache.org/jira/browse/IGNITE-8415
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.6


If a historical rebalance is in progress and during it we manually invoke 
{noformat}
Ignite.cache(CACHE_NAME).rebalance().get();
{noformat}
then the currently running rebalance will be cancelled and a new one started, which 
does not seem right. Moreover, after the new rebalance finishes we can lose some 
data if the rebalance includes entry removes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8405) Sql query may see intermediate results of topology changes and do mapping incorrectly

2018-04-26 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8405:
---

 Summary: Sql query may see intermediate results of topology 
changes and do mapping incorrectly
 Key: IGNITE-8405
 URL: https://issues.apache.org/jira/browse/IGNITE-8405
 Project: Ignite
  Issue Type: Bug
  Components: cache, sql
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


Affected test: IgniteStableBaselineCacheQueryNodeRestartsSelfTest

An SQL query does mapping in the following way:
1) If there is at least one moving partition, the query is mapped to the current 
partition owners.
2) Otherwise, affinity mapping is used.

With the first approach the query may see a non-final partition state if the 
mapping happens during PME. The "setOwners()" method moves partitions one-by-one, 
each time obtaining the topology write lock. If query mapping happens at this time, 
it may see some moving partition and map to an OWNING partition which will be moved 
to MOVING on the next "setOwners()" invocation.

As a result we may query invalid partitions.

As an intermediate solution, the "setOwners()" method should be refactored to 
perform ALL partition state changes to MOVING in one batch operation (see the 
sketch below).

As a general solution, query mapping should be revisited, especially the 
"isPreloadingActive" method, to take the given topology version into account.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8392) Removing WAL history directory leads to JVM crash on that node.

2018-04-25 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8392:
---

 Summary: Removing WAL history directory leads to JVM crash on that 
node.
 Key: IGNITE-8392
 URL: https://issues.apache.org/jira/browse/IGNITE-8392
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
 Environment: Ubuntu 17.10
Oracle JVM Server (1.8.0_151-b12)
Reporter: Pavel Kovalenko
 Fix For: 2.6


Problem:
1) Start a node, load some data, deactivate the cluster.
2) Remove the WAL history directory.
3) Activate the cluster.

Cluster activation fails due to a JVM crash like this:

{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7feda1052526, pid=29331, tid=0x7fed193d7700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 
1.8.0_151-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# v  ~StubRoutines::jshort_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---  T H R E A D  ---

Current thread (0x7fec8b202800):  JavaThread 
"db-checkpoint-thread-#243%wal.IgniteWalRebalanceTest0%" [_thread_in_Java, 
id=29655, stack(0x7fed192d7000,0x7fed193d8000)]

siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 
0x7fed198ee0b2

Registers:
RAX=0x0007710a9f28, RBX=0x000120b2, RCX=0x0800, 
RDX=0xfe08
RSP=0x7fed193d5c60, RBP=0x7fed193d5c60, RSI=0x7fed198ef0aa, 
RDI=0x0007710a9f20
R8 =0x1000, R9 =0x000120b2, R10=0x7feda1052da0, 
R11=0x1004
R12=0x, R13=0x0007710a9f28, R14=0x1000, 
R15=0x7fec8b202800
RIP=0x7feda1052526, EFLAGS=0x00010282, CSGSFS=0x002b0033, 
ERR=0x0006
  TRAPNO=0x000e

Top of Stack: (sp=0x7fed193d5c60)
0x7fed193d5c60:   0007710a9f28 7feda1be314f
0x7fed193d5c70:   00010002 7feda17747fd
0x7fed193d5c80:   a8008c96 7feda11cfb3e
0x7fed193d5c90:    
0x7fed193d5ca0:    
0x7fed193d5cb0:    
0x7fed193d5cc0:   0007710a9f28 7feda1fb37e0
0x7fed193d5cd0:   0007710a8ef0 00076fa5f5c0
0x7fed193d5ce0:   0007710a9f28 0007710a8ef0
0x7fed193d5cf0:   0007710a8ef0 7fed193d5d18
0x7fed193d5d00:   7fedb8428c76 
0x7fed193d5d10:   1014 00076fa5f650
0x7fed193d5d20:   f8043261 7feda1ee597c
0x7fed193d5d30:   00076fa5f5a8 0007710a9f28
0x7fed193d5d40:   0007710a8ef0 000120a2
0x7fed193d5d50:   00012095 1021
0x7fed193d5d60:   edf4bec3 0001209e
0x7fed193d5d70:   0007710a9f28 00076fa5f650
0x7fed193d5d80:   7fed193d5da8 1014
0x7fed193d5d90:   0007710a8ef0 7fed198dc000
0x7fed193d5da0:   00076fa5f650 7feda1b7a040
0x7fed193d5db0:   0007710a9f28 00076fa700d0
0x7fed193d5dc0:   0007710a9f68 ee2153e5f8043261
0x7fed193d5dd0:   0007710a8ef0 0007710a9f98
0x7fed193d5de0:   00012095 0007710a9f28
0x7fed193d5df0:    1fa0
0x7fed193d5e00:    
0x7fed193d5e10:   0007710a8ef0 7feda2001530
0x7fed193d5e20:   0007710a8ef0 00076f7c05e8
0x7fed193d5e30:   edef80bd 
0x7fed193d5e40:    
0x7fed193d5e50:   7fedb2266000 7feda1cb1f8c 

Instructions: (pc=0x7feda1052526)
0x7feda1052506:   00 00 74 08 66 8b 47 08 66 89 46 08 48 33 c0 c9
0x7feda1052516:   c3 66 0f 1f 84 00 00 00 00 00 c5 fe 6f 44 d7 c8
0x7feda1052526:   c5 fe 7f 44 d6 c8 c5 fe 6f 4c d7 e8 c5 fe 7f 4c
0x7feda1052536:   d6 e8 48 83 c2 08 7e e2 48 83 ea 04 7f 10 c5 fe 

Register to memory mapping:

RAX=0x0007710a9f28 is an oop
java.nio.DirectByteBuffer 
 - klass: 'java/nio/DirectByteBuffer'
RBX=0x000120b2 is an unknown value
RCX=0x0800 is an unknown value
RDX=0xfe08 is an unknown value
RSP=0x7fed193d5c60 is pointing into the stack for thread: 0x7fec8b202800
RBP=0x7fed193d5c60 is pointing into the stack for thread: 0x7fec8b202800
RSI=0x7fed198ef0aa is an unknown value
RDI=0x0007710a9f20 is an oop
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8391) Removing some WAL history segments leads to WAL rebalance hanging

2018-04-25 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8391:
---

 Summary: Removing some WAL history segments leads to WAL rebalance 
hanging
 Key: IGNITE-8391
 URL: https://issues.apache.org/jira/browse/IGNITE-8391
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.6


Problem:
1) Start 2 nodes, load some data into them.
2) Stop node 2, load some more data into the cache.
3) Remove an archived WAL segment which doesn't contain the checkpoint record 
needed to find the start point for WAL rebalance, but contains data necessary for 
rebalancing.
4) Start node 2; it will start rebalancing data from node 1 using WAL.

The rebalance hangs with the following assertion:

{noformat}
java.lang.AssertionError: Partitions after rebalance should be either done or 
missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
 
This happens because we never reach the necessary data and update counters 
contained in the removed WAL segment.

To resolve such problems we should introduce a fallback strategy for the case when 
rebalance by WAL fails. An example of such a strategy is to re-run full rebalance 
for the partitions that could not be properly rebalanced using WAL.
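
A minimal sketch of the fallback idea with placeholder types (the real demander / 
supplier message flow is not shown): partitions that could not be served from WAL 
history are simply re-requested with full rebalance instead of failing.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class RebalanceFallback {
    enum Mode { HISTORICAL, FULL }

    /** Chooses the rebalance mode for the next attempt after a failed historical rebalance. */
    static Map<Integer, Mode> nextAttempt(Set<Integer> requestedParts, Set<Integer> failedHistoricalParts) {
        Map<Integer, Mode> res = new HashMap<>();

        for (int p : requestedParts)
            res.put(p, failedHistoricalParts.contains(p) ? Mode.FULL : Mode.HISTORICAL);

        return res;
    }
}
{code}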



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8390) WAL historical rebalance is not able to process cache.remove() updates

2018-04-25 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8390:
---

 Summary: WAL historical rebalance is not able to process 
cache.remove() updates
 Key: IGNITE-8390
 URL: https://issues.apache.org/jira/browse/IGNITE-8390
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


WAL historical rebalance fails on the supplier when processing an entry remove, 
with the following assertion:

{noformat}
java.lang.AssertionError: GridCacheEntryInfo [key=KeyCacheObjectImpl [part=-1, 
val=2, hasValBytes=true], cacheId=94416770, val=null, ttl=0, expireTime=0, 
ver=GridCacheVersion [topVer=136155335, order=1524675346187, nodeOrder=1], 
isNew=false, deleted=false]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplyMessage.addEntry0(GridDhtPartitionSupplyMessage.java:220)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:381)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

Obviously this assertion is correct only for full rebalance. We should either 
soften the assertion for the historical rebalance case or disable it.
With the assertion disabled everything works well and the rebalance finishes 
properly.
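
A minimal sketch of what "softening" the check could look like, with a hypothetical 
flag indicating a historical supply message (the real addEntry0 signature differs):

{code:java}
class SupplyMessageEntryCheck {
    /**
     * In full rebalance an entry without a value is unexpected; in historical (WAL) rebalance
     * a remove is replayed as an entry with a null value, so the strict check must be skipped.
     */
    static void checkEntry(boolean historicalRebalance, Object val, boolean deleted) {
        if (historicalRebalance)
            return;

        assert val != null || deleted : "Entry without value in a full rebalance supply message";
    }
}
{code}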



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8339) After cluster activation actual partition state restored from WAL may be lost

2018-04-20 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8339:
---

 Summary: After cluster activation actual partition state restored 
from WAL may be lost
 Key: IGNITE-8339
 URL: https://issues.apache.org/jira/browse/IGNITE-8339
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.5


On cluster activation we restore partition states from the checkpoint and WAL. But 
before that we pre-create partitions by the ideal assignment in the "beforeExchange" 
phase and own them in case of the first or a subsequent activation. This partition 
state change is logged to WAL and overrides the actual last state of the partition 
during restore.

Possible solutions:
1) Pre-create partitions after the actual restore.
2) Do not log the partition owning to WAL on the pre-create phase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8338) Cache operations hang after cluster deactivation and activation again

2018-04-20 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8338:
---

 Summary: Cache operations hang after cluster deactivation and 
activation again
 Key: IGNITE-8338
 URL: https://issues.apache.org/jira/browse/IGNITE-8338
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.6


Problem:
1) Start several nodes
2) Activate cluster
3) Run cache load
4) Deactivate cluster
5) Activate again

After the second activation, cache operations hang with the following stacktrace:

{noformat}
"cache-load-2" #210 prio=5 os_prio=0 tid=0x7efbb401b800 nid=0x602b waiting 
on condition [0x7efb809b3000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:140)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.publicJCache(GridCacheProcessor.java:3782)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.publicJCache(GridCacheProcessor.java:3753)
at 
org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.checkProxyIsValid(GatewayProtectedCacheProxy.java:1486)
at 
org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.onEnter(GatewayProtectedCacheProxy.java:1508)
at 
org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.put(GatewayProtectedCacheProxy.java:785)
at 
org.apache.ignite.internal.processors.cache.IgniteClusterActivateDeactivateTestWithPersistence.lambda$testDeactivateDuringEviction$0(IgniteClusterActivateDeactivateTestWithPersistence.java:316)
at 
org.apache.ignite.internal.processors.cache.IgniteClusterActivateDeactivateTestWithPersistence$$Lambda$39/832408842.run(Unknown
 Source)
at 
org.apache.ignite.testframework.GridTestUtils$6.call(GridTestUtils.java:1254)
at 
org.apache.ignite.testframework.GridTestThread.run(GridTestThread.java:86)
{noformat}

It seems the dynamicStartCache future never completes after the second activation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8324) Ignite Cache Restarts 1 suite hangs with assertion error

2018-04-19 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8324:
---

 Summary: Ignite Cache Restarts 1 suite hangs with assertion error
 Key: IGNITE-8324
 URL: https://issues.apache.org/jira/browse/IGNITE-8324
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.5


{noformat}
[ERROR][exchange-worker-#620749%replicated.GridCacheReplicatedNodeRestartSelfTest0%][GridDhtPartitionsExchangeFuture]
 Failed to notify listener: 
o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2@6dd7cc93
java.lang.AssertionError: Invalid topology version [grp=ignite-sys-cache, 
topVer=AffinityTopologyVersion [topVer=323, minorTopVer=0], 
exchTopVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], 
discoCacheVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], 
exchDiscoCacheVer=AffinityTopologyVersion [topVer=323, minorTopVer=0], 
fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
[evtNode=TcpDiscoveryNode [id=48a5d243-7f63-4069-aba1-868c6895, 
addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503], discPort=47503, order=322, 
intOrder=163, lastExchangeTime=1524043684082, loc=false, 
ver=2.5.0#20180417-sha1:56be24b9, isClient=false], topVer=322, 
nodeId8=b51b3893, msg=Node joined: TcpDiscoveryNode 
[id=48a5d243-7f63-4069-aba1-868c6895, addrs=[127.0.0.1], 
sockAddrs=[/127.0.0.1:47503], discPort=47503, order=322, intOrder=163, 
lastExchangeTime=1524043684082, loc=false, ver=2.5.0#20180417-sha1:56be24b9, 
isClient=false], type=NODE_JOINED, tstamp=1524043684166], crd=TcpDiscoveryNode 
[id=b51b3893-377a-465f-88ea-316a6560, addrs=[127.0.0.1], 
sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, 
lastExchangeTime=1524043633288, loc=true, ver=2.5.0#20180417-sha1:56be24b9, 
isClient=false], exchId=GridDhtPartitionExchangeId 
[topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], 
discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode 
[id=48a5d243-7f63-4069-aba1-868c6895, addrs=[127.0.0.1], 
sockAddrs=[/127.0.0.1:47503], discPort=47503, order=322, intOrder=163, 
lastExchangeTime=1524043684082, loc=false, ver=2.5.0#20180417-sha1:56be24b9, 
isClient=false], topVer=322, nodeId8=b51b3893, msg=Node joined: 
TcpDiscoveryNode [id=48a5d243-7f63-4069-aba1-868c6895, addrs=[127.0.0.1], 
sockAddrs=[/127.0.0.1:47503], discPort=47503, order=322, intOrder=163, 
lastExchangeTime=1524043684082, loc=false, ver=2.5.0#20180417-sha1:56be24b9, 
isClient=false], type=NODE_JOINED, tstamp=1524043684166], nodeId=48a5d243, 
evt=NODE_JOINED], added=true, initFut=GridFutureAdapter 
[ignoreInterrupts=false, state=DONE, res=true, hash=527135060], init=true, 
lastVer=GridCacheVersion [topVer=135523955, order=1524043694535, nodeOrder=3], 
partReleaseFut=PartitionReleaseFuture [topVer=AffinityTopologyVersion 
[topVer=322, minorTopVer=0], futures=[ExplicitLockReleaseFuture 
[topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[]], 
AtomicUpdateReleaseFuture [topVer=AffinityTopologyVersion [topVer=322, 
minorTopVer=0], futures=[]], DataStreamerReleaseFuture 
[topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[]], 
LocalTxReleaseFuture [topVer=AffinityTopologyVersion [topVer=322, 
minorTopVer=0], futures=[]], AllTxReleaseFuture [topVer=AffinityTopologyVersion 
[topVer=322, minorTopVer=0], futures=[RemoteTxReleaseFuture 
[topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[]], 
exchActions=null, affChangeMsg=null, initTs=1524043684166, 
centralizedAff=false, forceAffReassignment=false, changeGlobalStateE=null, 
done=false, state=CRD, evtLatch=0, remaining=[], super=GridFutureAdapter 
[ignoreInterrupts=false, state=INIT, res=null, hash=1570781250]]]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.updateTopologyVersion(GridDhtPartitionTopologyImpl.java:257)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.updateTopologies(GridDhtPartitionsExchangeFuture.java:845)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:2461)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processSingleMessage(GridDhtPartitionsExchangeFuture.java:2200)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$100(GridDhtPartitionsExchangeFuture.java:127)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2057)
at 

[jira] [Created] (IGNITE-8313) Trace logs enhancement for exchange and affinity calculation

2018-04-18 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8313:
---

 Summary: Trace logs enhancement for exchange and affinity 
calculation
 Key: IGNITE-8313
 URL: https://issues.apache.org/jira/browse/IGNITE-8313
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.6


For better debugging of problems we should add more trace logs in the following 
places:
1) Partition states before and after an exchange.
2) Affinity distribution for each topology version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8218) Add exchange latch state to diagnostic messages

2018-04-11 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8218:
---

 Summary: Add exchange latch state to diagnostic messages
 Key: IGNITE-8218
 URL: https://issues.apache.org/jira/browse/IGNITE-8218
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.5






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8122) Partition state restored from WAL may be lost if no checkpoints are done

2018-04-03 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8122:
---

 Summary: Partition state restored from WAL may be lost if no 
checkpoints are done
 Key: IGNITE-8122
 URL: https://issues.apache.org/jira/browse/IGNITE-8122
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.5


Problem:
1) Start several nodes with persistence enabled.
2) Make sure that all partitions of 'ignite-sys-cache' have the OWN state on all 
nodes and the corresponding PartitionMetaStateRecord is logged to WAL.
3) Stop all nodes, start them again and activate the cluster. The checkpoint for 
'ignite-sys-cache' is empty, because there was no data in the cache.
4) The state of all partitions is restored to OWN from WAL 
(GridCacheDatabaseSharedManager#restoreState), but not recorded to page memory, 
because there were no checkpoints and no data in the cache. The store manager is 
not properly initialized for such partitions.
5) On exchange done we try to restore partition states 
(initPartitionsWhenAffinityReady) on all nodes. Because page memory is empty, the 
states of all partitions are restored to MOVING by default.
6) All nodes start to rebalance partitions from each other, and this process 
becomes unpredictable because we are trying to rebalance from MOVING partitions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8063) Transaction rollback is unmanaged in case when commit produced Runtime exception

2018-03-28 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8063:
---

 Summary: Transaction rollback is unmanaged in case when commit 
produced Runtime exception
 Key: IGNITE-8063
 URL: https://issues.apache.org/jira/browse/IGNITE-8063
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.5


When 'userCommit' produces a runtime exception, the transaction state is moved to 
UNKNOWN and tx.finishFuture() completes; after that the rollback process runs 
asynchronously and there is no simple way to await rollback completion for such 
transactions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8062) Add ability to properly wait for transaction finish in case of PRIMARY_SYNC cache mode

2018-03-28 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8062:
---

 Summary: Add ability to properly wait for transaction finish in 
case of PRIMARY_SYNC cache mode
 Key: IGNITE-8062
 URL: https://issues.apache.org/jira/browse/IGNITE-8062
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.5


Currently GridDhtTxFinishFuture may finish ahead of time in PRIMARY_SYNC mode, and 
there is no way to properly wait for such futures to finish on remote nodes.

We should introduce the ability to wait for full transaction completion in such 
cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7987) Affinity may be not calculated properly in case of merged exchanges with client nodes

2018-03-19 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7987:
---

 Summary: Affinity may be not calculated properly in case of merged 
exchanges with client nodes
 Key: IGNITE-7987
 URL: https://issues.apache.org/jira/browse/IGNITE-7987
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.5


Currently we pass only the last (or, in some cases, the first) discovery event for 
affinity calculation in GridAffinityAssignmentCache.

The affinity calculation can be skipped as an optimization if such a discovery 
event belongs to a client node or a node filtered out by the nodeFilter (because 
affinity does not change in that case).

Since we have exchange merging, several discovery events can correspond to one 
exchange. Passing only the first or last event for affinity calculation is wrong, 
because the calculation can be skipped while the exchange actually contains events 
that change affinity.

Instead of the first/last event we should pass the whole collection of discovery 
events (ExchangeDiscoveryEvents) and skip affinity calculation for a group only 
when NONE of the events changes affinity for that group.
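
A minimal sketch of the proposed check using the public DiscoveryEvent/ClusterNode 
API; node filters and other details are intentionally ignored here:

{code:java}
import java.util.Collection;

import org.apache.ignite.events.DiscoveryEvent;

class AffinityRecalcCheck {
    /** Affinity recalculation for a cache group may be skipped only if NO merged event changes affinity. */
    static boolean skipCalculationForGroup(Collection<DiscoveryEvent> exchangeEvents) {
        for (DiscoveryEvent evt : exchangeEvents) {
            // Simplified: only events of client nodes never change affinity (node filters ignored).
            if (!evt.eventNode().isClient())
                return false;
        }

        return true;
    }
}
{code}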



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7946) IgniteCacheClientQueryReplicatedNodeRestartSelfTest#testRestarts can hang on TC

2018-03-14 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7946:
---

 Summary: 
IgniteCacheClientQueryReplicatedNodeRestartSelfTest#testRestarts can hang on TC
 Key: IGNITE-7946
 URL: https://issues.apache.org/jira/browse/IGNITE-7946
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko


According to the test logs there can be an unfinished rebalance:

{noformat}
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,327][INFO 
][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander]
 Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion 
[topVer=103, minorTopVer=0]]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,327][INFO 
][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander]
 Completed rebalance future: RebalanceFuture [grp=CacheGroupContext [grp=pr], 
topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0], rebalanceId=1]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO 
][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridCachePartitionExchangeManager]
 Rebalancing scheduled [order=[pe, pr]]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO 
][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridCachePartitionExchangeManager]
 Rebalancing started [top=AffinityTopologyVersion [topVer=104, minorTopVer=0], 
evt=NODE_LEFT, node=04d02ea1-286c-4d8c-8870-e147c552]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO 
][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander]
 Starting rebalancing [grp=pe, mode=SYNC, 
fromNode=31193890-bf8f-4c85-af76-342efb31, partitionsCount=15, 
topology=AffinityTopologyVersion [topVer=104, minorTopVer=0], rebalanceId=2]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO 
][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander]
 Starting rebalancing [grp=pe, mode=SYNC, 
fromNode=517f4efb-4433-489a-8c8e-e91f9e70, partitionsCount=16, 
topology=AffinityTopologyVersion [topVer=104, minorTopVer=0], rebalanceId=2]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO 
][exchange-worker-#455983%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest0%][GridCachePartitionExchangeManager]
 Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
[topVer=104, minorTopVer=0], evt=NODE_LEFT, 
node=04d02ea1-286c-4d8c-8870-e147c552]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,332][INFO 
][sys-#456730%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander]
 Completed rebalancing [fromNode=517f4efb-4433-489a-8c8e-e91f9e70, 
cacheOrGroup=pe, topology=AffinityTopologyVersion [topVer=104, minorTopVer=0], 
time=0 ms]

{noformat}


This can be the cause of the test hanging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7898) IgniteCachePartitionLossPolicySelfTest is flaky on TC

2018-03-07 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7898:
---

 Summary: IgniteCachePartitionLossPolicySelfTest is flaky on TC
 Key: IGNITE-7898
 URL: https://issues.apache.org/jira/browse/IGNITE-7898
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


Affected tests:
testReadOnlyAll
testReadWriteSafe

Exception:
{code:java}
junit.framework.AssertionFailedError: Failed to find expected lost partition 
[exp=0, lost=[]]
at 
org.apache.ignite.internal.processors.cache.distributed.IgniteCachePartitionLossPolicySelfTest.verifyCacheOps(IgniteCachePartitionLossPolicySelfTest.java:219)
at 
org.apache.ignite.internal.processors.cache.distributed.IgniteCachePartitionLossPolicySelfTest.checkLostPartition(IgniteCachePartitionLossPolicySelfTest.java:166)
at 
org.apache.ignite.internal.processors.cache.distributed.IgniteCachePartitionLossPolicySelfTest.testReadWriteSafe(IgniteCachePartitionLossPolicySelfTest.java:114)
{code}

The cause of the failure:
After we prepare the topology and shut down the node containing the lost partition, 
we immediately start checking it on all nodes (the cache.lostPartitions() method). 
Sometimes we invoke this method on a client node where the last PME has not even 
started yet and get an empty list of lost partitions, because it hasn't been 
received via PME yet.

Possible solution:
Wait for PME to finish on all nodes (including the client) before starting to check 
for lost partitions.
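
A sketch of the proposed fix as a standalone test shape, assuming the test 
framework's awaitPartitionMapExchange() helper; the cache name, grid counts and 
cache configuration (backups, partition loss policy, client node) are placeholders 
and are not taken from the real IgniteCachePartitionLossPolicySelfTest:

{code:java}
import java.util.Collection;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;

public class LostPartitionCheckSketch extends GridCommonAbstractTest {
    /** Placeholder cache name; the cache configuration that makes partitions lossable is omitted. */
    private static final String CACHE_NAME = "partitioned";

    public void testLostPartitionVisibleOnAllNodes() throws Exception {
        startGrids(4);

        stopGrid(3); // the node holding the partition that becomes lost

        // Proposed fix: wait for PME to complete on every node before checking lost partitions.
        awaitPartitionMapExchange();

        for (Ignite node : Ignition.allGrids()) {
            Collection<Integer> lost = node.cache(CACHE_NAME).lostPartitions();

            assertFalse("No lost partitions reported on " + node.name(), lost.isEmpty());
        }
    }
}
{code}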




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7882) Atomic update requests should always use topology mappings instead of affinity

2018-03-05 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7882:
---

 Summary: Atomic update requests should always use topology 
mappings instead of affinity
 Key: IGNITE-7882
 URL: https://issues.apache.org/jira/browse/IGNITE-7882
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


Currently, for mapping cache atomic updates, we can use two ways:
1) Use the nodes reporting the OWNING state for the partition to which we send the 
update.
2) Use only the affinity node mapping if rebalance is finished.

With the second way we may route the update request only to the affinity node, 
while there is also a node which is still an owner and can process read requests.

This can lead to reading null values for some key, while an update for that key was 
successful a moment ago.
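
A minimal sketch of the intended mapping rule with a hypothetical topology view 
(the real atomic cache mapping code is much more involved):

{code:java}
import java.util.List;

/** Hypothetical read-only view of the partition topology. */
interface PartitionTopologyView {
    /** Nodes currently reporting the OWNING state for the partition. */
    List<String> owners(int part);

    /** Nodes assigned to the partition by the affinity function. */
    List<String> affinityNodes(int part);
}

class AtomicUpdateMapper {
    /** Always map updates to the current topology owners, so a node that still serves reads also gets the write. */
    static List<String> mapUpdate(PartitionTopologyView top, int part) {
        return top.owners(part);
    }
}
{code}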



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7873) Partition update counters and sizes may be different if cache is using readThrough

2018-03-02 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7873:
---

 Summary: Partition update counters and sizes may be different if 
cache is using readThrough
 Key: IGNITE-7873
 URL: https://issues.apache.org/jira/browse/IGNITE-7873
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko


Tracking of partition update counters and cache sizes may not work properly if the 
cache uses readThrough behavior.

If the data in the underlying storage has changed, read requests to such a cache 
can increment update counters or cache sizes on only some of the nodes serving the 
cache. It means that the update counter or cache size will be incremented only on 
the partition copy that served the request (the primary or any random node).

BackupPostProcessingClosure should use preload=false for the entry. Otherwise it 
can increment the update counter for a read request while the data has not changed.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7871) Partition update counters may be different during exchange

2018-03-02 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7871:
---

 Summary: Partition update counters may be different during exchange
 Key: IGNITE-7871
 URL: https://issues.apache.org/jira/browse/IGNITE-7871
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko


Using the validation implemented in IGNITE-7467 we can observe the following 
situation.

Suppose we have some partition and the nodes owning it: N1 (primary) and N2 
(backup).

1) An exchange is started.
2) N2 finishes waiting for partitions release and starts to create the Single 
message (with update counters).
3) N1 is still waiting for partitions release.
4) There is a pending cache update N1 -> N2. This update is applied after step 2.
5) The update increments the update counters on both N1 and N2.
6) N1 finishes waiting for partitions release, while N2 has already sent the Single 
message to the coordinator with an outdated update counter.
7) The coordinator sees different partition update counters for N1 and N2. 
Validation fails, while the data is actually equal.

Possible solutions:
1) Cancel transactions and atomic updates on backups if the topology version on 
them has already changed (or the wait for partitions release has finished).
2) Each node participating in the exchange should wait for the partitions release 
of the other nodes, not only its own (like a distributed countdown latch right 
after waiting for partitions release).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7833) Find out possible ways to handle partition update counters inconsistency

2018-02-27 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7833:
---

 Summary: Find out possible ways to handle partition update 
counters inconsistency
 Key: IGNITE-7833
 URL: https://issues.apache.org/jira/browse/IGNITE-7833
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Reporter: Pavel Kovalenko


We should think about possible ways to resolve the situation when we observe 
that partition update counters for the same partitions (primary-backup) are 
different on some nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7795) Correct handling partitions restored in RENTING state

2018-02-22 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7795:
---

 Summary: Correct handling partitions restored in RENTING state
 Key: IGNITE-7795
 URL: https://issues.apache.org/jira/browse/IGNITE-7795
 Project: Ignite
  Issue Type: Bug
  Components: cache, persistence
Affects Versions: 2.3, 2.2, 2.1, 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.5


Suppose we have a node which has a partition in the RENTING state after start. This 
can happen if the node was stopped during partition eviction.

The started node is the only owner by affinity for this partition.

Currently we own this partition during the rebalance preparation phase, which does 
not seem correct.

If we don't have owners for some partition we should fail the activation process, 
move this partition to the MOVING state and clear it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7773) Add getRebalanceClearingPartitionsLeft JMX metric to .NET

2018-02-20 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7773:
---

 Summary: Add getRebalanceClearingPartitionsLeft JMX metric to .NET
 Key: IGNITE-7773
 URL: https://issues.apache.org/jira/browse/IGNITE-7773
 Project: Ignite
  Issue Type: Task
  Components: platforms
Reporter: Pavel Kovalenko
 Fix For: 2.5


The new metric was introduced in IGNITE-6113.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7750) testMultiThreadStatisticsEnable is flaky on TC

2018-02-19 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7750:
---

 Summary: testMultiThreadStatisticsEnable is flaky on TC
 Key: IGNITE-7750
 URL: https://issues.apache.org/jira/browse/IGNITE-7750
 Project: Ignite
  Issue Type: Bug
  Components: cache
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


{code:java}
class org.apache.ignite.IgniteException: Cache not found [cacheName=cache2]
at 
org.apache.ignite.internal.util.IgniteUtils.convertException(IgniteUtils.java:985)
at 
org.apache.ignite.internal.cluster.IgniteClusterImpl.enableStatistics(IgniteClusterImpl.java:497)
at 
org.apache.ignite.internal.processors.cache.CacheMetricsEnableRuntimeTest$3.run(CacheMetricsEnableRuntimeTest.java:181)
at 
org.apache.ignite.testframework.GridTestUtils$9.call(GridTestUtils.java:1275)
at 
org.apache.ignite.testframework.GridTestThread.run(GridTestThread.java:86)
Caused by: class org.apache.ignite.IgniteCheckedException: Cache not found 
[cacheName=cache2]
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.enableStatistics(GridCacheProcessor.java:4227)
at 
org.apache.ignite.internal.cluster.IgniteClusterImpl.enableStatistics(IgniteClusterImpl.java:494)
... 3 more
{code}


The problem with the test:

1) We don't wait for exchange future completion after "cache2" is started, which 
may lead to a NullPointerException when we try to obtain a reference to "cache2" on 
a node that hasn't completed the exchange future and initialized the cache proxy.
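
A sketch of the proposed fix as a trimmed-down test shape (not the real 
CacheMetricsEnableRuntimeTest), assuming the test framework's 
awaitPartitionMapExchange() helper:

{code:java}
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;

public class EnableStatisticsSketch extends GridCommonAbstractTest {
    public void testEnableStatisticsAfterCacheStart() throws Exception {
        Ignite ignite = startGrids(2);

        ignite.getOrCreateCache("cache2");

        // Proposed fix: make sure every node has completed the exchange triggered by the
        // cache start (and has initialized the cache proxy) before toggling statistics.
        awaitPartitionMapExchange();

        ignite.cluster().enableStatistics(Collections.singleton("cache2"), true);
    }
}
{code}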



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7749) testDiscoCacheReuseOnNodeJoin fails on TC

2018-02-19 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7749:
---

 Summary: testDiscoCacheReuseOnNodeJoin fails on TC
 Key: IGNITE-7749
 URL: https://issues.apache.org/jira/browse/IGNITE-7749
 Project: Ignite
  Issue Type: Bug
  Components: cache
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


{code:java}
java.lang.ClassCastException: 
org.apache.ignite.internal.util.GridConcurrentHashSet cannot be cast to 
java.lang.String
at 
org.apache.ignite.spi.discovery.IgniteDiscoveryCacheReuseSelfTest.assertDiscoCacheReuse(IgniteDiscoveryCacheReuseSelfTest.java:93)
at 
org.apache.ignite.spi.discovery.IgniteDiscoveryCacheReuseSelfTest.testDiscoCacheReuseOnNodeJoin(IgniteDiscoveryCacheReuseSelfTest.java:64)
{code}


There are 2 problems in the test.

1) We don't wait for the final topology version to be set on all nodes and start 
checking the discovery caches immediately after the grids start. This can lead to a 
NullPointerException while accessing the discovery cache history.
2) We don't explicitly use assertEquals(String, Object, Object) for comparing 
Objects, so Java can choose the assertEquals(String, String) overload to compare 
the discovery cache fields which we obtain at runtime via reflection.
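
A small self-contained illustration of the overload issue (JUnit 3 style, matching 
the test framework of that time); the field values here are placeholders:

{code:java}
import java.util.Collections;

import junit.framework.TestCase;

public class AssertEqualsOverloadSketch extends TestCase {
    public void testObjectOverloadIsUsed() {
        // Keep reflected values typed as Object so that assertEquals(String, Object, Object)
        // is selected and the values are never treated as plain Strings.
        Object expected = Collections.singleton("node1");
        Object actual = Collections.singleton("node1");

        assertEquals("Discovery cache field differs", expected, actual);
    }
}
{code}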



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7717) testAssignmentAfterRestarts is flaky on TC

2018-02-15 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7717:
---

 Summary: testAssignmentAfterRestarts is flaky on TC
 Key: IGNITE-7717
 URL: https://issues.apache.org/jira/browse/IGNITE-7717
 Project: Ignite
  Issue Type: Bug
Reporter: Pavel Kovalenko






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-7500) Partition update counters may be inconsistent after rebalancing

2018-01-23 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-7500:
---

 Summary: Partition update counters may be inconsistent after 
rebalancing
 Key: IGNITE-7500
 URL: https://issues.apache.org/jira/browse/IGNITE-7500
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.3
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


Problem:
If a partition rebalance requires more than one batch, we do not send the `Clear` 
flag for it in the last supply message and as a result do not set the updateCounter 
to the right value.

Temporary solution:
Send `Clear` flags in the last supply message for the partitions that were fully 
rebalanced. But we still have a problem with race conditions when setting the 
updateCounter during concurrent rebalance and cache load.

General solution:
https://issues.apache.org/jira/browse/IGNITE-6113



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-6029) Refactor WAL Record serialization and introduce RecordV2Serializer

2017-08-10 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-6029:
---

 Summary: Refactor WAL Record serialization and introduce 
RecordV2Serializer
 Key: IGNITE-6029
 URL: https://issues.apache.org/jira/browse/IGNITE-6029
 Project: Ignite
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.2


Currently the RecordSerializer interface and the default RecordV1Serializer 
implementation are not easily extendable. We should refactor the RecordSerializer 
interface and introduce a new RecordV2Serializer with very basic functionality that 
delegates everything to RecordV1Serializer.
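
A minimal sketch of the delegation idea with a hypothetical, simplified serializer 
interface (the real RecordSerializer works with WAL records and IO buffers):

{code:java}
/** Hypothetical, simplified shape of the serializer interface described above. */
interface RecordSerializer {
    byte[] write(Object record);

    Object read(byte[] data);
}

/** V2 starts as a thin wrapper that delegates everything to the V1 implementation. */
class RecordV2Serializer implements RecordSerializer {
    private final RecordSerializer v1Delegate;

    RecordV2Serializer(RecordSerializer v1Delegate) { this.v1Delegate = v1Delegate; }

    @Override public byte[] write(Object record) { return v1Delegate.write(record); }

    @Override public Object read(byte[] data) { return v1Delegate.read(data); }
}
{code}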




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (IGNITE-6018) Introduce WAL backward compatibility for new DataPage insert/update records

2017-08-09 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-6018:
---

 Summary: Introduce WAL backward compatibility for new DataPage 
insert/update records
 Key: IGNITE-6018
 URL: https://issues.apache.org/jira/browse/IGNITE-6018
 Project: Ignite
  Issue Type: Sub-task
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Priority: Blocker
 Fix For: 2.2


Once we store a reference to the DataRecord for DataPage insert/update records, we 
should be able to read and write both versions of those records (with a reference 
or with a payload) for backward compatibility purposes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (IGNITE-6017) Ignite IGFS: IgfsStreamsSelfTest#testCreateFileFragmented fails

2017-08-09 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-6017:
---

 Summary: Ignite IGFS: IgfsStreamsSelfTest#testCreateFileFragmented 
fails
 Key: IGNITE-6017
 URL: https://issues.apache.org/jira/browse/IGNITE-6017
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Pavel Kovalenko
Priority: Minor
 Fix For: 2.2


The failure can almost never be reproduced locally.

It is presumably the same problem as in IGNITE-5957.

{noformat}
junit.framework.AssertionFailedError: expected:<2> but was:<1>
at junit.framework.Assert.fail(Assert.java:57)
at junit.framework.Assert.failNotEquals(Assert.java:329)
at junit.framework.Assert.assertEquals(Assert.java:78)
at junit.framework.Assert.assertEquals(Assert.java:234)
at junit.framework.Assert.assertEquals(Assert.java:241)
at junit.framework.TestCase.assertEquals(TestCase.java:409)
at 
org.apache.ignite.internal.processors.igfs.IgfsStreamsSelfTest.testCreateFileFragmented(IgfsStreamsSelfTest.java:264)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at junit.framework.TestCase.runTest(TestCase.java:176)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.runTestInternal(GridAbstractTest.java:2000)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.access$000(GridAbstractTest.java:132)
at 
org.apache.ignite.testframework.junits.GridAbstractTest$5.run(GridAbstractTest.java:1915)
at java.lang.Thread.run(Thread.java:745)
Aug 09, 2017 1:15:56 AM org.apache.ignite.logger.java.JavaLogger error
SEVERE: DataStreamer operation failed.
class org.apache.ignite.IgniteCheckedException: Data streamer has been 
cancelled: DataStreamerImpl 
[rcvr=org.apache.ignite.internal.processors.datastreamer.DataStreamerCacheUpdaters$BatchedSorted@13c54950,
 ioPlcRslvr=null, cacheName=igfs-internal-igfs-data, bufSize=512, 
parallelOps=16, timeout=-1, autoFlushFreq=0, 
bufMappings={908d1a4c-b352-4af5-b039-ded60c20=Buffer [node=TcpDiscoveryNode 
[id=908d1a4c-b352-4af5-b039-ded60c20, addrs=[127.0.0.1], 
sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, 
lastExchangeTime=1502241356486, loc=true, ver=2.2.0#19700101-sha1:, 
isClient=false], isLocNode=true, idGen=0, 
sem=java.util.concurrent.Semaphore@2bdbd7f0[Permits = 16], 
batchTopVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], entriesCnt=1, 
locFutsSize=0, reqsSize=0]}, cacheObjProc=GridProcessorAdapter [], 
cacheObjCtx=org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryContext@6e3de40e,
 cancelled=true, failCntr=0, activeFuts=GridConcurrentHashSet 
[elements=[GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, 
hash=431192362], GridFutureAdapter [ignoreInterrupts=false, state=INIT, 
res=null, hash=625896337], GridFutureAdapter [ignoreInterrupts=false, 
state=INIT, res=null, hash=1440203156]]], jobPda=null, depCls=null, 
fut=DataStreamerFuture [super=GridFutureAdapter [ignoreInterrupts=false, 
state=DONE, res=null, hash=1612913644]], publicFut=IgniteFuture 
[orig=DataStreamerFuture [super=GridFutureAdapter [ignoreInterrupts=false, 
state=DONE, res=null, hash=1612913644]]], disconnectErr=null, closed=true, 
lastFlushTime=1502241356435, skipStore=false, keepBinary=false, maxRemapCnt=32, 
remapSem=java.util.concurrent.Semaphore@194e0ba1[Permits = 2147483647], 
remapOwning=false]
at 
org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$5.apply(DataStreamerImpl.java:865)
at 
org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$5.apply(DataStreamerImpl.java:834)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:382)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:494)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:473)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:461)
at 
org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$Buffer.onNodeLeft(DataStreamerImpl.java:1757)
at 
org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$6.run(DataStreamerImpl.java:952)
at 
org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6687)
at 
org.apache.ignite.internal.processors.closure.GridClosureProcessor$1.body(GridClosureProcessor.java:817)
at 
