Re: Partition states validation has failed for group: CUSTOMER_KV

2021-09-11 Thread Pavel Kovalenko
Hi Naveen,

I think just stopping updates is not enough to make a consistent snapshot
of the partition stores.
You must also ensure that all updates are checkpointed to disk. Otherwise, to
restore a valid snapshot you must copy the WAL as well as the partition stores.
You can try to deactivate the source cluster, make a copy of the partition
stores, and then activate it again.
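
A minimal sketch of that sequence from Java, assuming an Ignite 2.8.x client
where IgniteCluster.active(boolean) is available; the configuration path and
the copy step are placeholders:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class SnapshotByDeactivation {
    public static void main(String[] args) {
        // Hypothetical client configuration pointing at the source cluster.
        try (Ignite ignite = Ignition.start("client-config.xml")) {
            // Deactivation stops updates and lets the persisted state settle on disk.
            ignite.cluster().active(false);

            // Manual step (outside this program): copy each node's partition stores
            // from its work directory, e.g. $IGNITE_HOME/work/db/<consistentId>.

            // Bring the source cluster back online.
            ignite.cluster().active(true);
        }
    }
}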


Thu, 9 Sep 2021 at 15:42, Naveen Kumar :

> Any pointers or clues on this issue.
>
> Is it an issue with the source cluster or something to do with the target
> cluster?
> Does a clean restart of the source cluster help here in any
> way, e.g. inconsistent partitions becoming consistent?
>
> Thanks
>
> On Wed, Sep 8, 2021 at 12:12 PM Naveen Kumar 
> wrote:
>
>> Hi
>>
>> We are using Ignite 2.8.1
>>
>> We are trying to build a new cluster by restoring the datastore from
>> another working cluster.
>> Steps followed
>>
>> 1. Stopped the updates on the source cluster
>> 2. Took a copy of datastore on each node and transferred to the
>> destination node
>> 3. started nodes on the destination cluster
>>
>> After the cluster is activated, we could see a count mismatch for 2 caches
>> (around 15K records) and we found some warnings for these 2 caches.
>> The exact warning is attached:
>>
>> [GridDhtPartitionsExchangeFuture] Partition states validation has failed
>> for group: CL_CUSTOMER_KV, msg: Partitions cache sizes are inconsistent for
>> part 310: [lvign002b..com=874, lvign001b..com=875] etc..
>>
>> What could be the reason for this count mismatch.
>>
>> Thanks
>>
>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Naveen Bandaru
>>
>
>
> --
> Thanks & Regards,
> Naveen Bandaru
>


Re: Regarding Partition Map exchange Triggers

2020-12-12 Thread Pavel Kovalenko
Hi

> According to the exception log of the following topic, a client node
joins the cluster and blocks a SQL query on the transactional cache. Is
this true?
> Now it seems that the relevant explanations are confusing?

Looking at your stack traces, it seems that the cache you accessed in the
SQL query was stopped right before the new client node joined.
All topology events are processed one by one, so the initial blocking time
was caused by the cache-stop PME rather than the client node join.
However, SQL and cache endpoints have different mechanisms for dealing with
partition affinity. My explanations in this topic were only about cache
operations.
For SQL it seems that a client node join can still block a SQL query for a
while, but client PME is very fast, so its impact should be minimal.



Sat, 12 Dec 2020 at 04:34, 38797715 <38797...@qq.com>:

> Hi,
>
> According to the exception log of the following topic, a client node joins
> the cluster and blocks a SQL query on the transactional cache. Is this true?
>
>
> http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-affinity-ready-future-for-topology-version-AffinityTopologyVersion-td34823.html
>
> Now it seems that the relevant explanations are confusing?
> On 2020/12/11 8:21 PM, Pavel Kovalenko wrote:
>
> Hi,
>
> I think the information on the wiki saying that PME is not triggered in
> some cases is wrong. It should be fixed.
> Actually, PME is triggered in all cases, but for some of them it doesn't
> block cache operations, or the blocking time is minimized.
> Most optimizations for minimizing the blocking time of PME were done
> in Ignite 2.8.
>
> Thick client join/left PME - doesn't block operations at all.
>
> Other events can be ordered by their potential blocking time:
> 1. Non-baseline node left/join - minimal
> 2. Baseline node stop/left
> 3. Baseline node join
> 4. Baseline change - heaviest operation
>
> > *for the end user , is this invoked when we do ignite.getOrCreate( xx )
> and ignite.cache(xx )*
>
> Yes.
>
Fri, 11 Dec 2020 at 14:55, VeenaMithare :
>
>> Hi ,
>>
>>
>> I can see the triggers for PME initiation here :
>>
>> https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
>>
>> Triggers
>> Events which causes exchange
>>
>> Topology events:
>>
>> Node Join (EVT_NODE_JOINED) - new node discovered and joined topology
>> (exchange is done after a node is included into the ring). This event
>> doesn't trigger the PME if a thick client connects the cluster and an
>> Ignite
>> version is 2.8 or later.
>>
>>
>> --> *This means in ignite 2.8 or higher, this is triggered only if nodes
>> that participate in the baseline topology are added ?*
>>
>>
>> Node Left (EVT_NODE_LEFT) - correct shutdown with call ignite.close. This
>> event doesn't trigger the PME in Ignite 2.8 and later versions if a node
>> belonging to an existing baseline topology leaves.
>>
>> --> *This means this is not triggered at all 2.8.1 or higher if shutdown
>> cleanly ? i.e. if this is called : Ignition.stop(false) *
>>
>>
>> Node Failed (EVT_NODE_FAILED) - detected unresponsive node, probably
>> crashed
>> and is considered failed
>>
>> --> *This means this is  triggered at all 2.8.1 or higher for baseline
>> nodes
>> or any thick client node ?*
>>
>> Custom events:
>>
>> Activation / Deactivation / Baseline topology set -
>> ChangeGlobalStateMessage
>> Dynamic cache start / Dynamic cache stop - DynamicCacheChangeBatch
>>
>> --> *for the end user , is this invoked when we do ignite.getOrCreate( xx
>> )
>> and ignite.cache(xx )*
>>
>>
>> Snapshot create / restore - SnapshotDiscoveryMessage
>> Global WAL enable / disable - WalStateAbstractMessage
>> Late affinity assignment - CacheAffinityChangeMessage
>>
>>
>> regards,
>> Veena.
>>
>>
>>
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>
>


Re: Regarding Partition Map exchange Triggers

2020-12-11 Thread Pavel Kovalenko
Your thoughts are right.
If the cache exists, no PME will be started.
If it doesn't exist, the getOrCreate() method will create it and start a PME,
while the cache() method will throw an exception or return null (I don't
remember which exactly).
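
A minimal sketch of the difference, assuming a running Ignite instance and an
illustrative cache name ("myCache"); the exact null-vs-exception behaviour of
cache() for a missing cache should be checked against the Javadoc of your
Ignite version:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheLookupSketch {
    static void lookup(Ignite ignite) {
        // Only looks up the cache; nothing is started, so no PME is triggered.
        // For a cache that was never created this returns null (or fails).
        IgniteCache<Integer, String> existing = ignite.cache("myCache");

        // Creates the cache if it is missing. The creation is a dynamic cache
        // start, which triggers a PME; if the cache already exists, only a
        // local proxy is returned and no PME happens.
        IgniteCache<Integer, String> created =
            ignite.getOrCreateCache(new CacheConfiguration<Integer, String>("myCache"));
    }
}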

Fri, 11 Dec 2020 at 17:48, VeenaMithare :

> HI Pavel,
>
> Thank you for the reply.
>
> >>> *for the end user , is this invoked when we do ignite.getOrCreate( xx )
> and ignite.cache(xx )*
>
> >>Yes.
>
>
> getOrCreateCache would create a cache if it doesn't exist. I would guess this
> would have the effect of starting the cache if it doesn't exist. And I think
> this would start the PME.
>
> And ignite.cache would only get an instance of the existing cache. In this
> case, I would imagine the cache already exists and a reference to the cache
> is returned. I would have thought this WOULD NOT have the effect of
> starting
> the cache and hence should not start the PME.
>
> Please guide. Below is the documentation of getOrCreateCache and
> ignite.cache from the javadocs.
>
> getOrCreateCache(CacheConfiguration cacheCfg)
> Gets existing cache with the given name or creates new one with the given
> configuration.
>
>  IgniteCache cache(String name)
>   throws javax.cache.CacheException
> Gets an instance of IgniteCache API. IgniteCache is a fully-compatible
> implementation of JCache (JSR 107) specification.
>
> regards,
> Veena.
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Regarding Partition Map exchange Triggers

2020-12-11 Thread Pavel Kovalenko
Hi,

I think the information on the wiki saying that PME is not triggered in
some cases is wrong. It should be fixed.
Actually, PME is triggered in all cases, but for some of them it doesn't
block cache operations, or the blocking time is minimized.
Most optimizations for minimizing the blocking time of PME were done
in Ignite 2.8.

Thick client join/left PME - doesn't block operations at all.

Other events can be ordered by their potential blocking time:
1. Non-baseline node left/join - minimal
2. Baseline node stop/left
3. Baseline node join
4. Baseline change - heaviest operation

> *for the end user , is this invoked when we do ignite.getOrCreate( xx )
and ignite.cache(xx )*

Yes.

Fri, 11 Dec 2020 at 14:55, VeenaMithare :

> Hi ,
>
>
> I can see the triggers for PME initiation here :
>
> https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
>
> Triggers
> Events which causes exchange
>
> Topology events:
>
> Node Join (EVT_NODE_JOINED) - new node discovered and joined topology
> (exchange is done after a node is included into the ring). This event
> doesn't trigger the PME if a thick client connects the cluster and an
> Ignite
> version is 2.8 or later.
>
>
> --> *This means in ignite 2.8 or higher, this is triggered only if nodes
> that participate in the baseline topology are added ?*
>
>
> Node Left (EVT_NODE_LEFT) - correct shutdown with call ignite.close. This
> event doesn't trigger the PME in Ignite 2.8 and later versions if a node
> belonging to an existing baseline topology leaves.
>
> --> *This means this is not triggered at all 2.8.1 or higher if shutdown
> cleanly ? i.e. if this is called : Ignition.stop(false) *
>
>
> Node Failed (EVT_NODE_FAILED) - detected unresponsive node, probably
> crashed
> and is considered failed
>
> --> *This means this is  triggered at all 2.8.1 or higher for baseline
> nodes
> or any thick client node ?*
>
> Custom events:
>
> Activation / Deactivation / Baseline topology set -
> ChangeGlobalStateMessage
> Dynamic cache start / Dynamic cache stop - DynamicCacheChangeBatch
>
> --> *for the end user , is this invoked when we do ignite.getOrCreate( xx )
> and ignite.cache(xx )*
>
>
> Snapshot create / restore - SnapshotDiscoveryMessage
> Global WAL enable / disable - WalStateAbstractMessage
> Late affinity assignment - CacheAffinityChangeMessage
>
>
> regards,
> Veena.
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

2020-05-01 Thread Pavel Kovalenko
Hello,

I don't completely understand from your message: did the exchange finally
finish, or were you getting this WARN message the whole time?

Fri, 1 May 2020 at 12:32, Ilya Kasnacheev :

> Hello!
>
> This description sounds like a typical hanging Partition Map Exchange, but
> you should be able to see that in logs.
> If you don't, you can collect thread dumps from all nodes with jstack and
> check it for any stalling operations (or share with us).
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Fri, 1 May 2020 at 11:53, userx :
>
>> Hi Pavel,
>>
>> I am using 2.8 and still getting the same issue. Here is the ecosystem
>>
>> 19 Ignite servers (S1 to S19) running at 16GB of max JVM and in persistent
>> mode.
>>
>> 96 Clients (C1 to C96)
>>
>> There are 19 machines, 1 Ignite server is started on 1 machine. The
>> clients
>> are evenly distributed across machines.
>>
>> C19 tries to create a cache, it gets a timeout exception as i have 5 mins
>> of
>> timeout. When I looked into the coordinator logs, between a span of 5
>> minutes, it gets the messages
>>
>>
>> 2020-04-24 15:37:09,434 WARN [exchange-worker-#45%S1%] {}
>>
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
>> - Unable to await partitions release latch within timeout. Some nodes have
>> not sent acknowledgement for latch completion. It's possible due to
>> unfinishined atomic updates, transactions or not released explicit locks
>> on
>> that nodes. Please check logs for errors on nodes with ids reported in
>> latch
>> `pendingAcks` collection [latch=ServerLatch [permits=4,
>> pendingAcks=HashSet
>> [84b8416c-fa06-4544-9ce0-e3dfba41038a,
>> 19bd7744-0ced-4123-a35f-ddf0cf9f55c4,
>> 533af8f9-c0f6-44b6-92d4-658f86ffaca0,
>> 1b31cb25-abbc-4864-88a3-5a4df37a0cf4],
>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>> topVer=AffinityTopologyVersion [topVer=174, minorTopVer=1]
>>
>> And the 4 nodes which have not been able to acknowledge latch completion
>> are
>> S14, S7, S18, S4
>>
>> I went to see the logs of S4, it just records the addition of C19 into
>> topology and then C19 leaving it after 5 minutes. The only thing is that
>> in
>> GC I see this consistently "Total time for which application threads were
>> stopped: 0.0006225 seconds, Stopping threads took: 0.887 seconds"
>>
>> I understand that until the time all the atomic updates and transactions
>> are
>> finished Clients are not able to create caches by communicating with
>> Coordinator but is there a way around ?
>>
>> So the question is that is it still prevalent on 2.8 ?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>
>


Re: excessive timeouts and load on new cache creations

2019-11-22 Thread Pavel Kovalenko
Hi Ibrahim,

I see you have 317 cache groups in your cluster (`Full map updating for 317
groups performed in 105 ms.`).
Each cache group has its own partition map and affinity map, which require
memory that resides in the old generation.
During cache creation a distributed PME happens, and all partition and
affinity maps are updated.
This results in huge memory consumption and leads to long GC pauses.

It's recommended to use as few cache groups as possible. If your caches have
the same affinity distribution, you can place them in one cache group.
That should help to reduce memory consumption.
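
A minimal configuration sketch, assuming two caches that share the same
affinity settings; the cache and group names are illustrative:

import org.apache.ignite.configuration.CacheConfiguration;

public class CacheGroupSketch {
    static CacheConfiguration<?, ?>[] cacheConfigs() {
        // Both caches are placed in one cache group, so they share a single
        // partition map and affinity map instead of maintaining one per cache.
        CacheConfiguration<Integer, String> cacheA = new CacheConfiguration<>("cacheA");
        cacheA.setGroupName("sharedGroup");

        CacheConfiguration<Integer, String> cacheB = new CacheConfiguration<>("cacheB");
        cacheB.setGroupName("sharedGroup");

        return new CacheConfiguration<?, ?>[] { cacheA, cacheB };
    }
}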



Thu, 21 Nov 2019 at 19:24, ihalilaltun :

> Hi Anton,
>
> Timeouts can be found at the logs that i shared;
>
> [query-#13207879][GridMapQueryExecutor] Failed to execute local query.
> org.apache.ignite.cache.query.QueryCancelledException: The query was
> cancelled while executing.
>
> huge loads on server nodes are monitored via zabbix agent;
> 
>
>
> just after cache creation we cannot return to requests, these metrics are
> monitored via prometheus, here is the SS;
> 
>
> for some reason, timeouts occur after cache proxy initializations (cache
> creations)
>
>
>
> -
> İbrahim Halil Altun
> Senior Software Engineer @ Segmentify
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

2019-10-11 Thread Pavel Kovalenko
Ibrahim,

I've checked logs and found the following issue:
[2019-09-27T15:00:06,164][ERROR][sys-stripe-32-#33][atomic] Received
message without registered handler (will ignore)
[msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1,
arr=[6389728]]], node=e39bd72e-acee-48a7-ad45-2019dfff9df4,
locTopVer=AffinityTopologyVersion [topVer=92, minorTopVer=1], ...

This response was needed to complete (finish) AtomicUpdateFuture:
[2019-09-27T15:00:36,287][WARN ][exchange-worker-#219][diagnostic] >>>
GridDhtAtomicSingleUpdateFuture [allUpdated=true,
super=GridDhtAtomicAbstractUpdateFuture [futId=6389728, resCnt=0,
addedReader=false, dhtRes={e39bd72e-acee-48a7-ad45-2019dfff9df4=[res=false,
size=1, nearSize=0]}]]

During the exchange, all nodes wait for atomic updates and transactions to
complete and then send an acknowledgment to the coordinator to continue
processing the exchange.
Because the atomic update on that node was not finished, the node didn't send
the acknowledgement to the coordinator, and that's why you have seen
messages like:
[2019-09-27T15:00:17,727][WARN ][exchange-worker-#219][
GridDhtPartitionsExchangeFuture] Unable to await partitions release latch
within timeout: ServerLatch [permits=1, pendingAcks=[
*3561ac09-6752-4e2e-8279-d975c268d045*], super=CompletableLatch
[id=exchange, topVer=AffinityTopologyVersion [topVer=92, minorTopVer=2]]]

The handler to complete the AtomicUpdateFuture was not found due to a
concurrency issue in the 2.7.6 codebase. There is a map that contains handlers
for cache messages:
org/apache/ignite/internal/processors/cache/GridCacheIoManager.java:1575
In 2.7.6 it's just a HashMap with volatile read/write publishing. However,
because of improper synchronization between adding and getting a handler, in
rare cases this can lead to a false-positive "missing handler" for a message,
which is what you see in the logs.
This issue was fixed in https://issues.apache.org/jira/browse/IGNITE-8006, which
will be part of the 2.8 release.
However, if it's critical, you can make a hotfix yourself:
1. Check out the ignite-2.7.6 branch from https://github.com/apache/ignite
2. Change the HashMap declaration to ConcurrentHashMap here:
org/apache/ignite/internal/processors/cache/GridCacheIoManager.java:1575
3. Rebuild the ignite-core module and deploy the new ignite-core jar on your
server nodes.
This hotfix will work for your case.

Another option is you can use the last version of GridGain Community
Edition instead of Apache Ignite which is fully compatible with Ignite.

Regarding the message:
[sys-#337823][GridDhtPartitionsExchangeFuture] Partition states validation
has failed for group: acc_1306acd07be78000_userPriceDrop. Partitions
cache sizes are inconsistent for Part 129

I see that you create caches with an ExpiryPolicy. If you use expiry policies
you can have different partition sizes on primary and backup nodes, because
expiration is not synchronized and is performed independently on different
nodes.
So it's OK to see such warnings. They are false positives. Such warning
messages will no longer be printed for caches with an expiry policy set; that
was fixed in https://issues.apache.org/jira/browse/IGNITE-12206


Fri, 11 Oct 2019 at 14:40, ihalilaltun :

> Hi Pavel,
>
> Here is the logs from node with
> localId:3561ac09-6752-4e2e-8279-d975c268d045
> ignite-2019-10-06.gz
> <
> http://apache-ignite-users.70518.x6.nabble.com/file/t2515/ignite-2019-10-06.gz>
>
>
> cache creation is done with java code on our side, we use getOrCreateCache
> method, here is the piece of code how we configure and create caches;
>
> ...
> ignite.getOrCreateCache(getCommonCacheConfigurationForAccount(accountId,
> initCacheType));
>
> private  CacheConfiguration
> getCommonCacheConfigurationForAccount(String accountId, IgniteCacheType
> cacheType) {
> CacheConfiguration cacheConfiguration = new
> CacheConfiguration<>();
>
>
> cacheConfiguration.setName(accountId.concat(cacheType.getCacheNameSuffix()));
> if (cacheType.isSqlTable()) {
> cacheConfiguration.setIndexedTypes(cacheType.getKeyClass(),
> cacheType.getValueClass());
> cacheConfiguration.setSqlSchema(accountId);
> cacheConfiguration.setSqlEscapeAll(true);
> }
> cacheConfiguration.setEventsDisabled(true);
> cacheConfiguration.setStoreKeepBinary(true);
> cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
> cacheConfiguration.setBackups(1);
> if (!cacheType.getCacheGroupName().isEmpty()) {
> cacheConfiguration.setGroupName(cacheType.getCacheGroupName());
> }
> if (cacheType.getExpiryDurationInDays().getDurationAmount() > 0) {
>
>
> cacheConfiguration.setExpiryPolicyFactory(TouchedExpiryPolicy.factoryOf(cacheType.getExpiryDurationInDays()));
> }
> return cacheConfiguration;
> }
>
>
>
> -
> İbrahim Halil Altun
> Senior Software Engineer @ Segmentify
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

2019-10-10 Thread Pavel Kovalenko
Ibrahim,

Could you please also share the cache configuration that is used for
dynamic creation?

Thu, 10 Oct 2019 at 19:09, Pavel Kovalenko :

> Hi Ibrahim,
>
> I see that one node didn't send acknowledgment during cache creation:
> [2019-09-27T15:00:17,727][WARN
> ][exchange-worker-#219][GridDhtPartitionsExchangeFuture] Unable to await
> partitions release latch within timeout: ServerLatch [permits=1,
> pendingAcks=[*3561ac09-6752-4e2e-8279-d975c268d045*],
> super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion
> [topVer=92, minorTopVer=2]]]
>
> Do you have any logs from the node with id
> "3561ac09-6752-4e2e-8279-d975c268d045"?
> You can find this node by grepping the following
> "locNodeId=3561ac09-6752-4e2e-8279-d975c268d045" like in line:
> [2019-09-27T15:24:03,532][INFO ][main][TcpDiscoverySpi] Successfully bound
> to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0,*
> locNodeId=70b49e00-5b9f-4459-9055-a05ce358be10*]
>
>
> Wed, 9 Oct 2019 at 17:34, ihalilaltun :
>
>> Hi There Igniters,
>>
>> We had very strange cluster behaviour while creating new caches on the
>> fly.
>> Just after the caches are created we start getting the following warnings
>> from all cluster nodes, including the coordinator node;
>>
>> [2019-09-27T15:00:17,727][WARN
>> ][exchange-worker-#219][GridDhtPartitionsExchangeFuture] Unable to await
>> partitions release latch within timeout: ServerLatch [permits=1,
>> pendingAcks=[3561ac09-6752-4e2e-8279-d975c268d045], super=CompletableLatch
>> [id=exchange, topVer=AffinityTopologyVersion [topVer=92, minorTopVer=2]]]
>>
>> After a while all client nodes seem to be disconnected from the cluster
>> with
>> no logs on the clients' side.
>>
>> Coordinator node has many logs like;
>> 2019-09-27T15:00:03,124][WARN
>> ][sys-#337823][GridDhtPartitionsExchangeFuture] Partition states
>> validation
>> has failed for group: acc_1306acd07be78000_userPriceDrop. Partitions cache
>> sizes are inconsistent for Part 129:
>> [9497f1c4-13bd-4f90-bbf7-be7371cea22f=757
>> 1486cd47-7d40-400c-8e36-b66947865602=2427 ] Part 138:
>> [1486cd47-7d40-400c-8e36-b66947865602=2463
>> f9cf594b-24f2-4a91-8d84-298c97eb0f98=736 ] Part 156:
>> [b7782803-10da-45d8-b042-b5b4a880eb07=672
>> 9f0c2155-50a4-4147-b444-5cc002cf6f5d=2414 ] Part 284:
>> [b7782803-10da-45d8-b042-b5b4a880eb07=690
>> 1486cd47-7d40-400c-8e36-b66947865602=1539 ] Part 308:
>> [1486cd47-7d40-400c-8e36-b66947865602=2401
>> 7750e2f1-7102-4da2-9a9d-ea202f73905a=706 ] Part 362:
>> [1486cd47-7d40-400c-8e36-b66947865602=2387
>> 7750e2f1-7102-4da2-9a9d-ea202f73905a=697 ] Part 434:
>> [53c253e1-ccbe-4af1-a3d6-178523023c8b=681
>> 1486cd47-7d40-400c-8e36-b66947865602=1541 ] Part 499:
>> [1486cd47-7d40-400c-8e36-b66947865602=2505
>> 7750e2f1-7102-4da2-9a9d-ea202f73905a=699 ] Part 622:
>> [1486cd47-7d40-400c-8e36-b66947865602=2436
>> e97a0f3f-3175-49f7-a476-54eddd59d493=662 ] Part 662:
>> [b7782803-10da-45d8-b042-b5b4a880eb07=686
>> 1486cd47-7d40-400c-8e36-b66947865602=2445 ] Part 699:
>> [1486cd47-7d40-400c-8e36-b66947865602=2427
>> f9cf594b-24f2-4a91-8d84-298c97eb0f98=646 ] Part 827:
>> [62a05754-3f3a-4dc8-b0fa-53c0a0a0da63=703
>> 1486cd47-7d40-400c-8e36-b66947865602=1549 ] Part 923:
>> [1486cd47-7d40-400c-8e36-b66947865602=2434
>> a9e9eaba-d227-4687-8c6c-7ed522e6c342=706 ] Part 967:
>> [62a05754-3f3a-4dc8-b0fa-53c0a0a0da63=673
>> 1486cd47-7d40-400c-8e36-b66947865602=1595 ] Part 976:
>> [33301384-3293-417f-b94a-ed36ebc82583=666
>> 1486cd47-7d40-400c-8e36-b66947865602=2384 ]
>>
>> Coordinator's log and one of the cluster node's log is attached.
>> coordinator_log.gz
>> <
>> http://apache-ignite-users.70518.x6.nabble.com/file/t2515/coordinator_log.gz>
>>
>> cluster_node_log.gz
>> <
>> http://apache-ignite-users.70518.x6.nabble.com/file/t2515/cluster_node_log.gz>
>>
>>
>> Any help/comment is appreciated.
>>
>> Thanks.
>>
>>
>>
>>
>>
>> -
>> İbrahim Halil Altun
>> Senior Software Engineer @ Segmentify
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>
>


Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

2019-10-10 Thread Pavel Kovalenko
Hi Ibrahim,

I see that one node didn't send acknowledgment during cache creation:
[2019-09-27T15:00:17,727][WARN
][exchange-worker-#219][GridDhtPartitionsExchangeFuture] Unable to await
partitions release latch within timeout: ServerLatch [permits=1,
pendingAcks=[*3561ac09-6752-4e2e-8279-d975c268d045*],
super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion
[topVer=92, minorTopVer=2]]]

Do you have any logs from the node with id
"3561ac09-6752-4e2e-8279-d975c268d045"?
You can find this node by grepping the following
"locNodeId=3561ac09-6752-4e2e-8279-d975c268d045" like in line:
[2019-09-27T15:24:03,532][INFO ][main][TcpDiscoverySpi] Successfully bound
to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0,*
locNodeId=70b49e00-5b9f-4459-9055-a05ce358be10*]


Wed, 9 Oct 2019 at 17:34, ihalilaltun :

> Hi There Igniters,
>
> We had very strange cluster behaviour while creating new caches on the
> fly.
> Just after the caches are created we start getting the following warnings
> from all cluster nodes, including the coordinator node;
>
> [2019-09-27T15:00:17,727][WARN
> ][exchange-worker-#219][GridDhtPartitionsExchangeFuture] Unable to await
> partitions release latch within timeout: ServerLatch [permits=1,
> pendingAcks=[3561ac09-6752-4e2e-8279-d975c268d045], super=CompletableLatch
> [id=exchange, topVer=AffinityTopologyVersion [topVer=92, minorTopVer=2]]]
>
> After a while all client nodes seem to be disconnected from the cluster with
> no logs on the clients' side.
>
> Coordinator node has many logs like;
> 2019-09-27T15:00:03,124][WARN
> ][sys-#337823][GridDhtPartitionsExchangeFuture] Partition states validation
> has failed for group: acc_1306acd07be78000_userPriceDrop. Partitions cache
> sizes are inconsistent for Part 129:
> [9497f1c4-13bd-4f90-bbf7-be7371cea22f=757
> 1486cd47-7d40-400c-8e36-b66947865602=2427 ] Part 138:
> [1486cd47-7d40-400c-8e36-b66947865602=2463
> f9cf594b-24f2-4a91-8d84-298c97eb0f98=736 ] Part 156:
> [b7782803-10da-45d8-b042-b5b4a880eb07=672
> 9f0c2155-50a4-4147-b444-5cc002cf6f5d=2414 ] Part 284:
> [b7782803-10da-45d8-b042-b5b4a880eb07=690
> 1486cd47-7d40-400c-8e36-b66947865602=1539 ] Part 308:
> [1486cd47-7d40-400c-8e36-b66947865602=2401
> 7750e2f1-7102-4da2-9a9d-ea202f73905a=706 ] Part 362:
> [1486cd47-7d40-400c-8e36-b66947865602=2387
> 7750e2f1-7102-4da2-9a9d-ea202f73905a=697 ] Part 434:
> [53c253e1-ccbe-4af1-a3d6-178523023c8b=681
> 1486cd47-7d40-400c-8e36-b66947865602=1541 ] Part 499:
> [1486cd47-7d40-400c-8e36-b66947865602=2505
> 7750e2f1-7102-4da2-9a9d-ea202f73905a=699 ] Part 622:
> [1486cd47-7d40-400c-8e36-b66947865602=2436
> e97a0f3f-3175-49f7-a476-54eddd59d493=662 ] Part 662:
> [b7782803-10da-45d8-b042-b5b4a880eb07=686
> 1486cd47-7d40-400c-8e36-b66947865602=2445 ] Part 699:
> [1486cd47-7d40-400c-8e36-b66947865602=2427
> f9cf594b-24f2-4a91-8d84-298c97eb0f98=646 ] Part 827:
> [62a05754-3f3a-4dc8-b0fa-53c0a0a0da63=703
> 1486cd47-7d40-400c-8e36-b66947865602=1549 ] Part 923:
> [1486cd47-7d40-400c-8e36-b66947865602=2434
> a9e9eaba-d227-4687-8c6c-7ed522e6c342=706 ] Part 967:
> [62a05754-3f3a-4dc8-b0fa-53c0a0a0da63=673
> 1486cd47-7d40-400c-8e36-b66947865602=1595 ] Part 976:
> [33301384-3293-417f-b94a-ed36ebc82583=666
> 1486cd47-7d40-400c-8e36-b66947865602=2384 ]
>
> Coordinator's log and one of the cluster node's log is attached.
> coordinator_log.gz
> <
> http://apache-ignite-users.70518.x6.nabble.com/file/t2515/coordinator_log.gz>
>
> cluster_node_log.gz
> <
> http://apache-ignite-users.70518.x6.nabble.com/file/t2515/cluster_node_log.gz>
>
>
> Any help/comment is appreciated.
>
> Thanks.
>
>
>
>
>
> -
> İbrahim Halil Altun
> Senior Software Engineer @ Segmentify
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: GridCachePartitionExchangeManager Null pointer exception

2019-10-07 Thread Pavel Kovalenko
Mahesh,

An assertion error occurs if you run the node with assertions enabled (JVM
flag -ea). If assertions are disabled, it leads to the NullPointerException
you see in the logs.

Sat, 5 Oct 2019 at 16:47, Mahesh Renduchintala <
mahesh.renduchint...@aline-consulting.com>:

> Pavel, I don't have the logs for the client node. It happened 2 times in
> our cluster till now in 45 days. Difficult to reproduce.
> But the logs show a null point exception on server nodes... 1st one server
> node (192.168.1.6) went down and then the other.
>
> In 12255, it is noted that an assertion could be seen on the coordinator,
> but this is a null pointer exception.
> Agree, the race condition, described in 12255 seems similar to the logs i
> attached. But just does not explain the null pointer exception.
>
> The race is the following:
>
> Client node (with some configured caches) joins to a cluster sending
> SingleMessage to coordinator during client PME. This SingleMessage contains
> affinity fetch requests for all cluster caches. When SingleMessage is
> in-flight server nodes finish client PME and also process and finish cache
> destroy PME. When a cache is destroyed affinity for that cache is cleared.
> When SingleMessage delivered to coordinator it doesn’t have affinity for a
> requested cache because the cache is already destroyed. *It leads to
> assertion error on the coordinator* and unpredictable behavior on the
> client node.
>
>
>


Re: GridCachePartitionExchangeManager Null pointer exception

2019-10-04 Thread Pavel Kovalenko
Mahesh,

Do you have logs from the following thick client?
TcpDiscoveryNode [id=5204d16d-e6fc-4cc3-a1d9-17edf59f961e,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.1.171],
sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.1.171:0],
discPort=0, order=1146, intOrder=579, lastExchangeTime=1569947734191,
loc=false, ver=2.7.6#20190911-sha1:21f7ca41, *isClient=true*]
I need to check it, maybe I'm missing something.

Fri, 4 Oct 2019 at 05:08, Mahesh Renduchintala <
mahesh.renduchint...@aline-consulting.com>:

> Hello Pavel,
>
> OK. I am a little bit not clear on the workaround you suggested on your
> previous comment
> As a workaround, I can suggest to not explicitly declare caches in the
> client configuration. During joining to cluster process, the client node
> will receive all configured caches from server nodes.
>
> In my scenario,
> a) there are absolutely no caches declared on my thick client side.
> b) The cache templates are declared on the server nodes and via SQL
> generated from thick client side, the caches are created.
>
> How do I implement the workaround you suggested?
>
> regards
> Mahesh
>
>


Re: GridCachePartitionExchangeManager Null pointer exception

2019-10-03 Thread Pavel Kovalenko
Mahesh,

According to your logs and the exception I see, the issue you mentioned is
not related to your problem.
The problem similar to IGNITE-10010 is
https://issues.apache.org/jira/browse/IGNITE-9562

You have a thick client joining the server topology:
[16:35:34,948][INFO][disco-event-worker-#50][GridDiscoveryManager] Added
new node to topology: TcpDiscoveryNode
[id=5204d16d-e6fc-4cc3-a1d9-17edf59f961e, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 192.168.1.171], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /
192.168.1.171:0], discPort=0, order=1146, intOrder=579,
lastExchangeTime=1569947734191, loc=false,
ver=2.7.6#20190911-sha1:21f7ca41, *isClient=true*]
This causes a Partition Map Exchange on version *[1146, 0]*:
[16:35:34,949][INFO][exchange-worker-#51][time] Started exchange init
[topVer=AffinityTopologyVersion *[topVer=1146, minorTopVer=0]*,
mvccCrd=MvccCoordinator [nodeId=84de670f-49e6-4dd8-9d14-4855fdd5acdf,
crdVer=1569681573983, topVer=AffinityTopologyVersion [topVer=2,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=5204d16d-e6fc-4cc3-a1d9-17edf59f961e, customEvt=null,
allowMerge=true]
Right after that you have 2 cache destroy events.
And the server node goes down while processing a single message from the thick
client on version *[1146, 0]*:
[16:36:08,567][SEVERE][sys-#37524][GridCacheIoManager] Failed processing
message [senderId=5204d16d-e6fc-4cc3-a1d9-17edf59f961e,
msg=GridDhtPartitionsSingleMessage [parts=null, partCntrs=null,
partsSizes=null, partHistCntrs=null, err=null, client=true, finishMsg=null,
activeQryTrackers=null, super=GridDhtPartitionsAbstractMessage
[exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion
*[topVer=1146,
minorTopVer=0]*, discoEvt=null, nodeId=5204d16d, evt=NODE_JOINED],
lastVer=GridCacheVersion [topVer=181162717, order=1569940014325,
nodeOrder=1144], super=GridCacheMessage [msgId=7894, depInfo=null,
err=null, skipPrepare=false
java.lang.NullPointerException
This is exactly the same problem as described in the ticket I mentioned in my
previous message.


Thu, 3 Oct 2019 at 15:04, Mahesh Renduchintala <
mahesh.renduchint...@aline-consulting.com>:

> Pavel, Thanks for your analysis. The two logs, that I attached, are those
> of two server data nodes (none are configured in thick client mode).
> The logs did show a server data node, losing connection and try to connect
> back to the other node (192.168.1.6)...
>
> On second thoughts, the below still makes sense.
> https://issues.apache.org/jira/browse/IGNITE-10010
>
> Please check.
>
>


Re: GridCachePartitionExchangeManager Null pointer exception

2019-10-03 Thread Pavel Kovalenko
Hi Mahesh,

Your problem is described here:
https://issues.apache.org/jira/browse/IGNITE-12255
The section starts with "This solution showed the existing race between
client node join and concurrent cache destroy."
According to your logs, I see a concurrent client node join and stop of the
caches "SQL_PUBLIC_INCOME_DATASET_MALLIKARJUNA" and "income_dataset_Mallikarjuna".
I think some of them are configured on the client node explicitly.

This problem is already fixed in an open-source fork of Ignite and will be
donated to Ignite soon.
As a workaround, I can suggest not declaring caches explicitly in the client
configuration. During the join-to-cluster process the client node will
receive all configured caches from the server nodes.
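
A minimal sketch of the workaround, assuming a plain Java client
configuration; the point is simply that no CacheConfiguration is declared on
the client side:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientWithoutDeclaredCaches {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);
        // Intentionally no cfg.setCacheConfiguration(...) here: the client
        // receives the descriptors of all existing caches from the servers on
        // join, so it never requests affinity for a cache it declared itself.
        try (Ignite client = Ignition.start(cfg)) {
            // Look caches up by name when needed, e.g. client.cache("someCache").
        }
    }
}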


Wed, 2 Oct 2019 at 12:17, Mahesh Renduchintala <
mahesh.renduchint...@aline-consulting.com>:

> This seems to be a new bug, and unrelated to IGNITE-10010.
> Both the nodes were fully operational when the null pointer exception
> happened.
> The logs show that and both the nodes crashed
>
> Can you give some insights into this, possible scenarios this could have
> led this?
> Is there any potential workaround?
>
>


Re: Using Ignite as blob store?

2019-08-23 Thread Pavel Kovalenko
Denis,

You can't set a page size greater than 16KB due to our page memory
limitations.
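
For reference, the page size is set on the data storage configuration; a
minimal sketch, assuming DataStorageConfiguration.setPageSize is used, with
16384 bytes (16KB) as the upper limit mentioned above:

import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PageSizeSketch {
    static IgniteConfiguration withMaxPageSize() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        // 16KB is the maximum; larger values are rejected by the page memory.
        storageCfg.setPageSize(16 * 1024);

        return new IgniteConfiguration().setDataStorageConfiguration(storageCfg);
    }
}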

Thu, 22 Aug 2019 at 22:34, Denis Magda :

> How about setting page size to more KBs or MBs based on the average value?
> That should work perfectly fine.
>
> -
> Denis
>
>
> On Thu, Aug 22, 2019 at 8:11 AM Shane Duan  wrote:
>
>> Thanks, Ilya. The blob size varies from a few KBs to a few MBs.
>>
>> Cheers,
>> Shane
>>
>>
>> On Thu, Aug 22, 2019 at 5:02 AM Ilya Kasnacheev <
>> ilya.kasnach...@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> How large are these blobs? Ignite is going to divide blobs into <4k
>>> chunks. We have no special optimizations for storing large key-value pairs.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> Thu, 22 Aug 2019 at 02:53, Shane Duan :
>>>
 Hi Igniters, is it a good idea to use Ignite(with persistence) as a
 blob store? I did run some testing with a small dataset, and it looks
 performing okay, even with a small off-heap mem for the data region.

 Thanks!

 Shane

>>>


Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition

2018-12-26 Thread Pavel Kovalenko
This sounds strange. There should definitely be a cause of such behaviour.
Rebalancing happens only after a topology change (node join/leave,
deactivation/activation).
Could you please share the logs from the node with the exception you mentioned
in your message, from the node with id "5423e6b5-c9be-4eb8-8f68-e643357ec2b3",
and from the coordinator (oldest) node (you can find this node by grepping
"crd=true" in the logs), so we can find the root cause of such behaviour?
Cache configurations / data storage configurations would also be very
useful for debugging.

1) If rebalancing didn't happen you should notice MOVING partitions in your
cache groups (from the metrics MxBeans or Visor). It is possible to write data
to such partitions and read from them (it depends on the PartitionLossPolicy
configured in your caches; see the sketch below). If you have at least 1 owner
(OWNING state) for each such replicated partition, there is no data loss. Such
MOVING partitions will be properly rebalanced after a node restart and the
data becomes consistent between primary and backup partitions.
2) If part*.bin files are corrupted you may notice it only during a node
restart, a subsequent cluster deactivation/activation, or if you have less
RAM than your data size and the node does page swapping (replacement) to/from
disk. In usual cluster life this is undetectable since all data is placed in
RAM.
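
A minimal configuration sketch for point 1, assuming the standard
CacheConfiguration API; the cache name and the chosen policy are illustrative:

import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

public class LossPolicySketch {
    static CacheConfiguration<Integer, String> cacheWithLossPolicy() {
        CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("myCache");
        // READ_WRITE_SAFE makes reads and writes fail for partitions that have
        // lost all owners, instead of silently serving possibly incomplete data.
        cfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);
        // Keep at least one backup so a single node loss leaves an OWNING copy.
        cfg.setBackups(1);
        return cfg;
    }
}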


Wed, 26 Dec 2018 at 13:44, aMark :

> Thanks Pavel for prompt response.
>
> I could confirm that node "5423e6b5-c9be-4eb8-8f68-e643357ec2b3" (and no
> other node in the cluster) did not go down, not sure how did stale data
> cropped up on few nodes.  And this type of exception is coming from every
> server node in the cluster.
>
> What happens if re-balancing did not happen properly due to this exception,
> could it lead to data loss ?
> does data get corrupted on the part*.bin files (in persistent store) in the
> Ignite cache due to this exception ?
>
> Thanks,
>
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition

2018-12-26 Thread Pavel Kovalenko
Hello,

It means that the node with id "5423e6b5-c9be-4eb8-8f68-e643357ec2b3" has
outdated data (possibly due to a restart) and started to rebalance the missed
updates from a node with up-to-date data (where you got the exception) using
the WAL.
WAL rebalance is used when the number of entries in some partition exceeds a
threshold controlled by the system property IGNITE_PDS_WAL_REBALANCE_THRESHOLD,
whose default value is 500k entries. WAL rebalance is very efficient when a
node has a lot of data and was down only for a short period.
Unfortunately this mechanism is currently unstable and may lead to errors such
as the one you noticed. Very few users have such an amount of persistent data
in 1 partition. There are a couple of tickets [1], [2], [3] which should be
fixed in the 2.8 release and will make it more robust.

To avoid this problem you should set the JVM system property
IGNITE_PDS_WAL_REBALANCE_THRESHOLD to some very high value (e.g. 2kk, i.e.
2,000,000) in all Ignite instances and perform a rolling restart. In this case
the default full rebalance will be used. It's a slower but more durable
approach (see the sketch below).

[1] https://issues.apache.org/jira/browse/IGNITE-8459
[2] https://issues.apache.org/jira/browse/IGNITE-8391
[3] https://issues.apache.org/jira/browse/IGNITE-10078
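
For reference, a sketch of how the property can be applied, either as a JVM
argument or programmatically before the node starts; the configuration path is
a placeholder and the 2,000,000 value follows the suggestion above:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class DisableWalRebalanceSketch {
    public static void main(String[] args) {
        // Equivalent to starting the JVM with
        // -DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=2000000.
        // Must be set before Ignition.start() so the node picks it up.
        System.setProperty("IGNITE_PDS_WAL_REBALANCE_THRESHOLD", "2000000");

        Ignite ignite = Ignition.start("server-config.xml");
    }
}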

Wed, 26 Dec 2018 at 11:19, aMark :

> Hi,
>
> We are using Ignite 2.6 as persistent store in Partitioned Mode having 12
> server node running in cluster, each node is running on different machine.
>
> There are around 48 client JVM as well which connect to cluster to fetch
> the
> data.
>
> Recently we have started getting following exception on server nodes
> (Though
> clients are still able to read/write data):
>
> 2018-12-25 02:59:48,423 ERROR
> [sys-#22846%a738c793-6e94-48cc-b6cf-d53ccab5f0fe%] {}
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier
> - Failed to send partition supply message to node:
> 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class
> org.apache.ignite.IgniteCheckedException: Could not find start pointer for
> partition [part=9, partCntrSince=484857]
> at
>
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:792)
> at
>
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:90)
> at
>
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.rebalanceIterator(IgniteCacheOffheapManagerImpl.java:893)
> at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:283)
> at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
> at
>
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
> at
>
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
> at
>
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
> at
>
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
> at
>
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
> at
>
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
> at
>
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
> at
>
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
> at
>
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
> at
>
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
> at
>
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
> at
>
> org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
>
> Does someone has any idea about the exception and possible resolution as
> well ?
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Full GC in client after cluster become unstable with "Unable to await partitions release latch within timeout"

2018-12-24 Thread Pavel Kovalenko
Hello,

Could you please attach additional logs from the coordinator node? You can
find the id of that node in the "Unable to await partitions release latch"
message.
Also, it would be good to have logs from the client machine and from any
other server node in the cluster.

Mon, 24 Dec 2018 at 09:13, aMark :

> Hi,
>
> We are using Ignite 2.6 as persistent store in Partitioned Mode having 6
> cluster node running, each node is running on different machine.
>
> We have noticed that all the server nodes were trying to rebalance due
> to
> 'too many dirty pages':
> 2018-12-22 14:56:17,161 INFO
> [db-checkpoint-thread-#104%d66a2109-94b4-4eb3-bb3c-e611aa842a2a%] {}
>
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager
> - Checkpoint started [
> checkpointId=cabfb20d-d53d-4dd5-8e2b-d270f062c010, startPtr=FileWALPointer
> [idx=5025, fileOff=65789181, len=5487], checkpointLockWait=34ms,
> checkpointLockHoldTime=20ms, walCpRecordFsyncDuration=68ms, pages=1
> 820016, reason='too many dirty pages']
>
> then I can see following log after a minute:
> 2018-12-22 14:57:26,040 WARN
> [exchange-worker-#102%d66a2109-94b4-4eb3-bb3c-e611aa842a2a%] {}
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
> - Unable to await partitions release latch within timeout: ClientLatch
>
>
> After this cluster became unresponsive for 30 minutes and timeout started
> happening while writing to cluster at the client end.
> But at the same time number of ignite related objects increased by many
> threshold over the time.
>
> Following is histogram from one of the client machine :
> 1: 67551977016212474480
> org.apache.ignite.internal.util.future.GridFutureAdapter$Node
> 2:   6637390 3837988712  [Ljava.lang.Object;
> 3:   6557471  262298840
> org.apache.ignite.internal.util.future.GridCompoundFuture
> 4:   6627708  159064992  java.util.ArrayList
> 5:177242   36609304  [B
>
> And eventually client machine went in Full GC mode.
> Following is the code to write in the ignite cache :
>  try(IgniteDataStreamer streamer =
> ignite.dataStreamer(cacheName)){
> igniteMap.forEach((key,value) -> streamer.addData(key, value));
>
>
> }catch(CacheException|IgniteInterruptedException|IllegalStateException|IgniteDataStreamerTimeoutException
> e){
> ignite.log().error("Entries not written to Ignite Cache, please
> check the logs.");
> throw new IgniteException(e);
>   }
>
>
> Any help will be much appreciated.
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: ZookeeperDiscovery block when communication error

2018-11-19 Thread Pavel Kovalenko
Hello Wangsan,

It seems to be a known issue: https://issues.apache.org/jira/browse/IGNITE-9493

Mon, 12 Nov 2018 at 18:06, wangsan :

> I have a server node in zone A, and I start a client from zone B. Access
> between A and B is controlled by a firewall: the ACL is that B can access A,
> but A cannot access B.
> So when the client in zone B joins the cluster, the communication will fail
> because of
> the firewall.
>
> But when the client in zone B is closed, the cluster crashes (it hangs on a
> new join, even from the same zone without a firewall). And when I restart the
> coordinator
> server (I started two servers in zone A), the other server will hang on
> communication.
>
> It looks like the whole cluster crashed when a node join failed because of the firewall.
>
> But when I used TcpDiscovery, I didn't see the cluster crash. I just saw some
> communication errors, and when a new node joined, it was still fine.
>
> Is this a ZookeeperDiscovery bug?
>
> The log is : zkcommuerror.log
> <
> http://apache-ignite-users.70518.x6.nabble.com/file/t1807/zkcommuerror.log>
>
>
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Long activation times with Ignite persistence enabled

2018-11-06 Thread Pavel Kovalenko
Hi Naveen and Andrey,

We've recently done a major optimization,
https://issues.apache.org/jira/browse/IGNITE-9420, that will speed up
activation time in your case.
Iteration over the WAL now happens only on node start-up, so it will not
affect activation anymore.
Partition state restoring (which is the slowest part of the activation
phase, as I see in the first message in the thread) was also optimized.
Now it is performed in parallel for each of the available cache groups.
The parallelism level of that operation is controlled by the system pool size.
If you have enough CPU cores on your machines (more than the number of
configured cache groups) you can increase the system pool size and your
activation time will be significantly improved.
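
A minimal sketch of the relevant setting, assuming
IgniteConfiguration.setSystemThreadPoolSize is the knob meant by "System Pool
size"; the value is illustrative:

import org.apache.ignite.configuration.IgniteConfiguration;

public class SystemPoolSketch {
    static IgniteConfiguration withLargerSystemPool() {
        // Raise the system pool above the number of configured cache groups
        // (and within the number of available CPU cores) so partition state
        // restoring can run for more cache groups in parallel.
        return new IgniteConfiguration().setSystemThreadPoolSize(32);
    }
}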

Tue, 6 Nov 2018 at 17:23, Naveen :

> Hi Denis
>
> We have already reduced the partition to 128, after which activation time
> has come down a bit.
>
> You were saying that, by reducing the partitions, it may lead to uneven
> distribution of data between nodes. Isn't it the same when we go for cache
> groups, group of caches will use the same resources /partitions, so here
> also resource contention may be there right ?? here also same set of
> partitions used by group of caches ?
> If we use cache group, partition size may grow very high since all the
> caches belong to that group will use the same set of partitions, does it
> have any negative effect on the cluster performance ??
>
>
>
> Thanks
> Naveen
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Handling split brain with Zookeeper and persistence

2018-09-17 Thread Pavel Kovalenko
Hello Eugene,

1) The split-brain resolver takes into account only server nodes (not clients).
There is no difference between in-memory-only nodes and nodes with persistence.
2) It's not necessary to immediately remove a node from the baseline topology
after a split-brain. If you have lost the backup factor for some partitions
(all partition owners, primary and backup, are in the shut-down part of the
cluster) and explicitly shrink the baseline topology, the affinity for all
partitions will be changed and you will lose the data contained on the
shut-down nodes.
When the split-brain has been resolved you may bring the shut-down nodes up and
join them to the existing baseline topology; after that the outdated data will
be rebalanced (see the sketch below).
In conclusion, you should operate carefully with the baseline and not change it
without necessity.
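
A minimal sketch of the baseline handling described above, assuming the
IgniteCluster baseline API and a placeholder configuration path:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class BaselineSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start("client-config.xml")) {
            // If the shut-down nodes were never removed from the baseline, no call
            // is needed: once they rejoin, their outdated data is rebalanced.
            // An explicit baseline change (to be done only when really necessary)
            // would look like this, re-applying the current server topology:
            long topVer = ignite.cluster().topologyVersion();
            ignite.cluster().setBaselineTopology(topVer);
        }
    }
}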

Sat, 15 Sep 2018 at 23:58, Gaurav Bajaj :

> Hello,
>
> For second question above, you can have listener to listen AFTER_NODE_STOP
> event and take action as per your logic (in your case changing BLT).
>
> Regards,
> Gaurav
>
>
> On 14-Sep-2018 9:33 PM, "eugene miretsky" 
> wrote:
>
> Hi,
>
> What are best practices for handling split brain with persistence?
>
> 1) Does Zookeeper split brain resolver consider all nodes as the same
> (client, memory only, persistent). Ideally, we want to shut down persistent
> nodes only as last resort.
> 2) If a persistent node is shut down, we need to remove it from baseline
> topology. Are there events we can subscribe to?
>
> Cheers,
> Eugene
>
>
>


Re: Partition map exchange in detail

2018-09-12 Thread Pavel Kovalenko
Eugene,

In the case where Zookeeper Discovery is enabled and there is a communication
problem between some nodes, a subset of the problem nodes will be automatically
killed to reach a cluster state where every node can communicate with every
other node without problems. So, you're absolutely right, dead nodes will be
removed from the cluster and will not participate in PME.
IEP-25 is trying to solve a more general problem, related only to PME.
Network problems are only part of what can happen during PME. A node
may break down before it even tried to send a message because of unexpected
exceptions (e.g. NullPointer, Runtime, Assertion, etc.). In general, IEP-25
tries to defend us from any kind of unexpected problem, to make sure that
PME will not be blocked in that case and the cluster will continue to live.


Wed, 12 Sep 2018 at 18:53, eugene miretsky :

> Hi Pavel,
>
> The issue we are discussing is PME failing because one node cannot
> communicate to another node, that's what IEP-25 is trying to solve. But in
> that case (where one node is either down, or there is a communication
> problem between two nodes) I would expect the split brain resolver to kick
> in, and shut down one of the nodes. I would also expect the dead node to be
> removed from the cluster, and no longer take part in PME.
>
>
>
> On Wed, Sep 12, 2018 at 11:25 AM Pavel Kovalenko 
> wrote:
>
>> Hi Eugene,
>>
>> Sorry, but I didn't catch the meaning of your question about Zookeeper
>> Discovery. Could you please re-phrase it?
>>
>> Wed, 12 Sep 2018 at 17:54, Ilya Lantukh :
>>
>>> Pavel K., can you please answer about Zookeeper discovery?
>>>
>>> On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky <
>>> eugene.miret...@gmail.com> wrote:
>>>
>>>> Thanks for the patience with my questions - just trying to understand
>>>> the system better.
>>>>
>>>> 3) I was referring to
>>>> https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
>>>> How come it doesn't get the node to shut down?
>>>> 4) Are there any docs/JIRAs that explain how counters are used, and why
>>>> they are required in the state?
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>>
>>>> On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh 
>>>> wrote:
>>>>
>>>>> 3) Such mechanics will be implemented in IEP-25 (linked above).
>>>>> 4) Partition map states include update counters, which are incremented
>>>>> on every cache update and play important role in new state calculation. 
>>>>> So,
>>>>> technically, every cache operation can lead to partition map change, and
>>>>> for obvious reasons we can't route them through coordinator. Ignite is a
>>>>> more complex system than Akka or Kafka and such simple solutions won't 
>>>>> work
>>>>> here (in general case). However, it is true that PME could be simplified 
>>>>> or
>>>>> completely avoid for certain cases and the community is currently working
>>>>> on such optimizations (
>>>>> https://issues.apache.org/jira/browse/IGNITE-9558 for example).
>>>>>
>>>>> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <
>>>>> eugene.miret...@gmail.com> wrote:
>>>>>
>>>>>> 2b) I had a few situations where the cluster went into a state where
>>>>>> PME constantly failed, and could never recover. I think the root cause 
>>>>>> was
>>>>>> that a transaction got stuck and didn't timeout/rollback.  I will try to
>>>>>> reproduce it again and get back to you
>>>>>> 3) If a node is down, I would expect it to get detected and the node
>>>>>> to get removed from the cluster. In such case, PME should not even be
>>>>>> attempted with that node. Hence you would expect PME to fail very rarely
>>>>>> (any faulty node will be removed before it has a chance to fail PME)
>>>>>> 4) Don't all partition map changes go through the coordinator? I
>>>>>> believe a lot of distributed systems work in this way (all decisions are
>>>>>> made by the coordinator/leader) - In Akka the leader is responsible for
>>>>>> making all cluster membership changes, in Kafka the controller does the
>>>>>> leader election.
>>>>>>
>>>>>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh 
>>>>>> wrote:

Re: a node fails and restarts in a cluster

2018-09-12 Thread Pavel Kovalenko
Hi Eugene,

I've reproduced your problem and filed a ticket for that:
https://issues.apache.org/jira/browse/IGNITE-9562

As a temporary workaround, I can suggest deleting the persistence data
(cache.dat and partition files) related to that cache in the starting node's
work directory, or not destroying caches unnecessarily while your baseline is
not complete.

Tue, 11 Sep 2018 at 16:50, es70 :

> Hi Pavel
>
> I've  prepared the logs you requested. Please download it from this link
>
> https://cloud.mail.ru/public/A9wK/bKGEXK397
>
> hope this will help
>
> regards,
> Evgeny
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Partition map exchange in detail

2018-09-12 Thread Pavel Kovalenko
Hi Eugene,

Sorry, but I didn't catch the meaning of your question about Zookeeper
Discovery. Could you please re-phrase it?

Wed, 12 Sep 2018 at 17:54, Ilya Lantukh :

> Pavel K., can you please answer about Zookeeper discovery?
>
> On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky <
> eugene.miret...@gmail.com> wrote:
>
>> Thanks for the patience with my questions - just trying to understand the
>> system better.
>>
>> 3) I was referring to
>> https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
>> How come it doesn't get the node to shut down?
>> 4) Are there any docs/JIRAs that explain how counters are used, and why
>> they are required in the state?
>>
>> Cheers,
>> Eugene
>>
>>
>> On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh 
>> wrote:
>>
>>> 3) Such mechanics will be implemented in IEP-25 (linked above).
>>> 4) Partition map states include update counters, which are incremented
>>> on every cache update and play important role in new state calculation. So,
>>> technically, every cache operation can lead to partition map change, and
>>> for obvious reasons we can't route them through coordinator. Ignite is a
>>> more complex system than Akka or Kafka and such simple solutions won't work
>>> here (in general case). However, it is true that PME could be simplified or
>>> completely avoid for certain cases and the community is currently working
>>> on such optimizations (https://issues.apache.org/jira/browse/IGNITE-9558
>>> for example).
>>>
>>> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <
>>> eugene.miret...@gmail.com> wrote:
>>>
 2b) I had a few situations where the cluster went into a state where
 PME constantly failed, and could never recover. I think the root cause was
 that a transaction got stuck and didn't timeout/rollback.  I will try to
 reproduce it again and get back to you
 3) If a node is down, I would expect it to get detected and the node to
 get removed from the cluster. In such case, PME should not even be
 attempted with that node. Hence you would expect PME to fail very rarely
 (any faulty node will be removed before it has a chance to fail PME)
 4) Don't all partition map changes go through the coordinator? I
 believe a lot of distributed systems work in this way (all decisions are
 made by the coordinator/leader) - In Akka the leader is responsible for
 making all cluster membership changes, in Kafka the controller does the
 leader election.

 On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh 
 wrote:

> 1) It is.
> 2a) Ignite has retry mechanics for all messages, including PME-related
> ones.
> 2b) In this situation PME will hang, but it isn't a "deadlock".
> 3) Sorry, I didn't understand your question. If a node is down, but
> DiscoverySpi doesn't detect it, it isn't PME-related problem.
> 4) How can you ensure that partition maps on coordinator are *latest 
> *without
> "freezing" cluster state for some time?
>
> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <
> eugene.miret...@gmail.com> wrote:
>
>> Thanks!
>>
>> We are using persistence, so I am not sure if shutting down nodes
>> will be the desired outcome for us since we would need to modify the
>> baseline topolgy.
>>
>> A couple more follow up questions
>>
>> 1) Is PME triggered when client nodes join as well? We are using the
>> Spark client, so new nodes are created/destroyed every time.
>> 2) It sounds to me like there is a potential for the cluster to get
>> into a deadlock if
>>a) a single PME message is lost (PME never finishes, there are no
>> retries, and all future operations are blocked on the pending PME)
>>b) one of the nodes has a long running/stuck pending operation
>> 3) Under what circumstances can PME fail, while DiscoverySpi fails to
>> detect the node being down? We are using ZookeeperSpi so I would expect
>> the
>> split brain resolver to shut down the node.
>> 4) Why is PME needed? Doesn't the coordinator know the latest
>> topology/partition map of the cluster through regular gossip?
>>
>> Cheers,
>> Eugene
>>
>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh 
>> wrote:
>>
>>> Hi Eugene,
>>>
>>> 1) PME happens when topology is modified (TopologyVersion is
>>> incremented). The most common events that trigger it are: node
>>> start/stop/fail, cluster activation/deactivation, dynamic cache 
>>> start/stop.
>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME
>>> are transferred using DiscoverySpi instead of CommunicationSpi.
>>> 3) All nodes wait for all pending cache operations to finish and
>>> then send their local partition maps to the coordinator (oldest node). 
>>> Then
>>> coordinator calculates new global partition maps and sends them to every
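
For reference, the topology events listed in point 1 of the reply above map onto ordinary Ignite API calls. A minimal Java sketch (the config path and cache name below are made up for illustration), where each call triggers its own partition map exchange:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class PmeTriggers {
    public static void main(String[] args) {
        // Node start: the join event bumps the major topology version -> PME.
        Ignite ignite = Ignition.start("config/example-ignite.xml");

        // Cluster activation (relevant with persistence enabled) -> PME.
        ignite.cluster().active(true);

        // Dynamic cache start -> PME (minor topology version is bumped).
        IgniteCache<Integer, String> cache = ignite.createCache("dynamic-cache");
        cache.put(1, "value");

        // Dynamic cache stop -> PME.
        ignite.destroyCache("dynamic-cache");

        // Node stop: the remaining nodes run a PME for the node-left event.
        Ignition.stop(false);
    }
}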

Re: a node fails and restarts in a cluster

2018-09-07 Thread Pavel Kovalenko
Hello Evgeny,

Could you please attach full logs from both nodes in your case #2? Make
sure that quiet mode is disabled (-DIGNITE_QUIET=false) so that we get full
INFO logs.

пт, 7 сент. 2018 г. в 17:41, es70 :

> I have a cluster of 2 Ignite (version 2.6) nodes with persistence enabled
> (at the time of writing running on my Windows machine for test purposes) and a
> third node (not in the cluster) which I use to run my app in thick
> client mode. The app creates a cache (CacheMode.REPLICATED,
> CacheAtomicityMode.TRANSACTIONAL, CacheWriteSynchronizationMode.FULL_SYNC)
> to store messages being exchanged between the integrated systems.
>
> I start the message exchange and bring down one of the nodes in the
> cluster (the app shows some exceptions but continues working).
>
> I have two scenarios for getting the downed node back online:
> 1. I start the downed node immediately after downing it.
> 2. I start the downed node some time (half an hour or so) after downing
> it.
>
> In case #1 the node gets back into the cluster and starts working.
> In case #2 the node refuses to get back into the cluster and I have to
> restart the other node to get the cluster to the active state again. Besides,
> I cannot connect to the cluster (with one node left) with my app or
> ignitevisorcmd.bat before restarting all the nodes:
>
> java.lang.NullPointerException
> at
>
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.clientTopology(GridCachePartitionExchangeManager.java:783)
> at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.updatePartitionSingleMap(GridDhtPartitionsExchangeFuture.java:3204)
> at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processSingleMessage(GridDhtPartitionsExchangeFuture.java:2186)
> at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$100(GridDhtPartitionsExchangeFuture.java:127)
> at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2061)
> at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2049)
> at
>
> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:383)
> at
>
> org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:353)
> at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2049)
>
> apache-ignite-2.6.0>bin/control --baseline
> Control utility [ver. 2.6.0#20180710-sha1:669feacc]
> 2018 Copyright(C) Apache Software Foundation
> User: 
> --
> Connection to cluster failed.
> Error: Latest topology update failed.
>
> Regards
> Evgeny
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Proper config for IGFS eviction

2018-08-10 Thread Pavel Kovalenko
Hello Engrdean,

You should enable persistence on your DataRegionConfiguration to make it
possible to evict file metadata pages from memory to disk.
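
As a rough sketch of what that means (the region name and size below are arbitrary and not taken from your setup), the programmatic equivalent of a persistence-enabled data region looks roughly like this:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PersistentRegionSketch {
    public static void main(String[] args) {
        // A data region whose cold pages can be evicted to disk instead of
        // failing with IgniteOutOfMemoryException.
        DataRegionConfiguration regionCfg = new DataRegionConfiguration()
            .setName("igfs-data-region")          // arbitrary name for this sketch
            .setMaxSize(14L * 1024 * 1024 * 1024) // 14 GiB; adjust to your server
            .setPersistenceEnabled(true);

        DataStorageConfiguration storageCfg = new DataStorageConfiguration()
            .setDataRegionConfigurations(regionCfg);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storageCfg);

        Ignition.start(cfg);
    }
}

The caches backing IGFS then have to reference that region (for example via CacheConfiguration.setDataRegionName), otherwise they stay in a non-persistent region.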

2018-08-09 19:49 GMT+03:00 engrdean :

> I've been struggling to find a configuration that works successfully for
> IGFS
> with hadoop filesystem caching.  Anytime I attempt to load more data than
> what will fit into memory on my Ignite node, the ignite process crashes.
>
> The behavior I am looking for is that old cache entries will be evicted
> when
> I try to write new data to IGFS that exceeds the available memory on the
> server.  I can see that my data is being persisted into HDFS, but I seem to
> be limited to the amount of physical memory on my Ignite server at the
> moment.  I am using the teragen example to generate the files on hadoop for
> the purposes of this test like so:
>
> time hadoop-ig jar
> /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
> teragen 1 igfs://i...@myserver.com/tmp/output1
>
> If I have systemRegionMaxSize set to a value less than the physical memory
> on my ignite server, then the message is something like this:
>
> /class org.apache.ignite.internal.mem.IgniteOutOfMemoryException: Out of
> memory in data region [name=sysMemPlc, initSize=1.0 GiB, maxSize=14.0 GiB,
> persistenceEnabled=false] Try the following:
>   ^-- Increase maximum off-heap memory size
> (DataRegionConfiguration.maxSize)
>   ^-- Enable Ignite persistence (DataRegionConfiguration.
> persistenceEnabled)
>   ^-- Enable eviction or expiration policies
> /
> If I increase the systemRegionMaxSize to a value greater than the physical
> memory on my ignite server, the message is something like this:
>
> /[2018-08-09 12:16:08,174][ERROR][igfs-#171][GridNearTxLocal] Heuristic
> transaction failure.
> class
> org.apache.ignite.internal.transactions.IgniteTxHeuristicCheckedException:
> Failed to locally write to cache (all transaction entries will be
> invalidated, however there was a window when entries for this transaction
> were visible to others): GridNearTxLocal [mappings=IgniteTxMappingsImpl [],
> nearLocallyMapped=false, colocatedLocallyMapped=true, needCheckBackup=null,
> hasRemoteLocks=false, trackTimeout=false, lb=null, thread=igfs-#171,
> mappings=IgniteTxMappingsImpl [], super=GridDhtTxLocalAdapter
> [nearOnOriginatingNode=false, nearNodes=[], dhtNodes=[],
> explicitLock=false,
> super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false,
> depEnabled=false, txState=IgniteTxStateImpl [activeCacheIds=[-313790114],
> recovery=false, txMap=[IgniteTxEntry [key=KeyCacheObjectImpl [part=504,
> val=IgfsBlockKey [fileId=c976b6f1561-689b0ba5-6920-4b52-a614-c2360d0acff4,
> blockId=52879, affKey=null, evictExclude=true], hasValBytes=true],
> cacheId=-313790114, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=504,
> val=IgfsBlockKey [fileId=c976b6f1561-689b0ba5-6920-4b52-a614-c2360d0acff4,
> blockId=52879, affKey=null, evictExclude=true], hasValBytes=true],
> cacheId=-313790114], val=[op=CREATE, val=CacheObjectByteArrayImpl
> [arrLen=65536]], prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null],
> entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
> explicitVer=null, dhtVer=null, filters=[], filtersPassed=false,
> filtersSet=true, entry=GridDhtCacheEntry [rdrs=[], part=504,
> super=GridDistributedCacheEntry [super=GridCacheMapEntry
> [key=KeyCacheObjectImpl [part=504, val=IgfsBlockKey
> [fileId=c976b6f1561-689b0ba5-6920-4b52-a614-c2360d0acff4, blockId=52879,
> affKey=null, evictExclude=true], hasValBytes=true], val=null,
> startVer=1533830728270, ver=GridCacheVersion [topVer=145310277,
> order=1533830728270, nodeOrder=1], hash=-915370253,
> extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc
> [locs=[GridCacheMvccCandidate [nodeId=6ed33eb9-2103-402c-
> afab-a415c8f08f2f,
> ver=GridCacheVersion [topVer=145310277, order=1533830728268, nodeOrder=1],
> threadId=224, id=258264, topVer=AffinityTopologyVersion [topVer=1,
> minorTopVer=0], reentry=null,
> otherNodeId=6ed33eb9-2103-402c-afab-a415c8f08f2f,
> otherVer=GridCacheVersion
> [topVer=145310277, order=1533830728268, nodeOrder=1], mappedDhtNodes=null,
> mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl
> [part=504, val=IgfsBlockKey
> [fileId=c976b6f1561-689b0ba5-6920-4b52-a614-c2360d0acff4, blockId=52879,
> affKey=null, evictExclude=true], hasValBytes=true],
> masks=local=1|owner=1|ready=1|reentry=0|used=0|tx=1|single_
> implicit=0|dht_local=1|near_local=0|removed=0|read=0,
> prevVer=GridCacheVersion [topVer=145310277, order=1533830728268,
> nodeOrder=1], nextVer=GridCacheVersion [topVer=145310277,
> order=1533830728268, nodeOrder=1]]], rmts=null]], flags=2]]], prepared=1,
> locked=false, nodeId=6ed33eb9-2103-402c-afab-a415c8f08f2f,
> locMapped=false,
> expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0,
> serReadVer=null, xidVer=GridCacheVersion [topVer=145310277,
> order=1533830728268, nodeOrder=1]], 

Re: Optimum persistent SQL storage and querying strategy

2018-08-08 Thread Pavel Kovalenko
Hello Jose,

Did you consider Mongo DB for your use case?

2018-08-08 10:13 GMT+03:00 joseheitor :

> Hi Ignite Team,
>
> Any tips and recommendations...?
>
> Jose
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

2018-08-02 Thread Pavel Kovalenko
Hello Ray,

I'm glad that your problem was resolved. I just want to add that at the
beginning phase of PME we wait for all current client operations to finish;
new operations are frozen until the PME ends. After a node finishes all ongoing
client operations, it counts down the latch that you see in the logs in the
"Unable to await" message. When all nodes finish all their operations, the
exchange latch completes and the PME continues. This latch was added to reach
data consistency on all nodes during the main PME phase (partition information
exchange, affinity calculation, etc.). If you have network throttling
between client and server, it becomes hard to notify a client that
its data streamer operation has finished, and the latch completion process is
slowed down.

2018-08-02 12:11 GMT+03:00 Ray :

> The root cause for this issue is the network throttle between client and
> servers.
>
> When I move the clients to run in the same cluster as the servers, there's
> no such problem any more.
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

2018-07-26 Thread Pavel Kovalenko
Hello Ray,

Without explicit errors in the log, it's not easy to guess what that was.
Because I don't see any errors, it should be a recoverable failure (even if
it takes a long time).
If you have the option, could you please enable the DEBUG log level
for org.apache.ignite.internal.util.nio.GridTcpNioCommunicationClient
and org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi on the server
nodes?
If such a long PME happens again and again, debug logs from these
classes will give us a lot of useful information to find the exact cause of
such a long communication process.

If a client node has a stable connection to the cluster, it should wait for
the PME until its end. My message about reconnecting was mostly about the case
where the client connection to the cluster breaks.
But if the client still doesn't send any data after the PME ends, a thread dump
from the client will be very useful for analyzing why that happened.
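
For the DEBUG log level mentioned above, the usual way is to add loggers for those two classes to the log4j2 configuration your nodes reference (e.g. config/ignite-log4j2.xml). If doing it from code is easier, a minimal sketch along these lines should also work, assuming the server nodes use the ignite-log4j2 module:

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

public class CommunicationDebugLogs {
    /** Call this in the server node's JVM before Ignition.start(...). */
    public static void enable() {
        // Raise the level only for the two communication classes so the
        // rest of the log stays readable.
        Configurator.setLevel(
            "org.apache.ignite.internal.util.nio.GridTcpNioCommunicationClient", Level.DEBUG);
        Configurator.setLevel(
            "org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi", Level.DEBUG);
    }
}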

2018-07-26 18:36 GMT+03:00 Ray :

> Hello Pavel,
>
> Thanks for the explanation, it's been great help.
>
> Can you take a guess why the PME took such a long time due to communication
> issues between server nodes?
> From the logs, the "no route to host" exception happened because the server
> can't connect to the client's ports.
> But I didn't see any logs indicating network issues between the server
> nodes.
> I tested connectivity of the communication SPI ports (47100 in this case) and
> the discovery SPI ports (49500 in this case) between server nodes; it's all
> good.
>
> And on the client (Spark executor) side, there's no exception log when PME
> takes a long time to finish.
> It will hang forever.
> Spark.log
> 
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

2018-07-26 Thread Pavel Kovalenko
Hello Ray,

It's hard to say whether the issue you mentioned is the cause of your problem.
To determine that, it would be very helpful if you could capture thread dumps
on the next such network glitch from both the server and client nodes (using
jstack, for example).
I'm not aware of the Ignite Spark DataFrames implementation details, but in
general, any node join/leave makes the Ignite cluster trigger a process named
PME (Partition Map Exchange).
During this process, all write operations to the cluster are frozen until the
PME ends. In your case, the PME took a long time due to communication issues
between server nodes (the server nodes were unable to send an acknowledgment
to each other to continue the PME, while Discovery worked well, which is a bit
strange). This explains why you didn't see updates.
In case of a client connection exception, a client node should try to
reconnect to another server node and complete its data write futures with an
exception, so the Spark executor which uses the client node to stream data to
Ignite should catch this network exception, reconnect, and retry writing the
data batch.

For more details about the Spark DataFrames implementation in Ignite, you may
ask Nikolay Izhikov (I added his email as a recipient of this letter).
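
To illustrate that last point, here is a rough sketch of a retrying batch writer (the cache name, batch shape and retry policy are made up, not taken from the Ignite Spark integration):

import java.util.Map;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;

public class RetryingBatchWriter {
    /** Writes one batch through a data streamer, retrying the whole batch on failure. */
    static void writeBatch(Ignite client, Map<Long, String> batch) throws InterruptedException {
        int attempts = 0;
        while (true) {
            try (IgniteDataStreamer<Long, String> streamer = client.dataStreamer("myCache")) {
                streamer.allowOverwrite(true); // re-writing the same keys on retry is then harmless
                for (Map.Entry<Long, String> e : batch.entrySet())
                    streamer.addData(e.getKey(), e.getValue());
                streamer.flush(); // blocks until buffered data is written or fails
                return;           // batch written successfully
            }
            catch (Exception e) {
                if (++attempts >= 5)
                    throw new IllegalStateException("Batch failed after " + attempts + " attempts", e);
                Thread.sleep(1_000L * attempts); // back off and give the client time to reconnect
            }
        }
    }
}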


2018-07-26 5:50 GMT+03:00 Ray :

> Hello Pavel,
>
> Here's the log for for node ids = [429edc2b-eb14-414f-a978-9bfe35443c8c,
> 6783732c-9a13-466f-800a-ad4c8d9be3bf].
> 6783732c-9a13-466f-800a-ad4c8d9be3bf.zip
>  t1346/6783732c-9a13-466f-800a-ad4c8d9be3bf.zip>
> 429edc2b-eb14-414f-a978-9bfe35443c8c.zip
>  t1346/429edc2b-eb14-414f-a978-9bfe35443c8c.zip>
> I examined the logs and looks like there's a network issue here because
> there's a lot of "java.net.NoRouteToHostException: No route to host"
> exception.
>
> I did a little research and found this ticket may be the cause.
> https://issues.apache.org/jira/browse/IGNITE-8739
>
> Will the client(spark executor in this case) retry data insert if I apply
> this patch when the network glitch is resolved?
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

2018-07-25 Thread Pavel Kovalenko
Hello Ray,

According to your attached log, it seems that you have some network
problems. Could you please also share logs from nodes with temporary ids =
[429edc2b-eb14-414f-a978-9bfe35443c8c, 6783732c-9a13-466f-800a-ad4c8d9be3bf].
The root cause should be on those nodes.

2018-07-25 13:03 GMT+03:00 Ray :

> I have a three node Ignite 2.6 cluster setup with the following config.
>
>  class="org.apache.ignite.configuration.IgniteConfiguration">
> 
> 
> 
> 
>  class="org.apache.ignite.configuration.DataStorageConfiguration">
> 
>
> 
> 
> 
>  class="org.apache.ignite.configuration.DataRegionConfiguration">
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
> 
> 
>  class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
>
> 
> 
> node1:49500
> node2:49500
> node3:49500
> 
> 
> 
> 
> 
> 
> 
> 
>  value="config/ignite-log4j2.xml"/>
> 
> 
> 
> 
>
> And I used this command to start Ignite service on three nodes.
>
> ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC
> -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch
> -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps
> -J-XX:+PrintAdaptiveSizePolicy -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCApplicationConcurrentTime
> -J-Xloggc:/spare/ignite/log/ignitegc-$(date +%Y_%m_%d-%H_%M).log
> config/persistent-config.xml
>
> When I'm using Spark dataframe API to ingest data into this cluster, the
> cluster freezes after some time and no new data can be ingested into
> Ignite.
> Both the client(spark executor) and server are showing the "Unable to await
> partitions release latch within timeout: ServerLatch" exception starts from
> line 51834 in full log like this
>
> [2018-07-25T09:45:42,177][WARN
> ][exchange-worker-#162][GridDhtPartitionsExchangeFuture] Unable to await
> partitions release latch within timeout: ServerLatch [permits=2,
> pendingAcks=[429edc2b-eb14-414f-a978-9bfe35443c8c,
> 6783732c-9a13-466f-800a-ad4c8d9be3bf], super=CompletableLatch
> [id=exchange, topVer=AffinityTopologyVersion [topVer=239, minorTopVer=0]]]
>
> Here's the full log on server node having the exception.
> 07-25.zip
> 
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Ignite 2.5 uncatched BufferUnderflowException while reading WAL on startup

2018-06-15 Thread Pavel Kovalenko
David,

No, this problem exists in older versions as well.

пт, 15 июн. 2018 г. в 17:54, David Harvey :

> Is https://issues.apache.org/jira/browse/IGNITE-8780 a regression in 2.5 ?
>
> On Thu, Jun 14, 2018 at 7:03 AM, Pavel Kovalenko 
> wrote:
>
>> DocDVZ,
>>
>> Most probably you have faced the following issue:
>> https://issues.apache.org/jira/browse/IGNITE-8780.
>> You can try to remove the END file marker; in this case the node will be
>> recovered using the WAL.
>>
>> чт, 14 июн. 2018 г. в 12:00, DocDVZ :
>>
>>> As I see, the last checkpoint-end file, the one that triggered the problem,
>>> was created but not filled with data:
>>>
>>>
>>> /opt/penguin/ignite/apache-ignite-fabric-2.5.0-bin/work/db/node00-203cc00d-0935-450d-acc9-d59cc3d2163d/cp$
>>> ls -lah
>>> total 208K
>>> drwxr-xr-x 2 penguin penguin 4.0K Jun  9 12:52 .
>>> drwxr-xr-x 3 penguin root4.0K Jun  9 14:27 ..
>>> <...>
>>> -rw-r--r-- 1 penguin penguin   16 Jun  9 12:49
>>> 1528537756580-node-started.bin
>>> -rw-r--r-- 1 penguin penguin0 Jun  9 12:52
>>> 1528537928225-a5e78b9c-26c8-4dc1-b554-2ee35f119f0a-END.bin
>>> -rw-r--r-- 1 penguin penguin   16 Jun  9 12:52
>>> 1528537928225-a5e78b9c-26c8-4dc1-b554-2ee35f119f0a-START.bin
>>>
>>> What state does Ignite checkpointing have at that moment? Is it safe for
>>> the node's persistent data to remove that empty file?
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>>
>>
>
>
> *Disclaimer*
>
> The information contained in this communication from the sender is
> confidential. It is intended solely for use by the recipient and others
> authorized to receive it. If you are not the recipient, you are hereby
> notified that any disclosure, copying, distribution or taking action in
> relation of the contents of this information is strictly prohibited and may
> be unlawful.
>
> This email has been scanned for viruses and malware, and may have been
> automatically archived by *Mimecast Ltd*, an innovator in Software as a
> Service (SaaS) for business. Providing a *safer* and *more useful* place
> for your human generated data. Specializing in; Security, archiving and
> compliance. To find out more Click Here
> <http://www.mimecast.com/products/>.
>


Re: Ignite 2.5 uncatched BufferUnderflowException while reading WAL on startup

2018-06-14 Thread Pavel Kovalenko
DocDVZ,

Most probably you have faced the following issue:
https://issues.apache.org/jira/browse/IGNITE-8780.
You can try to remove the END file marker; in this case the node will be
recovered using the WAL.
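
If it helps, here is a minimal sketch of that clean-up. The checkpoint directory path is only a placeholder for your node's work/db/<node-folder>/cp directory; stop the node first and keep a backup of anything you delete:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RemoveEmptyCheckpointEndMarker {
    public static void main(String[] args) throws IOException {
        // Placeholder: point this at the stopped node's checkpoint directory.
        Path cpDir = Paths.get("work/db/<node-folder>/cp");

        try (DirectoryStream<Path> markers = Files.newDirectoryStream(cpDir, "*-END.bin")) {
            for (Path marker : markers) {
                // Only touch markers that are empty, i.e. the checkpoint end
                // record was never written (the situation described above).
                if (Files.size(marker) == 0) {
                    System.out.println("Deleting empty checkpoint END marker: " + marker);
                    Files.delete(marker);
                }
            }
        }
    }
}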

чт, 14 июн. 2018 г. в 12:00, DocDVZ :

> As I see, the last checkpoint-end file, the one that triggered the problem,
> was created but not filled with data:
>
>
> /opt/penguin/ignite/apache-ignite-fabric-2.5.0-bin/work/db/node00-203cc00d-0935-450d-acc9-d59cc3d2163d/cp$
> ls -lah
> total 208K
> drwxr-xr-x 2 penguin penguin 4.0K Jun  9 12:52 .
> drwxr-xr-x 3 penguin root4.0K Jun  9 14:27 ..
> <...>
> -rw-r--r-- 1 penguin penguin   16 Jun  9 12:49
> 1528537756580-node-started.bin
> -rw-r--r-- 1 penguin penguin0 Jun  9 12:52
> 1528537928225-a5e78b9c-26c8-4dc1-b554-2ee35f119f0a-END.bin
> -rw-r--r-- 1 penguin penguin   16 Jun  9 12:52
> 1528537928225-a5e78b9c-26c8-4dc1-b554-2ee35f119f0a-START.bin
>
> What state does Ignite checkpointing have at that moment? Is it safe for
> the node's persistent data to remove that empty file?
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Ignite 2.5 uncatched BufferUnderflowException while reading WAL on startup

2018-06-09 Thread Pavel Kovalenko
Hello DocDVZ,

What is your hardware environment? Do you use an external / network storage
device?

2018-06-09 15:14 GMT+03:00 DocDVZ :

> Raw text blocks were discarded from message:
> Service parameters:
> ignite.sh -J-Xmx6g -J-Xms6g -J-XX:+AlwaysPreTouch -J-XX:+UseG1GC
> -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC
> ${IGNITE_HOME}/config/test-ignite-config.xml
>
> Configurations:
> 
>
> 
>
>
> 
>  class="org.apache.ignite.configuration.DataStorageConfiguration">
> 
>  class="org.apache.ignite.configuration.DataRegionConfiguration">
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  class="org.apache.ignite.configuration.DataRegionConfiguration">
>  value="true"/>
> 
> 
> 
> 
> 
> 
> 
> 
>
> 
> 
>
>  class="org.apache.ignite.configuration.CacheConfiguration">
>  value="CLIENT_PROFILE_CACHE_0001"/>
> 
> 
>  value="READ_ONLY_SAFE"/>
> 
>  value="PRIMARY_SYNC"/>
>  value="penguin-region"/>
> 
>
> 
> 
>
> 
>  class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
> 
> 
>  class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.
> TcpDiscoveryVmIpFinder">
> 
> 
> 127.0.0.1:47000..47500
> 
> 
> 
> 
> 
> 
>
> 
>  class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
> 
> 
> 
> 
>
> Log:
> [14:14:15]__  
> [14:14:15]   /  _/ ___/ |/ /  _/_  __/ __/
> [14:14:15]  _/ // (7 7// /  / / / _/
> [14:14:15] /___/\___/_/|_/___/ /_/ /___/
> [14:14:15]
> [14:14:15] ver. 2.5.0#20180523-sha1:86e110c7
> [14:14:15] 2018 Copyright(C) Apache Software Foundation
> [14:14:15]
> [14:14:15] Ignite documentation: http://ignite.apache.org
> [14:14:15]
> [14:14:15] Quiet mode.
> [14:14:15]   ^-- Logging to file
> '/opt/penguin/ignite/apache-ignite-fabric-2.5.0-bin/work/
> log/ignite-88833562.0.log'
> [14:14:15]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
> [14:14:15]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false
> or "-v" to ignite.{sh|bat}
> [14:14:15]
> [14:14:15] OS: Linux 4.9.0-6-amd64 amd64
> [14:14:15] VM information: Java(TM) SE Runtime Environment 1.8.0_171-b11
> Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 25.171-b11
> [14:14:15] Configured plugins:
> [14:14:15]   ^-- None
> [14:14:15]
> [14:14:15] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler
> [tryStop=false, timeout=0]]
> [14:14:15] Message queue limit is set to 0 which may lead to potential
> OOMEs
> when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to
> message queues growth on sender and receiver sides.
> [14:14:15] Security status [authentication=off, tls/ssl=off]
> [14:14:16,448][SEVERE][main][IgniteKernal] Exception during start
> processors, node will be stopped and close connections
> java.nio.BufferUnderflowException
> at java.nio.Buffer.nextGetIndex(Buffer.java:506)
> at java.nio.HeapByteBuffer.getLong(HeapByteBuffer.java:412)
> at
> org.apache.ignite.internal.processors.cache.persistence.
> GridCacheDatabaseSharedManager.readPointer(GridCacheDatabaseSharedManager
> .java:1915)
> at
> org.apache.ignite.internal.processors.cache.persistence.
> GridCacheDatabaseSharedManager.readCheckpointStatus(
> GridCacheDatabaseSharedManager.java:1892)
> at
> org.apache.ignite.internal.processors.cache.persistence.
> GridCacheDatabaseSharedManager.readMetastore(
> GridCacheDatabaseSharedManager.java:565)
> at
> org.apache.ignite.internal.processors.cache.persistence.
> GridCacheDatabaseSharedManager.start0(GridCacheDatabaseSharedManager
> .java:525)
> at
> org.apache.ignite.internal.processors.cache.GridCacheSharedManagerAdapter.
> start(GridCacheSharedManagerAdapter.java:61)
> at
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.start(
> GridCacheProcessor.java:700)
> at
> org.apache.ignite.internal.IgniteKernal.startProcessor(
> IgniteKernal.java:1738)
> at org.apache.ignite.internal.IgniteKernal.start(
> IgniteKernal.java:985)
> at
> 

Re: Apache Ignite application deploy without rebalancing

2018-04-25 Thread Pavel Kovalenko
Hello,

Most probably no actual rebalancing is started and we fire the
REBALANCE_STARTED event ahead of time. Could you please turn on the INFO log
level for Ignite classes and check whether a "Skipping rebalancing" message
appears in the logs after the node shutdown?

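If it is more convenient than scanning the logs, the rebalance events can also be observed programmatically. A small sketch (it assumes the rebalance event types are enabled via IgniteConfiguration.setIncludeEventTypes, which may not be the case in your config):

import org.apache.ignite.Ignite;
import org.apache.ignite.events.CacheRebalancingEvent;
import org.apache.ignite.events.EventType;

public class RebalanceEventProbe {
    /** Prints every rebalance start/stop event fired on the local node. */
    static void listen(Ignite ignite) {
        ignite.events().localListen(evt -> {
            CacheRebalancingEvent e = (CacheRebalancingEvent)evt;
            System.out.println("Rebalance event: " + e.name() + " for cache " + e.cacheName());
            return true; // keep the listener registered
        }, EventType.EVT_CACHE_REBALANCE_STARTED, EventType.EVT_CACHE_REBALANCE_STOPPED);
    }
}
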
2018-04-25 7:55 GMT+03:00 moon-duck :

> Unfortunately it does not work T^T
>
> I tried 2 ways as below:
> 1. I set the region name in the cache configuration -> did not work
> 2. I set the DefaultDataRegionConfiguration without a regionName -> did not work
>
> The log below is from when I set the DefaultDataRegionConfiguration without any
> other region.
>
> *) when loading*
>
> 2018-04-25 13:32:44.887  WARN 5837 --- [   main]
> o.a.i.s.c.tcp.TcpCommunicationSpi: Message queue limit is set to 0
> which may lead to potential OOMEs when running cache operations in
> FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and
> receiver sides.
> [13:32:44] Message queue limit is set to 0 which may lead to potential
> OOMEs
> when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to
> message queues growth on sender and receiver sides.
> 2018-04-25 13:32:45.076  WARN 5837 --- [   main]
> o.a.i.s.c.noop.NoopCheckpointSpi : Checkpoints are disabled (to
> enable configure any GridCheckpointSpi implementation)
> "/data/log/catalina/localhost/catalina.out" 193L, 26597C
> [13:37:39] Data Regions Configured:
> [13:37:39]   ^-- default [initSize=2.0 GiB, maxSize=2.0 GiB,
> persistenceEnabled=true]
> 2018-04-25 13:37:39.744  INFO 6006 --- [   main]
> s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped
> "{[/ready/deploy/group],methods=[POST]}" onto public java.lang.Object
> cc.platform.expresso.storage.controller.ManageController.
> clusterBaselineTopologying()
> 2018-04-25 13:37:39.745  INFO 6006 --- [   main]
> s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped
> "{[/cluster/activation],methods=[POST]}" onto public void
> cc.platform.expresso.storage.controller.ManageController.activeCluster()
> 2018-04-25 13:37:39.746  INFO 6006 --- [   main]
> s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped
> "{[/version],methods=[GET]}" onto public java.lang.Object
> cc.platform.expresso.storage.controller.ManageController.version()
> 2018-04-25 13:37:39.768  INFO 6006 --- [   main]
> s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped
> "{[/error],produces=[text/html]}" onto public
> org.springframework.web.servlet.ModelAndView
> org.springframework.boot.autoconfigure.web.servlet.
> error.BasicErrorController.errorHtml(javax.servlet.http.
> HttpServletRequest,javax.servlet.http.HttpServletResponse)
> 2018-04-25 13:37:39.782  INFO 6006 --- [   main]
> s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped "{[/error]}" onto public
> org.springframework.http.ResponseEntity java.lang.Object>>
> org.springframework.boot.autoconfigure.web.servlet.
> error.BasicErrorController.error(javax.servlet.http.HttpServletRequest)
> 2018-04-25 13:37:40.035  INFO 6006 --- [   main]
> s.w.s.m.m.a.RequestMappingHandlerAdapter : Looking for @ControllerAdvice:
> org.springframework.boot.web.servlet.context.
> AnnotationConfigServletWebServerApplicationContext@5ecddf8f:
> startup date [Wed Apr 25 13:37:28 KST 2018]; root of context hierarchy
> 2018-04-25 13:37:40.681  INFO 6006 --- [   main]
> o.s.j.d.e.EmbeddedDatabaseFactory: Starting embedded database:
> url='jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=false',
> username='sa'
> 2018-04-25 13:37:42.827  INFO 6006 --- [   main]
> o.s.j.e.a.AnnotationMBeanExporter: Registering beans for JMX
> exposure on startup
> 2018-04-25 13:37:42.856  INFO 6006 --- [   main]
> o.s.j.e.a.AnnotationMBeanExporter: Bean with name 'startNode' has
> been autodetected for JMX exposure
> 2018-04-25 13:37:42.864  INFO 6006 --- [   main]
> o.s.j.e.a.AnnotationMBeanExporter: Located MBean 'startNode':
> registering with JMX server as MBean
> [org.apache.ignite.internal:name=startNode,type=IgniteKernal]
> 2018-04-25 13:37:43.261  INFO 6006 --- [   main]
> o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080
> (http) with context path ''
> 2018-04-25 13:37:43.277  INFO 6006 --- [   main]
> c.p.expresso.storage.StorageApplication  : Started StorageApplication in
> 16.334 seconds (JVM running for 18.209)
> [13:37:52] Topology snapshot [ver=2, servers=2, clients=0, CPUs=2,
> offheap=4.0GB, heap=2.0GB]
> [13:37:52] Data Regions Configured:
> [13:37:52]   ^-- default [initSize=2.0 GiB, maxSize=2.0 GiB,
> persistenceEnabled=true]
> 2018-04-25 13:37:52.044  INFO 6006 --- [vent-worker-#35]
> c.p.e.s.c.c.EventListenerConfigurator: Event - time :
> 2018-04-25T13:37:52.039, name : NODE_JOINED, type : 10, msg: Node joined:
> TcpDiscoveryNode [id=6a61cd48-e111-456f-9376-e3198323fa83,
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.16.135.35],
>