[jira] [Commented] (IGNITE-5631) Can't write value greater than wal segment

2021-12-01 Thread Joel Lang (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452013#comment-17452013
 ] 

Joel Lang commented on IGNITE-5631:
---

Is this going to be addressed or fixed in any way? I've run into this issue 
myself and I don't recall any documentation about values being limited by the 
WAL like this.

The problem also occurred during a transaction, so it's not just single values: 
the combined size of a transaction also cannot exceed the WAL segment size.
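
For context, here is a minimal sketch of the reproduction described in the issue, assuming a recent Ignite version with {{DataStorageConfiguration}} (the cache name and value construction are illustrative, not the original test; some versions may enforce a larger minimum segment size than 256 KB):
{code:java}
import java.util.Arrays;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalSegmentRepro {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration()
            .setWalSegmentSize(256 * 1024); // 256 KB segments, as in the issue description

        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storageCfg);

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.cluster().active(true);

            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("walTest");

            // A single ~10M-character value, far larger than one WAL segment.
            char[] big = new char[10_000_000];
            Arrays.fill(big, 'x');
            cache.put(1, new String(big)); // per the report, this ends in endless WAL archiving
        }
    }
}
{code}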

> Can't write value greater than wal segment
> --
>
> Key: IGNITE-5631
> URL: https://issues.apache.org/jira/browse/IGNITE-5631
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Affects Versions: 2.1
>Reporter: Alexander Belyak
>Priority: Minor
>
> Steps to reproduce: insert a value greater than the WAL segment size.
> Expected behavior: the value spans a few WAL segments and the insert succeeds.
> Current behavior: infinite writing of the WAL archive.
> For the test I used a 256 KB WAL segment size and a value built from a 10M-character String.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (IGNITE-12656) Cleanup GridCacheProcessor from functionality not related to its responsibility

2020-02-29 Thread Joel Lang (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048362#comment-17048362
 ] 

Joel Lang commented on IGNITE-12656:


[~slava.koptilin] I think moving the LocalAffinityFunction class would break 
node startup for anyone who is using a local cache with persistence enabled.
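
For reference, a hedged sketch of the kind of configuration that would be affected, assuming the public 2.x API (cache name is illustrative):
{code:java}
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

// A LOCAL cache on a node with native persistence enabled; per the concern above,
// such a cache records the internal LocalAffinityFunction in its stored configuration,
// so moving or renaming that class could break startup when the configuration is re-read.
public class LocalCacheConfigSketch {
    public static IgniteConfiguration build() {
        CacheConfiguration<Long, String> cacheCfg =
            new CacheConfiguration<Long, String>("localPersistentCache")
                .setCacheMode(CacheMode.LOCAL);

        return new IgniteConfiguration()
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
                    .setPersistenceEnabled(true)))
            .setCacheConfiguration(cacheCfg);
    }
}
{code}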

 

> Cleanup GridCacheProcessor from functionality not related to its 
> responsibility
> ---
>
> Key: IGNITE-12656
> URL: https://issues.apache.org/jira/browse/IGNITE-12656
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8
>Reporter: Vyacheslav Koptilin
>Assignee: Vyacheslav Koptilin
>Priority: Major
> Fix For: 2.9
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, GridCacheProcessor contains several pieces of functionality not 
> directly related to its responsibility, such as:
> * initQueryStructuresForNotStartedCache
> * addRemovedItemsCleanupTask
> * setTxOwnerDumpRequestsAllowed
> * longTransactionTimeDumpThreshold
> * transactionTimeDumpSamplesCoefficient
> * longTransactionTimeDumpSamplesPerSecondLimit
> * broadcastToNodesSupportingFeature
> * LocalAffinityFunction
> * RemovedItemsCleanupTask
> * TxTimeoutOnPartitionMapExchangeChangeFuture
> * enableRebalance
> We need to move them to the right places and make GridCacheProcessor code 
> cleaner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-11873) Enabling SQL On-heap Row Cache results in row cache being inconsistent with off-heap storage

2019-05-28 Thread Joel Lang (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-11873:
---
Priority: Major  (was: Minor)

> Enabling SQL On-heap Row Cache results in row cache being inconsistent with 
> off-heap storage
> 
>
> Key: IGNITE-11873
> URL: https://issues.apache.org/jira/browse/IGNITE-11873
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, persistence, sql
>Affects Versions: 2.7
>Reporter: Joel Lang
>Priority: Major
> Attachments: TestSQLBug.java, entry1.png, entry2.png
>
>
> When enabling the SQL On-heap Row Cache feature on a persistent, atomic, 
> replicated cache, I found that after a number of queries and updates, 
> averaging from 40 to 60 updates, the on-heap cache will become inconsistent 
> with the off-heap storage. This manifests on a single, non-clustered Ignite 
> node that I test with.
> Specifically I would query a cache using SQL for a specific entry, but when 
> updating the entry using a normal put() on the cache, the entry would not be 
> changed from the perspective of the next SQL query. This causes the business 
> code to not behave as expected.
> When examining the state of the cache from DBeaver using a select query, I've 
> found that the problem row in question is duplicated in the query results, 
> and out of order despite ordering the results by key:
> !entry1.png!
> Restarting Ignite to clear the on-heap cache reveals the actual row:
> !entry2.png!
> When looking at the state of H2RowCache from a heap dump, I found that there 
> were two different instances of GridH2KeyValueRowOnheap containing two 
> different instances of the cache value in different states: the one I'm 
> seeing and the one I'm trying to update it to.
> As a side effect of all of this, the ModifyingEntryProcessor always fails on 
> that row because "entryVal" is never equal to "val" when checked in the 
> process() method.
> I've attached a file I used to test the issue. That test revealed that it 
> only occurs when both persistence and SQL on-heap cache are enabled. If one 
> or the other is disabled then there is no issue.
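
For readers without the attachment, the pattern described above looks roughly like the following sketch (class, cache name, and query are illustrative; the actual reproducer is the attached TestSQLBug.java):
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class OnheapRowCacheSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
                    .setPersistenceEnabled(true)));      // persistence on...

        CacheConfiguration<Long, String> cacheCfg = new CacheConfiguration<Long, String>("test")
            .setCacheMode(CacheMode.REPLICATED)
            .setAtomicityMode(CacheAtomicityMode.ATOMIC)
            .setSqlOnheapCacheEnabled(true)              // ...together with the SQL on-heap row cache
            .setIndexedTypes(Long.class, String.class);

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.cluster().active(true);

            IgniteCache<Long, String> cache = ignite.getOrCreateCache(cacheCfg);

            // Alternate plain put() updates with SQL reads; per the report, after a few dozen
            // iterations the row returned by SQL may no longer reflect the latest put().
            for (int i = 0; i < 100; i++) {
                cache.put(1L, "value-" + i);

                Object sqlVal = cache.query(new SqlFieldsQuery(
                    "select _VAL from String where _KEY = ?").setArgs(1L)).getAll().get(0).get(0);

                if (!("value-" + i).equals(sqlVal))
                    System.out.println("Mismatch at iteration " + i + ": " + sqlVal);
            }
        }
    }
}
{code}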



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-11873) Enabling SQL On-heap Row Cache results in row cache being inconsistent with off-heap storage

2019-05-28 Thread Joel Lang (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-11873:
---
Description: 
When enabling the SQL On-heap Row Cache feature on a persistent, atomic, 
replicated cache, I found that after a number of queries and updates, averaging 
from 40 to 60 updates, the on-heap cache will become inconsistent with the 
off-heap storage. This manifests on a single, non-clustered Ignite node that I 
test with.

Specifically I would query a cache using SQL for a specific entry, but when 
updating the entry using a normal put() on the cache, the entry would not be 
changed from the perspective of the next SQL query. This causes the business 
code to not behave as expected.

When examining the state of the cache from DBeaver using a select query, I've 
found that the problem row in question is duplicated in the query results, and 
out of order despite ordering the results by key:

!entry1.png!

Restarting Ignite to clear the on-heap cache reveals the actual row:

!entry2.png!

When looking at the state of H2RowCache from a heap dump, I found that there 
were two different instances of GridH2KeyValueRowOnheap containing two 
different instances of the cache value in different states: the one I'm seeing 
and the one I'm trying to update it to.

As a side effect of all of this, the ModifyingEntryProcessor always fails on 
that row because "entryVal" is never equal to "val" when checked in the 
process() method.

I've attached a file I used to test the issue. That test revealed that it only 
occurs when both persistence and SQL on-heap cache are enabled. If one or the 
other is disabled then there is no issue.

  was:
When enabling the SQL On-heap Row Cache feature on a persistent, atomic, 
replicated cache, I found that after a number of queries and updates, averaging 
from 40 to 60 updates, the on-heap cache will become inconsistent with the 
off-heap storage. This manifests on a single, non-clustered Ignite node that I 
test with.

Specifically I would query a cache using SQL for a specific entry, but when 
updating the entry using a normal put() on the cache, the entry would not be 
changed from the perspective of the next SQL query. This causes the business 
code to not behave as expected.

When examining the state of the cache from DBeaver using a select query, I've 
found that the problem row in question is duplicated in the query results, and 
out of order despite ordering the results by key:

!entry1.png!

Restarting Ignite to clear the on-heap cache reveals the actual row:

!entry2.png!

When looking at the state of H2RowCache from a heap dump, I found that there 
were two different instances of GridH2KeyValueRowOnheap containing two 
different instances of the cache value in different states: the one I'm seeing 
and the one I'm trying to update it to.

As a side effect of all of this, the ModifyingEntryProcessor always fails on 
that row because "entryVal" is never equal to "val" when checked in the 
process() method.

If more information is needed to reproduce I can try to make a simple example 
next week after the holiday.


Attached a test file to reproduce the issue and updated the description.

The issue only seems to occur when both persistence and the SQL on-heap cache 
are enabled.

> Enabling SQL On-heap Row Cache results in row cache being inconsistent with 
> off-heap storage
> 
>
> Key: IGNITE-11873
> URL: https://issues.apache.org/jira/browse/IGNITE-11873
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, persistence, sql
>Affects Versions: 2.7
>Reporter: Joel Lang
>Priority: Minor
> Attachments: TestSQLBug.java, entry1.png, entry2.png
>
>
> When enabling the SQL On-heap Row Cache feature on a persistent, atomic, 
> replicated cache, I found that after a number of queries and updates, 
> averaging from 40 to 60 updates, the on-heap cache will become inconsistent 
> with the off-heap storage. This manifests on a single, non-clustered Ignite 
> node that I test with.
> Specifically I would query a cache using SQL for a specific entry, but when 
> updating the entry using a normal put() on the cache, the entry would not be 
> changed from the perspective of the next SQL query. This causes the business 
> code to not behave as expected.
> When examining the state of the cache from DBeaver using a select query, I've 
> found that the problem row in question is duplicated in the query results, 
> and out of order despite ordering the results by key:
> !entry1.png!
> Restarting Ignite to clear the on-heap cache reveals the actual row:
> !entry2.png!
> When looking at the state of H2RowCache from a heap dump, I found that there 
> were two different instances of 

[jira] [Updated] (IGNITE-11873) Enabling SQL On-heap Row Cache results in row cache being inconsistent with off-heap storage

2019-05-28 Thread Joel Lang (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-11873:
---
Attachment: TestSQLBug.java

> Enabling SQL On-heap Row Cache results in row cache being inconsistent with 
> off-heap storage
> 
>
> Key: IGNITE-11873
> URL: https://issues.apache.org/jira/browse/IGNITE-11873
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, persistence, sql
>Affects Versions: 2.7
>Reporter: Joel Lang
>Priority: Minor
> Attachments: TestSQLBug.java, entry1.png, entry2.png
>
>
> When enabling the SQL On-heap Row Cache feature on a persistent, atomic, 
> replicated cache, I found that after a number of queries and updates, 
> averaging from 40 to 60 updates, the on-heap cache will become inconsistent 
> with the off-heap storage. This manifests on a single, non-clustered Ignite 
> node that I test with.
> Specifically I would query a cache using SQL for a specific entry, but when 
> updating the entry using a normal put() on the cache, the entry would not be 
> changed from the perspective of the next SQL query. This causes the business 
> code to not behave as expected.
> When examining the state of the cache from DBeaver using a select query, I've 
> found that the problem row in question is duplicated in the query results, 
> and out of order despite ordering the results by key:
> !entry1.png!
> Restarting Ignite to clear the on-heap cache reveals the actual row:
> !entry2.png!
> When looking at the state of H2RowCache from a heap dump, I found that there 
> were two different instances of GridH2KeyValueRowOnheap containing two 
> different instances of the cache value in different states: the one I'm 
> seeing and the one I'm trying to update it to.
> As a side effect of all of this, the ModifyingEntryProcessor always fails on 
> that row because "entryVal" is never equal to "val" when checked in the 
> process() method.
> If more information is needed to reproduce I can try to make a simple example 
> next week after the holiday.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11873) Enabling SQL On-heap Row Cache results in row cache being inconsistent with off-heap storage

2019-05-24 Thread Joel Lang (JIRA)
Joel Lang created IGNITE-11873:
--

 Summary: Enabling SQL On-heap Row Cache results in row cache being 
inconsistent with off-heap storage
 Key: IGNITE-11873
 URL: https://issues.apache.org/jira/browse/IGNITE-11873
 Project: Ignite
  Issue Type: Bug
  Components: cache, persistence, sql
Affects Versions: 2.7
Reporter: Joel Lang
 Attachments: entry1.png, entry2.png

When enabling the SQL On-heap Row Cache feature on a persistent, atomic, 
replicated cache, I found that after a number of queries and updates, averaging 
from 40 to 60 updates, the on-heap cache will become inconsistent with the 
off-heap storage. This manifests on a single, non-clustered Ignite node that I 
test with.

Specifically I would query a cache using SQL for a specific entry, but when 
updating the entry using a normal put() on the cache, the entry would not be 
changed from the perspective of the next SQL query. This causes the business 
code to not behave as expected.

When examining the state of the cache from DBeaver using a select query, I've 
found that the problem row in question is duplicated in the query results, and 
out of order despite ordering the results by key:

!entry1.png!

Restarting Ignite to clear the on-heap cache reveals the actual row:

!entry2.png!

When looking at the state of H2RowCache from a heap dump, I found that there 
were two different instances of GridH2KeyValueRowOnheap containing two 
different instances of the cache value in different states: the one I'm seeing 
and the one I'm trying to update it to.

As a side effect of all of this, the ModifyingEntryProcessor always fails on 
that row because "entryVal" is never equal to "val" when checked in the 
process() method.

If more information is needed to reproduce I can try to make a simple example 
next week after the holiday.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-9031) SpringCacheManager throws AssertionError during Spring initialization

2018-07-19 Thread Joel Lang (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549421#comment-16549421
 ] 

Joel Lang edited comment on IGNITE-9031 at 7/19/18 3:24 PM:


[~aakhmedov] you described the setup I have. Ignite is being started inside a 
host application which creates more than one {{ApplicationContext}}. This isn't 
Spring MVC.

This host application first creates a {{FileSystemXmlApplicationContext}} using 
root-app-context.xml, which imports ignite.xml.

Later it creates another {{ClassPathXmlApplicationContext}} using 
extensions.xml, which uses the previously created context as its parent. That is 
the line in the stack trace where the assertion happens. This should be the root 
cause.

I can post a stripped-down version of the root-app-context.xml and ignite.xml:

*root-app-context.xml:*
{code:xml}
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans-4.3.xsd">

    <!-- bean definitions not preserved in the archive: an import of ignite.xml
         and beans referencing cluster.properties -->

</beans>
{code}
*ignite.xml:*
{code:xml}
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans-4.3.xsd
       http://www.springframework.org/schema/util
       http://www.springframework.org/schema/util/spring-util-4.3.xsd">

    <!-- bean definitions not preserved in the archive -->

</beans>
{code}


was (Author: langj):
[~aakhmedov] you described the setup I have. Ignite is being started inside of 
a host application which creates more than one {{ApplicationContext}}. This 
isn't Spring MVC.

This host application first creates a {{FileSystemXmlApplicationContext}} using 
root-app-context.xml, which imports ignite.xml.

Later it creates another {{ClassPathXmlApplicationContext}} using 
extensions.xml which uses the previously created context as its parent. This is 
the line in the stack trace when the assertion happens. This would be the root 
cause.

I can post a stripped-down version of the root-app-context.xml and ignite.xml:

*root-app-context.xml:*
{code:xml}
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans-4.3.xsd">

    <!-- bean definitions not preserved in the archive: an import of ignite.xml
         and beans referencing cluster.properties -->

</beans>
{code}
*ignite.xml:*
{code:xml}
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans-4.3.xsd
       http://www.springframework.org/schema/util
       http://www.springframework.org/schema/util/spring-util-4.3.xsd">

    <!-- bean definitions not preserved in the archive -->

</beans>
{code}

> SpringCacheManager throws AssertionError during Spring initialization
> -
>
> Key: IGNITE-9031
> URL: https://issues.apache.org/jira/browse/IGNITE-9031
> Project: Ignite
>  Issue Type: Bug
>  Components: spring
>Affects Versions: 2.6
>Reporter: Joel Lang
>Assignee: Amir Akhmedov
>Priority: Major
>
> When initializing Ignite using an IgniteSpringBean and also having a 
> SpringCacheManager defined, the SpringCacheManager throws an AssertionError 
> in the onApplicationEvent() method due to it being called more than once.
> There is an "assert ignite == null" that fails after the first call.
> This is related to the changes in IGNITE-8740. 

[jira] [Commented] (IGNITE-9031) SpringCacheManager throws AssertionError during Spring initialization

2018-07-19 Thread Joel Lang (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549421#comment-16549421
 ] 

Joel Lang commented on IGNITE-9031:
---

[~aakhmedov] you described the setup I have. Ignite is being started inside a 
host application which creates more than one {{ApplicationContext}}. This isn't 
Spring MVC.

This host application first creates a {{FileSystemXmlApplicationContext}} using 
root-app-context.xml, which imports ignite.xml.

Later it creates another {{ClassPathXmlApplicationContext}} using 
extensions.xml, which uses the previously created context as its parent. That is 
the line in the stack trace where the assertion happens. This would be the root 
cause.

I can post a stripped-down version of the root-app-context.xml and ignite.xml:

*root-app-context.xml:*
{code:xml}
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans-4.3.xsd">

    <!-- bean definitions not preserved in the archive: an import of ignite.xml
         and beans referencing cluster.properties -->

</beans>
{code}
*ignite.xml:*
{code:xml}
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans-4.3.xsd
       http://www.springframework.org/schema/util
       http://www.springframework.org/schema/util/spring-util-4.3.xsd">

    <!-- bean definitions not preserved in the archive -->

</beans>
{code}

> SpringCacheManager throws AssertionError during Spring initialization
> -
>
> Key: IGNITE-9031
> URL: https://issues.apache.org/jira/browse/IGNITE-9031
> Project: Ignite
>  Issue Type: Bug
>  Components: spring
>Affects Versions: 2.6
>Reporter: Joel Lang
>Assignee: Amir Akhmedov
>Priority: Major
>
> When initializing Ignite using an IgniteSpringBean and also having a 
> SpringCacheManager defined, the SpringCacheManager throws an AssertionError 
> in the onApplicationEvent() method due to it being called more than once.
> There is an "assert ignite == null" that fails after the first call.
> This is related to the changes in IGNITE-8740. This happened immediately when 
> I first tried to start Ignite after upgrading from 2.5 to 2.6.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9031) SpringCacheManager throws AssertionError during Spring initialization

2018-07-18 Thread Joel Lang (JIRA)
Joel Lang created IGNITE-9031:
-

 Summary: SpringCacheManager throws AssertionError during Spring 
initialization
 Key: IGNITE-9031
 URL: https://issues.apache.org/jira/browse/IGNITE-9031
 Project: Ignite
  Issue Type: Bug
  Components: spring
Affects Versions: 2.6
Reporter: Joel Lang


When initializing Ignite using an IgniteSpringBean and also having a 
SpringCacheManager defined, the SpringCacheManager throws an AssertionError in 
the onApplicationEvent() method due to it being called more than once.

There is an "assert ignite == null" that fails after the first call.

This is related to the changes in IGNITE-8740. This happened immediately when I 
first tried to start Ignite after upgrading from 2.5 to 2.6.
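
A minimal sketch of a startup sequence that triggers this, assuming the multi-context host setup described in the follow-up comments (file names are taken from those comments; this is not the actual host application code):
{code:java}
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;
import org.springframework.context.support.FileSystemXmlApplicationContext;

public class HostAppStartupSketch {
    public static void main(String[] args) {
        // First context: imports ignite.xml, which defines IgniteSpringBean and SpringCacheManager.
        ApplicationContext root = new FileSystemXmlApplicationContext("root-app-context.xml");

        // Second context using the first as its parent. Its ContextRefreshedEvent propagates to
        // listeners in the parent context, so SpringCacheManager.onApplicationEvent() runs a
        // second time and the "assert ignite == null" check fails.
        ClassPathXmlApplicationContext extensions =
            new ClassPathXmlApplicationContext(new String[] {"extensions.xml"}, root);
    }
}
{code}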



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-4150) B-Tree index cannot be used efficiently with IN clause.

2018-05-17 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479715#comment-16479715
 ] 

Joel Lang commented on IGNITE-4150:
---

[~vozerov] the expected fix version for this issue has been changed many times; 
do you think it can make it into Ignite 2.6?

> B-Tree index cannot be used efficiently with IN clause.
> ---
>
> Key: IGNITE-4150
> URL: https://issues.apache.org/jira/browse/IGNITE-4150
> Project: Ignite
>  Issue Type: Task
>  Components: sql
>Affects Versions: 1.7
>Reporter: Vladimir Ozerov
>Assignee: Vladimir Ozerov
>Priority: Major
>  Labels: performance
> Fix For: 2.6
>
>
> Consider the following query:
> {code}
> SELECT * FROM table
> WHERE a = ? AND b IN (?, ?)
> {code}
> If there is an index {{(a, b)}}, it will not be used properly: only column 
> {{a}} will be used. This leads to multiple unnecessary comparisons.
> The most obvious way to fix that is to use a temporary table and a {{JOIN}}. However, 
> this approach doesn't work well when there are multiple {{IN}}'s. 
> A proper solution would be to hack deeper into H2.
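
Until this lands, a commonly used workaround is to turn the {{IN}} list into a join against H2's {{TABLE}} function so the composite index can be used; a hedged sketch in Java (table and column names are illustrative, not from this issue):
{code:java}
import org.apache.ignite.cache.query.SqlFieldsQuery;

// Assumed workaround sketch, not the fix tracked by this issue: join against
// H2's TABLE(...) function instead of using IN, so an index on (a, b) can be
// probed for each supplied value of b.
SqlFieldsQuery qry = new SqlFieldsQuery(
    "SELECT t.* FROM MyTable t " +
    "JOIN TABLE(b BIGINT = ?) vals ON t.b = vals.b " +
    "WHERE t.a = ?")
    .setArgs(new Object[] {new Long[] {1L, 2L}, 10L});
{code}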



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-8385) SQL: Allow variable-length values in index leafs

2018-05-03 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462846#comment-16462846
 ] 

Joel Lang commented on IGNITE-8385:
---

I spent some time looking at the source code and debugging to figure out how 
things worked.

A surprising issue I found is that if the indexed field is a timestamp (16 
bytes) and the cache key is a string, then inline comparison doesn't happen at 
all. The computeInlineSize() method defaults to 
IGNITE_MAX_INDEX_PAYLOAD_SIZE_DEFAULT, which is 10, so a data page lookup always 
happens because a partial comparison isn't possible. [~vozerov], could you 
confirm whether this is intended behavior? I did not see any documentation about 
it, even though the inline size seems critical to performance here.

I'm fixing this by setting inlineSize on the QuerySqlField annotation to 17 for 
the timestamp field, which also makes it ignore the type of the _KEY column.

I also see that partial string comparison can still happen inline even if the 
whole string wouldn't fit into the inline size. That's good then.
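
For illustration, the change described above would look roughly like this (the class and field names are mine, not from the actual data model):
{code:java}
import java.sql.Timestamp;

import org.apache.ignite.cache.query.annotations.QuerySqlField;

public class Event {
    /**
     * Force the full timestamp to be inlined (16 bytes plus the inline type header),
     * overriding the 10-byte default that kicks in when the cache key is a String.
     */
    @QuerySqlField(index = true, inlineSize = 17)
    private Timestamp created;
}
{code}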

> SQL: Allow variable-length values in index leafs
> 
>
> Key: IGNITE-8385
> URL: https://issues.apache.org/jira/browse/IGNITE-8385
> Project: Ignite
>  Issue Type: Task
>  Components: sql
>Affects Versions: 2.4
>Reporter: Vladimir Ozerov
>Priority: Major
>  Labels: iep-19, performance
> Fix For: 2.6
>
>
> Currently we have a restriction that every entry inside a BTree leaf should 
> be of fixed size. This restriction is artificial and prevents efficient index 
> usage because we have to choose so-called {{inline size}} for every index 
> manually. This is OK for fixed-size numeric types. But this could be a 
> problem for varlen types such as {{VARCHAR}} because in some cases we cannot 
> fit the whole value and have to fall back to a data page lookup. In other cases 
> we may pick a too-pessimistic inline size value and index space would be 
> wasted. 
> What we need to do is to allow arbitrary item size in index pages. In this 
> case we would be able to inline all necessary values into index pages in most 
> cases. 
> Please pay attention that we may still hit page overflow in case too-long 
> data types are used. To mitigate this we should:
> 1) Implement IGNITE-6055 first so that we can distinguish between limited and 
> unlimited data types.
> 2) Unlimited data types should be inlined only partially
> 3) We need to have special handling for too long rows (probably just re-use 
> existing logic with minimal adjustments)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-8385) SQL: Allow variable-length values in index leafs

2018-05-02 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461844#comment-16461844
 ] 

Joel Lang commented on IGNITE-8385:
---

I did not realize there were situations where the actual indexed value would 
not be stored in the index itself. This might explain a great deal of the 
performance issue I had when using random UUID string indexes with an HDD. While 
the string contains a UUID, that is not an actual requirement, so I can't change 
the data type to the Java UUID class yet.

After looking through the code I see CacheConfiguration has the max inline size 
option, which I'm going to try raising to 36 to fit a complete UUID string.
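
A sketch of that change, assuming the {{CacheConfiguration}} API (cache name and value type are placeholders):
{code:java}
import org.apache.ignite.configuration.CacheConfiguration;

// Raise the per-cache cap on index inline size so a 36-character UUID string can be
// compared inline; a few extra bytes may still be needed for the inline type/length header.
CacheConfiguration<String, Object> cfg = new CacheConfiguration<String, Object>("uuidKeyedCache")
    .setSqlIndexMaxInlineSize(36);
{code}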

> SQL: Allow variable-length values in index leafs
> 
>
> Key: IGNITE-8385
> URL: https://issues.apache.org/jira/browse/IGNITE-8385
> Project: Ignite
>  Issue Type: Task
>  Components: sql
>Affects Versions: 2.4
>Reporter: Vladimir Ozerov
>Priority: Major
>  Labels: iep-19, performance
> Fix For: 2.6
>
>
> Currently we have a restriction that every entry inside a BTree leaf should 
> be of fixed size. This restriction is artificial and prevents efficient index 
> usage because we have to choose so-called {{inline size}} for every index 
> manually. This is OK for fixed-size numeric types. But this could be a 
> problem for varlen types such as {{VARCHAR}} because in some cases we cannot 
> fit the whole value and have to fall back to a data page lookup. In other cases 
> we may pick a too-pessimistic inline size value and index space would be 
> wasted. 
> What we need to do is to allow arbitrary item size in index pages. In this 
> case we would be able to inline all necessary values into index pages in most 
> cases. 
> Please pay attention that we may still hit page overflow in case too-long 
> data types are used. To mitigate this we should:
> 1) Implement IGNITE-6055 first so that we can distinguish between limited and 
> unlimited data types.
> 2) Unlimited data types should be inlined only partially
> 3) We need to have special handling for too long rows (probably just re-use 
> existing logic with minimal adjustments)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-8384) SQL: Secondary indexes should sort entries by links rather than keys

2018-05-01 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459550#comment-16459550
 ] 

Joel Lang commented on IGNITE-8384:
---

How much of an impact would this have on the amount of page reads necessary to 
add a new entry to the index?

> SQL: Secondary indexes should sort entries by links rather than keys
> 
>
> Key: IGNITE-8384
> URL: https://issues.apache.org/jira/browse/IGNITE-8384
> Project: Ignite
>  Issue Type: Task
>  Components: sql
>Affects Versions: 2.4
>Reporter: Vladimir Ozerov
>Priority: Major
>  Labels: iep-19, performance
> Fix For: 2.6
>
>
> Currently we sort entries in secondary indexes as {{(idx_cols, KEY)}}. The 
> key itself is not stored in the index in the general case. It means that we 
> need to perform a lookup to the data page to find the correct insertion point 
> for an index entry.
> This could be fixed easily by sorting entries a bit differently: {{(idx_cols, 
> link)}}. This is all we need.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-8359) Severe performance degradation with persistence and data streaming on HDD

2018-05-01 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459546#comment-16459546
 ] 

Joel Lang commented on IGNITE-8359:
---

So, using the 2.5 nightly build from April 27th, I ran the code to generate the 
6,000,000 entries for cache A and then the 12,000,000 entries for cache B, using 
the data streamer for each. This was again in a Linux VM operating on an HDD.

This was started on Friday before I left work. When I came back into the office 
I found that it had still not finished; it was about 95-97% done. The fact that 
it didn't finish over such a long period of time is a bit obscene.
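
For reference, the load was driven through the standard streamer API; a simplified sketch of the loop (cache name, key and value types are placeholders, not the actual model):
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;

public class StreamerSketch {
    // Stream entries into cache B; the streamer batches the puts, and each batch
    // triggers the SQL index updates that become the bottleneck described below.
    static void loadCacheB(Ignite ignite) {
        try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("cacheB")) {
            streamer.perNodeBufferSize(2000);   // similar to the 2000-entry batches mentioned

            for (long i = 0; i < 12_000_000L; i++)
                streamer.addData(i, "comment-" + i);
        }                                       // close() flushes any remaining buffered entries
    }
}
{code}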

> Severe performance degradation with persistence and data streaming on HDD
> -
>
> Key: IGNITE-8359
> URL: https://issues.apache.org/jira/browse/IGNITE-8359
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, persistence, sql, streaming
>Affects Versions: 2.4, 2.5
> Environment: Linux CentOS 7 VM using Ignite DirectIO plugin with HDD.
>Reporter: Joel Lang
>Priority: Major
>
> I am testing the use of Ignite's native persistence to store a data set long 
> term. This is on a 2.5 nightly build. To do this I am using Ignite's data 
> streamers to stream in 6,000,000 entries into cache A, and 12,000,000 entries 
> into cache B to simulate the upper limit for 2 years' worth of data.
> The test ran smoothly on my personal machine, which has an SSD and runs Windows, 
> but ran into tremendous issues on a development test machine which is a Linux 
> VM using an HDD. I realize when looking at the Ignite documentation that it 
> specifically excludes HDDs as something to base a persistent store on, but 
> perhaps my experience could yield improvements for SSD performance too.
> The root issue is that cache updates over time become severely bottlenecked 
> by reading SQL index pages from disk in order to update the index. If I had 
> to guess this would be related to BPlusTree.findInsertionPoint() and it 
> having to load pages from disk if they've been evicted.
> I used a 2.5 nightly build because 2.3 and 2.4 have the same issue where this 
> whole process was further bottlenecked by a lock being held by Ignite while 
> it read the page from disk in PageMemoryImpl.acquirePage(). 2.5 fixed this.
> The performance issue was much more severe in the previously mentioned cache 
> B, which contains user comments on entries in cache A. The key for each 
> comment entry is a Java class containing the creation timestamp and the 
> string key of the owning entry in cache A. This owning entry key is indexed 
> so comments can be queried by their owner. In this test case there were two 
> comments in cache B for every entry in cache A.
> I found that even 25% of the way through streaming data into cache B, it 
> would take anywhere from 15 to 35 seconds to insert a batch of 2000 comments. 
> This slowed streaming to a crawl and ensures that streaming would need to 
> continue overnight to have any hope of finishing.
> This also brings up concerns about data rebalancing which will have the same 
> performance penalty and similarly take a day at least to rebalance both 
> caches.
> I am worried about the dependency on a large amount of disk reads being done 
> to update the index, even though it is considerably faster with an SSD than 
> without. I've also not been able to test whether performance for an SSD will 
> be different when running in a VM, which is another worry.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-8359) Severe performance degradation with persistence and data streaming on HDD

2018-04-23 Thread Joel Lang (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-8359:
--
Description: 
I am testing the use of Ignite's native persistence to store a data set long 
term. This is on a 2.5 nightly build. To do this I am using Ignite's data 
streamers to stream in 6,000,000 entries into cache A, and 12,000,000 entries 
into cache B to simulate the upper limit for 2 years' worth of data.

The test ran smoothly on my personal machine, which has an SSD and runs Windows, 
but ran into tremendous issues on a development test machine which is a Linux 
VM using an HDD. I realize when looking at the Ignite documentation that it 
specifically excludes HDDs as something to base a persistent store on, but 
perhaps my experience could yield improvements for SSD performance too.

The root issue is that cache updates over time become severely bottlenecked by 
reading SQL index pages from disk in order to update the index. If I had to 
guess this would be related to BPlusTree.findInsertionPoint() and it having to 
load pages from disk if they've been evicted.

I used a 2.5 nightly build because 2.3 and 2.4 have the same issue where this 
whole process was further bottlenecked by a lock being held by Ignite while it 
read the page from disk in PageMemoryImpl.acquirePage(). 2.5 fixed this.

The performance issue was much more severe in the previously mentioned cache B, 
which contains user comments on entries in cache A. The key for each comment 
entry is a Java class containing the creation timestamp and the string key of 
the owning entry in cache A. This owning entry key is indexed so comments can 
be queried by their owner. In this test case there were two comments in cache B 
for every entry in cache A.

I found that even 25% of the way through streaming data into cache B, it would 
take anywhere from 15 to 35 seconds to insert a batch of 2000 comments. This 
slowed streaming to a crawl and ensures that streaming would need to continue 
overnight to have any hope of finishing.

This also brings up concerns about data rebalancing which will have the same 
performance penalty and similarly take a day at least to rebalance both caches.

I am worried about the dependency on a large amount of disk reads being done to 
update the index, even though it is considerably faster with an SSD than 
without. I've also not been able to test whether performance for an SSD will be 
different when running in a VM, which is another worry.

  was:
I am testing the use of Ignite's native persistence to store a data set long 
term. This is on a 2.5 nightly build. To do this I am using Ignite's data 
streamers to stream in 6,000,000 entries into cache A, and 12,000,000 entries 
into cache B to simulate the upper limit for 2 years' worth of data.

The test ran smoothly on my personal machine, which has an SSD and runs Windows, 
but ran into tremendous issues on a development test machine which is a Linux 
VM using an HDD. I realize when looking at the Ignite documentation that it 
specifically excludes HDDs as something to base a persistent store on, but 
perhaps my experience could yield improvements for SSD performance too.

The root issue is that cache updates over time become severely bottlenecked by 
reading SQL index pages from disk in order to update the index. If I had to 
guess this would be related to BPlusTree.findInsertionPoint() and it having to 
load pages from disk if they've been evicted.

I used a 2.5 nightly build because 2.3 and 2.4 have the same issue where this 
whole process was further bottlenecked by a lock being held by Ignite while it 
read the page from disk in PageMemoryImpl.acquirePage(). 2.5 fixed this.

The performance issue was much more severe in the previously mentioned cache B, 
which contains user comments on entries in cache A. The key for each comment 
entry is a Java class containing the creation timestamp and the string key of 
the owning entry in cache A. This owning entry key is indexed so comments can 
be queried by their owner. In this test case there were two comments in cache B 
for every entry in cache A.

I found that even 25% of the way through streaming data into cache B, it would 
take anywhere from 15 to 35 seconds to insert a batch of 2000 comments. This 
slowed streaming to a crawl and ensures that streaming would need to continue 
overnight to have any hope of finishing.

This also brings up concerns about data rebalancing which will have the same 
performance penalty and similarly take a day at least to rebalance both caches.

I am worried about the dependency on a large amount of disk reads being done to 
update the index, even though it is considerably faster with an SSD than 
without.


> Severe performance degradation with persistence and data streaming on HDD
> -
>
> Key: IGNITE-8359
>  

[jira] [Created] (IGNITE-8359) Severe performance degradation with persistence and data streaming on HDD

2018-04-23 Thread Joel Lang (JIRA)
Joel Lang created IGNITE-8359:
-

 Summary: Severe performance degradation with persistence and data 
streaming on HDD
 Key: IGNITE-8359
 URL: https://issues.apache.org/jira/browse/IGNITE-8359
 Project: Ignite
  Issue Type: Bug
  Components: cache, persistence, sql, streaming
Affects Versions: 2.4, 2.5
 Environment: Linux CentOS 7 VM using Ignite DirectIO plugin with HDD.
Reporter: Joel Lang


I am testing the use of Ignite's native persistence to store a data set long 
term. This is on a 2.5 nightly build. To do this I am using Ignite's data 
streamers to stream in 6,000,000 entries into cache A, and 12,000,000 entries 
into cache B to simulate the upper limit for 2 years' worth of data.

The test ran smoothly on my personal machine, which has an SSD and runs Windows, 
but ran into tremendous issues on a development test machine which is a Linux 
VM using an HDD. I realize when looking at the Ignite documentation that it 
specifically excludes HDDs as something to base a persistent store on, but 
perhaps my experience could yield improvements for SSD performance too.

The root issue is that cache updates over time become severely bottlenecked by 
reading SQL index pages from disk in order to update the index. If I had to 
guess this would be related to BPlusTree.findInsertionPoint() and it having to 
load pages from disk if they've been evicted.

I used a 2.5 nightly build because 2.3 and 2.4 have the same issue where this 
whole process was further bottlenecked by a lock being held by Ignite while it 
read the page from disk in PageMemoryImpl.acquirePage(). 2.5 fixed this.

The performance issue was much more severe in the previously mentioned cache B, 
which contains user comments on entries in cache A. The key for each comment 
entry is a Java class containing the creation timestamp and the string key of 
the owning entry in cache A. This owning entry key is indexed so comments can 
be queried by their owner. In this test case there were two comments in cache B 
for every entry in cache A.

I found that even 25% of the way through streaming data into cache B, it would 
take anywhere from 15 to 35 seconds to insert a batch of 2000 comments. This 
slowed streaming to a crawl and ensures that streaming would need to continue 
overnight to have any hope of finishing.

This also brings up concerns about data rebalancing which will have the same 
performance penalty and similarly take a day at least to rebalance both caches.

I am worried about the dependency on a large amount of disk reads being done to 
update the index, even though it is considerably faster with an SSD than 
without.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-7812) Slow rebalancing in case of enabled persistence

2018-04-23 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448250#comment-16448250
 ] 

Joel Lang commented on IGNITE-7812:
---

It seems my optimism was premature. Even with testing with a 2.5 nightly build, 
disk reads as a result of index updates still become the performance bottleneck.

> Slow rebalancing in case of enabled persistence
> ---
>
> Key: IGNITE-7812
> URL: https://issues.apache.org/jira/browse/IGNITE-7812
> Project: Ignite
>  Issue Type: Task
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>
> A user reported that rebalancing takes significantly larger amounts of time 
> when persistence is enabled even in LOG_ONLY mode.
> Need to investigate how the performance of rebalancing may be increased.
> Also, it would be great to estimate the benefit of file transfer for 
> rebalancing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-7812) Slow rebalancing in case of enabled persistence

2018-04-07 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429252#comment-16429252
 ] 

Joel Lang edited comment on IGNITE-7812 at 4/7/18 6:02 AM:
---

I've found that this happens after page eviction begins for the cache. It seems 
to be related to file reads in PageMemoryImpl.acquirePage(), primarily as a 
result of the H2 query index being updated when new rows are received during 
rebalancing. The segment write lock is held during these read operations which 
dramatically slows down other operations. It was a crippling slowdown in my 
test case (50x or more), which was done on a development VM which uses an HDD, 
not an SSD. We would be looking at 1-2 days of rebalance time for a cache with 6 
million entries and 2 indexes in that case.

Fortunately, it seems as if this has already been fixed for 2.5, the disk read 
happens after the segment write lock is released now: 
[PageMemoryImpl.java#L750|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/pagemem/PageMemoryImpl.java#L750]


was (Author: langj):
I've found that this happens after page eviction begins for the cache. It seems 
to be related to file reads in PageMemoryImpl.acquirePage(), primarily as a 
result of the H2 query index being updated when new rows are received during 
rebalancing. The segment write lock is held during these read operations which 
dramatically slows down other operations. It was a crippling slowdown in my 
test case which was done on a development VM which uses an HDD, not an SSD. We 
would be looking at 1-2 days of rebalance time for a cache with 6 million 
entries and 2 indexes in that case.

Fortunately, it seems as if this has already been fixed for 2.5, the disk read 
happens after the segment write lock is released now: 
[PageMemoryImpl.java#L750|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/pagemem/PageMemoryImpl.java#L750]

> Slow rebalancing in case of enabled persistence
> ---
>
> Key: IGNITE-7812
> URL: https://issues.apache.org/jira/browse/IGNITE-7812
> Project: Ignite
>  Issue Type: Task
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>
> A user reported that rebalancing takes significantly larger amounts of time 
> when persistence is enabled even in LOG_ONLY mode.
> Need to investigate how the performance of rebalancing may be increased.
> Also, it would be great to estimate the benefit of file transfer for 
> rebalancing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-7812) Slow rebalancing in case of enabled persistence

2018-04-07 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429252#comment-16429252
 ] 

Joel Lang edited comment on IGNITE-7812 at 4/7/18 6:02 AM:
---

I've found that this happens after page eviction begins for the cache. It seems 
to be related to file reads in PageMemoryImpl.acquirePage(), primarily as a 
result of the H2 query index being updated when new rows are received during 
rebalancing. The segment write lock is held during these read operations which 
dramatically slows down other operations. It was a crippling slowdown in my 
test case (50x or more), which was done on a development VM which uses an HDD, 
not an SSD. We would be looking at 1-2 days of rebalance time for a cache with 6 
million entries and 2 indexes in that case. While we do plan to use an SSD for 
production, this is still an excessive slowdown.

Fortunately, it seems as if this has already been fixed for 2.5, the disk read 
happens after the segment write lock is released now: 
[PageMemoryImpl.java#L750|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/pagemem/PageMemoryImpl.java#L750]


was (Author: langj):
I've found that this happens after page eviction begins for the cache. It seems 
to be related to file reads in PageMemoryImpl.acquirePage(), primarily as a 
result of the H2 query index being updated when new rows are received during 
rebalancing. The segment write lock is held during these read operations which 
dramatically slows down other operations. It was a crippling slowdown in my 
test case (50x or more), which was done on a development VM which uses an HDD, 
not an SSD. We would be looking at 1-2 days of rebalance time for a cache with 6 
million entries and 2 indexes in that case.

Fortunately, it seems as if this has already been fixed for 2.5, the disk read 
happens after the segment write lock is released now: 
[PageMemoryImpl.java#L750|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/pagemem/PageMemoryImpl.java#L750]

> Slow rebalancing in case of enabled persistence
> ---
>
> Key: IGNITE-7812
> URL: https://issues.apache.org/jira/browse/IGNITE-7812
> Project: Ignite
>  Issue Type: Task
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>
> A user reported that rebalancing takes significantly larger amounts of time 
> when persistence is enabled even in LOG_ONLY mode.
> Need to investigate how the performance of rebalancing may be increased.
> Also, it would be great to estimate the benefit of file transfer for 
> rebalancing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-7812) Slow rebalancing in case of enabled persistence

2018-04-07 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429252#comment-16429252
 ] 

Joel Lang edited comment on IGNITE-7812 at 4/7/18 6:01 AM:
---

I've found that this happens after page eviction begins for the cache. It seems 
to be related to file reads in PageMemoryImpl.acquirePage(), primarily as a 
result of the H2 query index being updated when new rows are received during 
rebalancing. The segment write lock is held during these read operations which 
dramatically slows down other operations. It was a crippling slowdown in my 
test case which was done on a development VM which uses an HDD, not an SSD. We 
would be looking at 1-2 days of rebalance time for a cache with 6 million 
entries and 2 indexes in that case.

Fortunately, it seems as if this has already been fixed for 2.5, the disk read 
happens after the segment write lock is released now: 
[PageMemoryImpl.java#L750|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/pagemem/PageMemoryImpl.java#L750]


was (Author: langj):
I've found that this happens after page eviction begins for the cache. It seems 
to be related to file reads in PageMemoryImpl.acquirePage() as a result of the 
H2 query index being updated when new rows are received during rebalancing. The 
segment write lock is held during these read operations which dramatically 
slows down other operations. It was a crippling slowdown in my test case which 
was done on a development VM which uses an HDD, not an SSD. We would be looking 
at 1-2 days of rebalance time for a cache with 6 million entries and 2 indexes 
in that case.

Fortunately, it seems as if this has already been fixed for 2.5, the disk read 
happens after the segment write lock is released now: 
[PageMemoryImpl.java#L750|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/pagemem/PageMemoryImpl.java#L750]

> Slow rebalancing in case of enabled persistence
> ---
>
> Key: IGNITE-7812
> URL: https://issues.apache.org/jira/browse/IGNITE-7812
> Project: Ignite
>  Issue Type: Task
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>
> A user reported that rebalancing takes significantly larger amounts of time 
> when persistence is enabled even in LOG_ONLY mode.
> Need to investigate how the performance of rebalancing may be increased.
> Also, it would be great to estimate the benefit of file transfer for 
> rebalancing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-7812) Slow rebalancing in case of enabled persistence

2018-04-07 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429252#comment-16429252
 ] 

Joel Lang commented on IGNITE-7812:
---

I've found that this happens after page eviction begins for the cache. It seems 
to be related to file reads in PageMemoryImpl.acquirePage() as a result of the 
H2 query index being updated when new rows are received during rebalancing. The 
segment write lock is held during these read operations which dramatically 
slows down other operations. It was a crippling slowdown in my test case which 
was done on a development VM which uses an HDD, not an SSD. We would be looking 
at 1-2 days of rebalance time for a cache with 6 million entries and 2 indexes 
in that case.

Fortunately, it seems as if this has already been fixed for 2.5, the disk read 
happens after the segment write lock is released now: 
[PageMemoryImpl.java#L750|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/pagemem/PageMemoryImpl.java#L750]

> Slow rebalancing in case of enabled persistence
> ---
>
> Key: IGNITE-7812
> URL: https://issues.apache.org/jira/browse/IGNITE-7812
> Project: Ignite
>  Issue Type: Task
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>
> A user reported that rebalancing takes significantly larger amounts of time 
> when persistence is enabled even in LOG_ONLY mode.
> Need to investigate how the performance of rebalancing may be increased.
> Also, it would be great to estimate the benefit of file transfer for 
> rebalancing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-6113) Partition eviction prevents exchange from completion

2018-04-05 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427051#comment-16427051
 ] 

Joel Lang commented on IGNITE-6113:
---

Is there a workaround for this issue?

> Partition eviction prevents exchange from completion
> 
>
> Key: IGNITE-6113
> URL: https://issues.apache.org/jira/browse/IGNITE-6113
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, persistence
>Affects Versions: 2.1
>Reporter: Vladislav Pyatkov
>Assignee: Alexey Goncharuk
>Priority: Major
> Fix For: 2.5
>
>
> I have waited for 3 hours for completion without any success.
> exchange-worker is blocked.
> {noformat}
> "exchange-worker-#92%DPL_GRID%grid554.ca.sbrf.ru%" #173 prio=5 os_prio=0 
> tid=0x7f0835c2e000 nid=0xb907 runnable [0x7e74ab1d]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7efee630a7c0> (a 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition$1)
> at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:189)
> at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:139)
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.assign(GridDhtPreloader.java:340)
> at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:1801)
> at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> at java.lang.Thread.run(Thread.java:748)
>Locked ownable synchronizers:
> - None
> {noformat}
> {noformat}
> "sys-#124%DPL_GRID%grid554.ca.sbrf.ru%" #278 prio=5 os_prio=0 
> tid=0x7e731c02d000 nid=0xbf4d runnable [0x7e734e7f7000]
>java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:51)
> at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
> - locked <0x7f056161bf88> (a java.lang.Object)
> at 
> org.gridgain.grid.cache.db.wal.FileWriteAheadLogManager$FileWriteHandle.writeBuffer(FileWriteAheadLogManager.java:1829)
> at 
> org.gridgain.grid.cache.db.wal.FileWriteAheadLogManager$FileWriteHandle.flush(FileWriteAheadLogManager.java:1572)
> at 
> org.gridgain.grid.cache.db.wal.FileWriteAheadLogManager$FileWriteHandle.addRecord(FileWriteAheadLogManager.java:1421)
> at 
> org.gridgain.grid.cache.db.wal.FileWriteAheadLogManager$FileWriteHandle.access$800(FileWriteAheadLogManager.java:1331)
> at 
> org.gridgain.grid.cache.db.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:339)
> at 
> org.gridgain.grid.internal.processors.cache.database.pagemem.PageMemoryImpl.beforeReleaseWrite(PageMemoryImpl.java:1287)
> at 
> org.gridgain.grid.internal.processors.cache.database.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1142)
> at 
> org.gridgain.grid.internal.processors.cache.database.pagemem.PageImpl.releaseWrite(PageImpl.java:167)
> at 
> org.apache.ignite.internal.processors.cache.database.tree.util.PageHandler.writeUnlock(PageHandler.java:193)
> at 
> org.apache.ignite.internal.processors.cache.database.tree.util.PageHandler.writePage(PageHandler.java:242)
> at 
> org.apache.ignite.internal.processors.cache.database.tree.util.PageHandler.writePage(PageHandler.java:119)
> at 
> org.apache.ignite.internal.processors.cache.database.tree.BPlusTree$Remove.doRemoveFromLeaf(BPlusTree.java:2886)
> at 
> org.apache.ignite.internal.processors.cache.database.tree.BPlusTree$Remove.removeFromLeaf(BPlusTree.java:2865)
> at 
> org.apache.ignite.internal.processors.cache.database.tree.BPlusTree$Remove.access$6900(BPlusTree.java:2515)
> at 
> org.apache.ignite.internal.processors.cache.database.tree.BPlusTree.removeDown(BPlusTree.java:1607)
> at 
> org.apache.ignite.internal.processors.cache.database.tree.BPlusTree.removeDown(BPlusTree.java:1574)
> at 
> 

[jira] [Created] (IGNITE-7081) Increase in partition count silently breaks persistent cache reads and writes

2017-11-30 Thread Joel Lang (JIRA)
Joel Lang created IGNITE-7081:
-

 Summary: Increase in partition count silently breaks persistent 
cache reads and writes
 Key: IGNITE-7081
 URL: https://issues.apache.org/jira/browse/IGNITE-7081
 Project: Ignite
  Issue Type: Bug
  Components: persistence
Affects Versions: 2.3
Reporter: Joel Lang
Priority: Minor


Increasing the partition count of a cache, done to even out distribution between 
nodes, led to inconsistent behavior in the cache.

Gets on known keys would return null because the keys now map to different 
partition numbers, even though SQL queries would still find the same cache 
entries.

Removals of these cache entries using SQL would also fail.

It took several hours to track down the issue because of the inconsistency 
between the results of SQL queries and calls to get().

Changing the partition count was not an issue before we used native 
persistence, but now it is.

I believe the solution is a stricter verification of the stored cache 
configuration against the live cache configuration when the cache starts: 
startup should fail if a configuration change would cause problems like this. 
It also raises the question of which other changes are safe to make to the 
configuration of a persistent cache.
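
As a rough illustration (not part of the original report; the cache name and 
partition counts are made up), this is the kind of configuration change that 
triggered the problem. The partition count lives in the affinity function, and 
with native persistence the count used when the cache was first created is 
effectively baked into the stored data:

{code:java}
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

public class PartitionCountChange {
    // Hypothetical cache; only the partition count differs between runs.
    static CacheConfiguration<Long, String> cacheConfig(int partitions) {
        CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("myCache");
        // false = don't exclude neighbors; partitions = total partition count.
        cfg.setAffinity(new RendezvousAffinityFunction(false, partitions));
        return cfg;
    }

    // First run: cacheConfig(1024) creates and persists the cache.
    // Later run: cacheConfig(4096) restarts it with a different count, so keys
    // now hash to partitions other than the ones their entries were persisted
    // under; get() misses while SQL indexes still return the rows.
}
{code}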



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (IGNITE-6981) System thread pool continuously creates and destroys threads while idle

2017-11-21 Thread Joel Lang (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-6981:
--
Description: 
I've observed using VisualVM that Ignite is continuously starting and stopping 
system pool threads even while the pool is idle.

I've attached a screenshot. Notice the high thread number on the left.

!threads.jpg|thumbnail!

  was:
I've observed using VisualVM that Ignite is continuously starting and stopping 
system pool threads even while the pool is idle.

I've attached a screenshot. Notice the high thread number on the left.


> System thread pool continuously creates and destroys threads while idle
> ---
>
> Key: IGNITE-6981
> URL: https://issues.apache.org/jira/browse/IGNITE-6981
> Project: Ignite
>  Issue Type: Bug
>  Components: 2.3, general
>Affects Versions: 2.3
>Reporter: Joel Lang
>Priority: Minor
> Attachments: threads.jpg
>
>
> I've observed using VisualVM that Ignite is continuously starting and 
> stopping system pool threads even while the pool is idle.
> I've attached a screenshot. Notice the high thread number on the left.
> !threads.jpg|thumbnail!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (IGNITE-6981) System thread pool continuously creates and destroys threads while idle

2017-11-21 Thread Joel Lang (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-6981:
--
Description: 
I've observed using VisualVM that Ignite is continuously starting and stopping 
system pool threads even while the pool is idle.

I've attached a screenshot. Notice the high thread number on the left.

  was:
I've observed using VisualVM that Ignite is continuously starting and stopping 
system pool threads even while the pool is idle.

I've attached a screenshot. Notice the high thread number on the left.

!threads.jpg|thumbnail!


> System thread pool continuously creates and destroys threads while idle
> ---
>
> Key: IGNITE-6981
> URL: https://issues.apache.org/jira/browse/IGNITE-6981
> Project: Ignite
>  Issue Type: Bug
>  Components: 2.3, general
>Affects Versions: 2.3
>Reporter: Joel Lang
>Priority: Minor
> Attachments: threads.jpg
>
>
> I've observed using VisualVM that Ignite is continuously starting and 
> stopping system pool threads even while the pool is idle.
> I've attached a screenshot. Notice the high thread number on the left.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (IGNITE-6981) System thread pool continuously creates and destroys threads while idle

2017-11-21 Thread Joel Lang (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-6981:
--
Attachment: threads.jpg

> System thread pool continuously creates and destroys threads while idle
> ---
>
> Key: IGNITE-6981
> URL: https://issues.apache.org/jira/browse/IGNITE-6981
> Project: Ignite
>  Issue Type: Bug
>  Components: 2.3, general
>Affects Versions: 2.3
>Reporter: Joel Lang
>Priority: Minor
> Attachments: threads.jpg
>
>
> I've observed using VisualVM that Ignite is continuously starting and 
> stopping system pool threads even while the pool is idle.
> I've attached a screenshot. Notice the high thread number on the left.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (IGNITE-6981) System thread pool continuously creates and destroys threads while idle

2017-11-21 Thread Joel Lang (JIRA)
Joel Lang created IGNITE-6981:
-

 Summary: System thread pool continuously creates and destroys 
threads while idle
 Key: IGNITE-6981
 URL: https://issues.apache.org/jira/browse/IGNITE-6981
 Project: Ignite
  Issue Type: Bug
  Components: 2.3, general
Affects Versions: 2.3
Reporter: Joel Lang
Priority: Minor


I've observed using VisualVM that Ignite is continuously starting and stopping 
system pool threads even while the pool is idle.

I've attached a screenshot. Notice the high thread number on the left.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (IGNITE-6530) Supposed deadlock with persistent store

2017-09-29 Thread Joel Lang (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186139#comment-16186139
 ] 

Joel Lang commented on IGNITE-6530:
---

I think I've tracked the issue down to multiple threads attempting to remove 
the same set of entries from the cache at the same time, using removeAll() 
with a HashSet.

A comment here said that this could lead to a deadlock: 
https://stackoverflow.com/questions/45028962/possible-starvation-in-striped-pool-with-deadlock-true-apache-ignite

I made a possible fix by ensuring the operation can only happen on one thread 
at a time (as it should have been) and by passing a TreeSet to removeAll() 
instead of a HashSet. I have yet to verify that this fix works, though.

I just want to clarify whether this possible deadlock is documented somewhere 
and I missed it, or whether this is unintended behavior. I assume any deadlock 
situation should not be allowed. If removeAll() and other batch operations 
require sorted keys to be safe, I think the methods should require a SortedSet 
instance.
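
For reference, a minimal sketch of the workaround described above (this is 
illustrative, not the actual project code; the cache name and key type are 
made up). Passing the keys in a deterministic order means every concurrent 
batch removal acquires entry locks in the same order:

{code:java}
import java.util.Collection;
import java.util.TreeSet;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;

public class OrderedBatchRemove {
    private final IgniteCache<Long, byte[]> cache;

    public OrderedBatchRemove(Ignite ignite) {
        this.cache = ignite.cache("myCache"); // hypothetical cache name
    }

    // A TreeSet iterates keys in sorted order, so two threads removing
    // overlapping key sets lock the entries in the same order and cannot
    // deadlock the way a HashSet's arbitrary iteration order can.
    public void removeOrdered(Collection<Long> keys) {
        cache.removeAll(new TreeSet<>(keys));
    }
}
{code}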

> Supposed deadlock with persistent store
> ---
>
> Key: IGNITE-6530
> URL: https://issues.apache.org/jira/browse/IGNITE-6530
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Affects Versions: 2.2
>Reporter: Joel Lang
>
> Just started receiving a large number of warnings about possible starvation 
> in the striped pool.
> The stack trace shows it happens in checkpointReadLock() in 
> GridCacheDatabaseSharedManager.
> Here are the log messages:
> {noformat}
> 2017-09-28 12:08:09 [grid-timeout-worker-#15%mbe%] WARN  
> o.apache.ignite.internal.diagnostic - Found long running cache future 
> [startTime=12:05:12.016, curTime=12:08:09.947, fut=GridNearAtomicUpdateFuture 
> [mappings={922dc862-feed-4245-a014-fde00f21eac1=Primary 
> [id=922dc862-feed-4245-a014-fde00f21eac1, opRes=true, expCnt=-1, rcvdCnt=0, 
> primaryRes=true, done=true, waitFor=null, rcvd=null], 
> 592ff27f-e7d2-45c0-8415-4cca0ce8e11e=Primary 
> [id=592ff27f-e7d2-45c0-8415-4cca0ce8e11e, opRes=false, expCnt=-1, rcvdCnt=0, 
> primaryRes=false, done=false, waitFor=null, rcvd=null]}, remapKeys=null, 
> singleReq=null, resCnt=1, super=GridNearAtomicAbstractUpdateFuture 
> [remapCnt=100, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=2], 
> remapTopVer=null, err=null, futId=655361, super=GridFutureAdapter 
> [ignoreInterrupts=false, state=INIT, res=null, hash=1699524044
> 2017-09-28 12:08:09 [grid-timeout-worker-#15%mbe%] WARN  
> o.apache.ignite.internal.diagnostic - Found long running cache future 
> [startTime=12:05:12.054, curTime=12:08:09.947, fut=GridNearAtomicUpdateFuture 
> [mappings={592ff27f-e7d2-45c0-8415-4cca0ce8e11e=Primary 
> [id=592ff27f-e7d2-45c0-8415-4cca0ce8e11e, opRes=false, expCnt=-1, rcvdCnt=0, 
> primaryRes=false, done=false, waitFor=null, rcvd=null], 
> 922dc862-feed-4245-a014-fde00f21eac1=Primary 
> [id=922dc862-feed-4245-a014-fde00f21eac1, opRes=true, expCnt=-1, rcvdCnt=0, 
> primaryRes=true, done=true, waitFor=null, rcvd=null]}, remapKeys=null, 
> singleReq=null, resCnt=1, super=GridNearAtomicAbstractUpdateFuture 
> [remapCnt=100, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=2], 
> remapTopVer=null, err=null, futId=671745, super=GridFutureAdapter 
> [ignoreInterrupts=false, state=INIT, res=null, hash=1175088539
> 2017-09-28 12:08:10 [grid-timeout-worker-#15%mbe%] WARN  
> o.a.ignite.internal.util.typedef.G - >>> Possible starvation in striped pool.
> Thread name: sys-stripe-0-#1%mbe%
> Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, 
> topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
> msg=GridNearAtomicSingleUpdateRequest [key=BinaryObjectImpl [arr= true, 
> ctx=false, start=0], parent=GridNearAtomicAbstractUpdateRequest [res=null, 
> flags=, Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, 
> topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
> msg=GridNearAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=0, 
> val=null, hasValBytes=true], parent=GridNearAtomicAbstractUpdateRequest 
> [res=null, flags=, Message closure [msg=GridIoMessage [plc=2, 
> topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
> msg=GridNearAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=0, 
> val=null, hasValBytes=true], parent=GridNearAtomicAbstractUpdateRequest 
> [res=null, flags=]
> Deadlock: true
> Completed: 3174
> Thread [name="sys-stripe-0-#1%mbe%", id=15, state=WAITING, blockCnt=4, 
> waitCnt=5789]
> Lock 
> [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2e1993dd,
>  ownerName=null, ownerId=-1]
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(Unknown Source)
> at 
> 

[jira] [Updated] (IGNITE-6530) Supposed deadlock with persistent store

2017-09-28 Thread Joel Lang (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-6530:
--
Summary: Supposed deadlock with persistent store  (was: Deadlock in 
checkpointReadLock method in GridCacheDatabaseSharedManager)

> Supposed deadlock with persistent store
> ---
>
> Key: IGNITE-6530
> URL: https://issues.apache.org/jira/browse/IGNITE-6530
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Affects Versions: 2.2
>Reporter: Joel Lang
>
> Just started receiving a large number of warnings about possible starvation 
> in the striped pool.
> The stack trace shows it happens in checkpointReadLock() in 
> GridCacheDatabaseSharedManager.
> Here are the log messages:
> {noformat}
> 2017-09-28 12:08:09 [grid-timeout-worker-#15%mbe%] WARN  
> o.apache.ignite.internal.diagnostic - Found long running cache future 
> [startTime=12:05:12.016, curTime=12:08:09.947, fut=GridNearAtomicUpdateFuture 
> [mappings={922dc862-feed-4245-a014-fde00f21eac1=Primary 
> [id=922dc862-feed-4245-a014-fde00f21eac1, opRes=true, expCnt=-1, rcvdCnt=0, 
> primaryRes=true, done=true, waitFor=null, rcvd=null], 
> 592ff27f-e7d2-45c0-8415-4cca0ce8e11e=Primary 
> [id=592ff27f-e7d2-45c0-8415-4cca0ce8e11e, opRes=false, expCnt=-1, rcvdCnt=0, 
> primaryRes=false, done=false, waitFor=null, rcvd=null]}, remapKeys=null, 
> singleReq=null, resCnt=1, super=GridNearAtomicAbstractUpdateFuture 
> [remapCnt=100, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=2], 
> remapTopVer=null, err=null, futId=655361, super=GridFutureAdapter 
> [ignoreInterrupts=false, state=INIT, res=null, hash=1699524044
> 2017-09-28 12:08:09 [grid-timeout-worker-#15%mbe%] WARN  
> o.apache.ignite.internal.diagnostic - Found long running cache future 
> [startTime=12:05:12.054, curTime=12:08:09.947, fut=GridNearAtomicUpdateFuture 
> [mappings={592ff27f-e7d2-45c0-8415-4cca0ce8e11e=Primary 
> [id=592ff27f-e7d2-45c0-8415-4cca0ce8e11e, opRes=false, expCnt=-1, rcvdCnt=0, 
> primaryRes=false, done=false, waitFor=null, rcvd=null], 
> 922dc862-feed-4245-a014-fde00f21eac1=Primary 
> [id=922dc862-feed-4245-a014-fde00f21eac1, opRes=true, expCnt=-1, rcvdCnt=0, 
> primaryRes=true, done=true, waitFor=null, rcvd=null]}, remapKeys=null, 
> singleReq=null, resCnt=1, super=GridNearAtomicAbstractUpdateFuture 
> [remapCnt=100, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=2], 
> remapTopVer=null, err=null, futId=671745, super=GridFutureAdapter 
> [ignoreInterrupts=false, state=INIT, res=null, hash=1175088539
> 2017-09-28 12:08:10 [grid-timeout-worker-#15%mbe%] WARN  
> o.a.ignite.internal.util.typedef.G - >>> Possible starvation in striped pool.
> Thread name: sys-stripe-0-#1%mbe%
> Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, 
> topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
> msg=GridNearAtomicSingleUpdateRequest [key=BinaryObjectImpl [arr= true, 
> ctx=false, start=0], parent=GridNearAtomicAbstractUpdateRequest [res=null, 
> flags=, Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, 
> topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
> msg=GridNearAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=0, 
> val=null, hasValBytes=true], parent=GridNearAtomicAbstractUpdateRequest 
> [res=null, flags=, Message closure [msg=GridIoMessage [plc=2, 
> topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
> msg=GridNearAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=0, 
> val=null, hasValBytes=true], parent=GridNearAtomicAbstractUpdateRequest 
> [res=null, flags=]
> Deadlock: true
> Completed: 3174
> Thread [name="sys-stripe-0-#1%mbe%", id=15, state=WAITING, blockCnt=4, 
> waitCnt=5789]
> Lock 
> [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2e1993dd,
>  ownerName=null, ownerId=-1]
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(Unknown Source)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown
>  Source)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(Unknown 
> Source)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(Unknown 
> Source)
> at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(Unknown 
> Source)
> at 
> o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:847)
> at 
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1770)
> at 
> 

[jira] [Updated] (IGNITE-6530) Deadlock in checkpointReadLock method in GridCacheDatabaseSharedManager

2017-09-28 Thread Joel Lang (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-6530:
--
Description: 
Just started receiving a large number of warnings about possible starvation in 
the striped pool.

The stack trace shows it happens in checkpointReadLock() in 
GridCacheDatabaseSharedManager.

Here are the log messages:


{noformat}
2017-09-28 12:08:09 [grid-timeout-worker-#15%mbe%] WARN  
o.apache.ignite.internal.diagnostic - Found long running cache future 
[startTime=12:05:12.016, curTime=12:08:09.947, fut=GridNearAtomicUpdateFuture 
[mappings={922dc862-feed-4245-a014-fde00f21eac1=Primary 
[id=922dc862-feed-4245-a014-fde00f21eac1, opRes=true, expCnt=-1, rcvdCnt=0, 
primaryRes=true, done=true, waitFor=null, rcvd=null], 
592ff27f-e7d2-45c0-8415-4cca0ce8e11e=Primary 
[id=592ff27f-e7d2-45c0-8415-4cca0ce8e11e, opRes=false, expCnt=-1, rcvdCnt=0, 
primaryRes=false, done=false, waitFor=null, rcvd=null]}, remapKeys=null, 
singleReq=null, resCnt=1, super=GridNearAtomicAbstractUpdateFuture 
[remapCnt=100, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=2], 
remapTopVer=null, err=null, futId=655361, super=GridFutureAdapter 
[ignoreInterrupts=false, state=INIT, res=null, hash=1699524044
2017-09-28 12:08:09 [grid-timeout-worker-#15%mbe%] WARN  
o.apache.ignite.internal.diagnostic - Found long running cache future 
[startTime=12:05:12.054, curTime=12:08:09.947, fut=GridNearAtomicUpdateFuture 
[mappings={592ff27f-e7d2-45c0-8415-4cca0ce8e11e=Primary 
[id=592ff27f-e7d2-45c0-8415-4cca0ce8e11e, opRes=false, expCnt=-1, rcvdCnt=0, 
primaryRes=false, done=false, waitFor=null, rcvd=null], 
922dc862-feed-4245-a014-fde00f21eac1=Primary 
[id=922dc862-feed-4245-a014-fde00f21eac1, opRes=true, expCnt=-1, rcvdCnt=0, 
primaryRes=true, done=true, waitFor=null, rcvd=null]}, remapKeys=null, 
singleReq=null, resCnt=1, super=GridNearAtomicAbstractUpdateFuture 
[remapCnt=100, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=2], 
remapTopVer=null, err=null, futId=671745, super=GridFutureAdapter 
[ignoreInterrupts=false, state=INIT, res=null, hash=1175088539
2017-09-28 12:08:10 [grid-timeout-worker-#15%mbe%] WARN  
o.a.ignite.internal.util.typedef.G - >>> Possible starvation in striped pool.
Thread name: sys-stripe-0-#1%mbe%
Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, 
topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
msg=GridNearAtomicSingleUpdateRequest [key=BinaryObjectImpl [arr= true, 
ctx=false, start=0], parent=GridNearAtomicAbstractUpdateRequest [res=null, 
flags=, Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, 
topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
msg=GridNearAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=0, 
val=null, hasValBytes=true], parent=GridNearAtomicAbstractUpdateRequest 
[res=null, flags=, Message closure [msg=GridIoMessage [plc=2, 
topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
msg=GridNearAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=0, 
val=null, hasValBytes=true], parent=GridNearAtomicAbstractUpdateRequest 
[res=null, flags=]
Deadlock: true
Completed: 3174
Thread [name="sys-stripe-0-#1%mbe%", id=15, state=WAITING, blockCnt=4, 
waitCnt=5789]
Lock 
[object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2e1993dd, 
ownerName=null, ownerId=-1]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown
 Source)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(Unknown 
Source)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(Unknown 
Source)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(Unknown Source)
at 
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:847)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1770)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1686)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3063)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:129)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:265)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:260)
at 
o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1042)
at 

[jira] [Created] (IGNITE-6530) Deadlock in checkpointReadLock method in GridCacheDatabaseSharedManager

2017-09-28 Thread Joel Lang (JIRA)
Joel Lang created IGNITE-6530:
-

 Summary: Deadlock in checkpointReadLock method in 
GridCacheDatabaseSharedManager
 Key: IGNITE-6530
 URL: https://issues.apache.org/jira/browse/IGNITE-6530
 Project: Ignite
  Issue Type: Bug
  Components: persistence
Affects Versions: 2.2
Reporter: Joel Lang


Just started receiving a large number of warnings about possible starvation in 
the striped pool.

The stack trace shows it happens in checkpointReadLock() in 
GridCacheDatabaseSharedManager.

Here are the log messages:

{noformat}
2017-09-28 13:15:12 [grid-timeout-worker-#15%mbe%] WARN  
o.a.ignite.internal.util.typedef.G - >>> Possible starvation in striped pool.
Thread name: sys-stripe-4-#5%mbe%
Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, 
topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
msg=GridNearAtomicSingleUpdateRequest [key=BinaryObjectImpl [arr= true, 
ctx=false, start=0], parent=GridNearAtomicAbstractUpdateRequest [res=null, 
flags=]
Deadlock: true
Completed: 3212
Thread [name="sys-stripe-4-#5%mbe%", id=19, state=WAITING, blockCnt=12, 
waitCnt=5835]
Lock 
[object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2e1993dd, 
ownerName=null, ownerId=-1]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown
 Source)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(Unknown 
Source)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(Unknown 
Source)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(Unknown Source)
at 
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:847)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1770)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1686)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3063)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:129)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:265)
at 
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:260)
at 
o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1042)
at 
o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:561)
at 
o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:378)
at 
o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:304)
at 
o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:99)
at 
o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:293)
at 
o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at 
o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
at 
o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:126)
at 
o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1097)
at o.a.i.i.util.StripedExecutor$Stripe.run(StripedExecutor.java:483)
at java.lang.Thread.run(Unknown Source)

2017-09-28 13:15:12 [grid-timeout-worker-#15%mbe%] WARN  
o.a.ignite.internal.util.typedef.G - >>> Possible starvation in striped pool.
Thread name: sys-stripe-5-#6%mbe%
Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, 
topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, 
msg=GridNearAtomicSingleUpdateRequest [key=BinaryObjectImpl [arr= true, 
ctx=false, start=0], parent=GridNearAtomicAbstractUpdateRequest [res=null, 
flags=]
Deadlock: true
Completed: 3524
Thread [name="sys-stripe-5-#6%mbe%", id=20, state=WAITING, blockCnt=3, 
waitCnt=6730]
Lock 
[object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2e1993dd, 
ownerName=null, ownerId=-1]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown
 Source)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(Unknown 
Source)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(Unknown 
Source)
 

[jira] [Updated] (IGNITE-6506) Cluster activation hangs if a node was stopped during persistent storage checkpoint

2017-09-26 Thread Joel Lang (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-6506:
--
Description: 
I have a cluster with two nodes: A and B.

On startup, nodes A and B wait for each other to be connected, and then node A 
will attempt to activate the cluster.

While testing high availability, we found that if a node is stopped during a 
persistent store checkpoint, we cannot activate the cluster on the next startup 
without deleting the persistent storage directory. Specifically, if node A is 
stopped during checkpointing, then on the next startup it encounters several 
exceptions during activation and hangs without completing it.
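
As a minimal sketch of the startup sequence described above (assumptions: two 
server nodes, node A is the one that activates; the instance name and expected 
topology size are made up), node A waits for the second node and then calls 
active(true), which is the step that hangs in this scenario:

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class NodeAStartup {
    public static void main(String[] args) throws InterruptedException {
        Ignite ignite = Ignition.start(
            new IgniteConfiguration().setIgniteInstanceName("mbe"));

        // Wait until node B has joined (expected topology size: 2).
        while (ignite.cluster().forServers().nodes().size() < 2)
            Thread.sleep(500);

        // Activation restores persistent memory state and finishes any
        // interrupted checkpoint; this is where activation hangs here.
        ignite.cluster().active(true);
    }
}
{code}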

Here is the log.


{noformat}
2017-09-26 12:11:24 [tcp-disco-msg-worker-#2%mbe%] INFO  
o.a.i.i.p.c.GridClusterStateProcessor - Start state transition: true
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.ignite.internal.exchange.time - Started exchange init 
[topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1], crd=true, evt=18, 
node=TcpDiscoveryNode [id=62cf0ccb-e376-4b80-8d2d-98115c3a2990, 
addrs=[10.5.17.19, 127.0.0.1], 
sockAddrs=[shouvdevmbe02.petrolink.net/10.5.17.19:47510, /127.0.0.1:47510], 
discPort=47510, order=1, intOrder=1, lastExchangeTime=1506445884063, loc=true, 
ver=2.2.0#20170915-sha1:5747ce6b, isClient=false], evtNode=TcpDiscoveryNode 
[id=62cf0ccb-e376-4b80-8d2d-98115c3a2990, addrs=[10.5.17.19, 127.0.0.1], 
sockAddrs=[shouvdevmbe02.petrolink.net/10.5.17.19:47510, /127.0.0.1:47510], 
discPort=47510, order=1, intOrder=1, lastExchangeTime=1506445884063, loc=true, 
ver=2.2.0#20170915-sha1:5747ce6b, isClient=false], 
customEvt=ChangeGlobalStateMessage 
[id=1d0cb2fbe51-7967bd11-40aa-40fe-b0a6-c43302cd4ee7, 
reqId=f7155dea-fede-4340-b244-7a3b65f167a8, 
initiatingNodeId=62cf0ccb-e376-4b80-8d2d-98115c3a2990, activate=true]]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Start activation process 
[nodeId=62cf0ccb-e376-4b80-8d2d-98115c3a2990, client=false, 
topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1]]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.f.FilePageStoreManager - Resolved page store work directory: 
/opt/mbe1/ignite/db/mbe_MBE1
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.w.FileWriteAheadLogManager - Resolved write ahead log work 
directory: /opt/mbe1/ignite/db/wal/mbe_MBE1
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.w.FileWriteAheadLogManager - Resolved write ahead log archive 
directory: /opt/mbe1/ignite/db/wal/archive/mbe_MBE1
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] WARN  
o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - No user-defined default 
MemoryPolicy found; system default of 1GB size will be used.
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.pagemem.PageMemoryImpl - Started page memory 
[memoryAllocated=100.0 MiB, pages=48592, tableSize=2.9 MiB, 
checkpointBuffer=819.4 MiB]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.pagemem.PageMemoryImpl - Started page memory [memoryAllocated=3.1 
GiB, pages=1544064, tableSize=91.0 MiB, checkpointBuffer=819.4 MiB]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Read checkpoint status: start 
marker = 
/opt/mbe1/ignite/db/mbe_MBE1/cp/1506444061104-38b80aaa-8c3d-4572-a42e-5b7a3b472505-START.bin,
 end marker = 
/opt/mbe1/ignite/db/mbe_MBE1/cp/1506442980839-ff65a0dc-3d83-436a-8329-7b3a31fe5ffc-END.bin
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Checking memory state 
[lastValidPos=FileWALPointer [idx=139, fileOffset=31406805, len=20731, 
forceFlush=false], lastMarked=FileWALPointer [idx=0, fileOffset=0, len=0, 
forceFlush=false], lastCheckpointId=38b80aaa-8c3d-4572-a42e-5b7a3b472505]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] WARN  
o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Ignite node stopped in the 
middle of checkpoint. Will restore memory state and finish checkpoint on node 
start.
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] ERROR 
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Failed to activate node 
components [nodeId=62cf0ccb-e376-4b80-8d2d-98115c3a2990, client=false, 
topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1]]
java.lang.ArrayIndexOutOfBoundsException: -1
at 
org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:82)
at 
org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:92)
at 
org.apache.ignite.internal.pagemem.wal.record.delta.DataPageInsertRecord.applyDelta(DataPageInsertRecord.java:57)
at 

[jira] [Updated] (IGNITE-6506) Cluster activation hangs if a node was stopped during persistent storage checkpoint

2017-09-26 Thread Joel Lang (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Lang updated IGNITE-6506:
--
Summary: Cluster activation hangs if a node was stopped during persistent 
storage checkpoint  (was: Unable to activate cluster node if it was stopped 
during persistent storage checkpoint)

> Cluster activation hangs if a node was stopped during persistent storage 
> checkpoint
> ---
>
> Key: IGNITE-6506
> URL: https://issues.apache.org/jira/browse/IGNITE-6506
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Affects Versions: 2.2
>Reporter: Joel Lang
>Priority: Critical
>
> I have a cluster with two nodes: A and B.
> On startup, nodes A and B wait for each other to be connected, and then node 
> A will attempt to activate the cluster.
> While testing high availability, we found that if a node is stopped during a 
> persistent store checkpoint, we cannot activate the cluster on the next 
> startup without deleting the persistent storage directory. Specifically, if 
> node A is stopped during checkpointing, then on the next startup it 
> encounters several exceptions during activation and hangs without completing 
> it.
> Here is the log.
> {noformat}
> 2017-09-26 12:11:24 [tcp-disco-msg-worker-#2%mbe%] INFO  
> o.a.i.i.p.c.GridClusterStateProcessor - Start state transition: true
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
> o.a.ignite.internal.exchange.time - Started exchange init 
> [topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1], crd=true, evt=18, 
> node=TcpDiscoveryNode [id=62cf0ccb-e376-4b80-8d2d-98115c3a2990, 
> addrs=[10.5.17.19, 127.0.0.1], 
> sockAddrs=[shouvdevmbe02.petrolink.net/10.5.17.19:47510, /127.0.0.1:47510], 
> discPort=47510, order=1, intOrder=1, lastExchangeTime=1506445884063, 
> loc=true, ver=2.2.0#20170915-sha1:5747ce6b, isClient=false], 
> evtNode=TcpDiscoveryNode [id=62cf0ccb-e376-4b80-8d2d-98115c3a2990, 
> addrs=[10.5.17.19, 127.0.0.1], 
> sockAddrs=[shouvdevmbe02.petrolink.net/10.5.17.19:47510, /127.0.0.1:47510], 
> discPort=47510, order=1, intOrder=1, lastExchangeTime=1506445884063, 
> loc=true, ver=2.2.0#20170915-sha1:5747ce6b, isClient=false], 
> customEvt=ChangeGlobalStateMessage 
> [id=1d0cb2fbe51-7967bd11-40aa-40fe-b0a6-c43302cd4ee7, 
> reqId=f7155dea-fede-4340-b244-7a3b65f167a8, 
> initiatingNodeId=62cf0ccb-e376-4b80-8d2d-98115c3a2990, activate=true]]
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Start activation process 
> [nodeId=62cf0ccb-e376-4b80-8d2d-98115c3a2990, client=false, 
> topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1]]
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
> o.a.i.i.p.c.p.f.FilePageStoreManager - Resolved page store work directory: 
> /opt/mbe1/ignite/db/mbe_MBE1
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
> o.a.i.i.p.c.p.w.FileWriteAheadLogManager - Resolved write ahead log work 
> directory: /opt/mbe1/ignite/db/wal/mbe_MBE1
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
> o.a.i.i.p.c.p.w.FileWriteAheadLogManager - Resolved write ahead log archive 
> directory: /opt/mbe1/ignite/db/wal/archive/mbe_MBE1
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] WARN  
> o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - No user-defined default 
> MemoryPolicy found; system default of 1GB size will be used.
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
> o.a.i.i.p.c.p.pagemem.PageMemoryImpl - Started page memory 
> [memoryAllocated=100.0 MiB, pages=48592, tableSize=2.9 MiB, 
> checkpointBuffer=819.4 MiB]
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
> o.a.i.i.p.c.p.pagemem.PageMemoryImpl - Started page memory 
> [memoryAllocated=3.1 GiB, pages=1544064, tableSize=91.0 MiB, 
> checkpointBuffer=819.4 MiB]
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
> o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Read checkpoint status: start 
> marker = 
> /opt/mbe1/ignite/db/mbe_MBE1/cp/1506444061104-38b80aaa-8c3d-4572-a42e-5b7a3b472505-START.bin,
>  end marker = 
> /opt/mbe1/ignite/db/mbe_MBE1/cp/1506442980839-ff65a0dc-3d83-436a-8329-7b3a31fe5ffc-END.bin
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
> o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Checking memory state 
> [lastValidPos=FileWALPointer [idx=139, fileOffset=31406805, len=20731, 
> forceFlush=false], lastMarked=FileWALPointer [idx=0, fileOffset=0, len=0, 
> forceFlush=false], lastCheckpointId=38b80aaa-8c3d-4572-a42e-5b7a3b472505]
> 2017-09-26 12:11:24 [exchange-worker-#34%mbe%] WARN  
> o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Ignite node stopped in the 
> middle of checkpoint. Will restore memory state and finish checkpoint on node 
> start.
> 2017-09-26 

[jira] [Created] (IGNITE-6506) Unable to activate cluster node if it was stopped during persistent storage checkpoint

2017-09-26 Thread Joel Lang (JIRA)
Joel Lang created IGNITE-6506:
-

 Summary: Unable to activate cluster node if it was stopped during 
persistent storage checkpoint
 Key: IGNITE-6506
 URL: https://issues.apache.org/jira/browse/IGNITE-6506
 Project: Ignite
  Issue Type: Bug
  Components: persistence
Affects Versions: 2.2
Reporter: Joel Lang
Priority: Critical


I have a cluster with two nodes: A and B.

On startup, nodes A and B wait for each other to be connected, and then node A 
will attempt to activate the cluster.

While testing high availability, we found that if a node is stopped during a 
persistent store checkpoint, we cannot activate the cluster on the next startup 
without deleting the persistent storage directory. Specifically, if node A is 
stopped during checkpointing, then on the next startup it encounters several 
exceptions during activation and hangs without completing it.

Here is the log.


{noformat}
2017-09-26 12:11:24 [tcp-disco-msg-worker-#2%mbe%] INFO  
o.a.i.i.p.c.GridClusterStateProcessor - Start state transition: true
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.ignite.internal.exchange.time - Started exchange init 
[topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1], crd=true, evt=18, 
node=TcpDiscoveryNode [id=62cf0ccb-e376-4b80-8d2d-98115c3a2990, 
addrs=[10.5.17.19, 127.0.0.1], 
sockAddrs=[shouvdevmbe02.petrolink.net/10.5.17.19:47510, /127.0.0.1:47510], 
discPort=47510, order=1, intOrder=1, lastExchangeTime=1506445884063, loc=true, 
ver=2.2.0#20170915-sha1:5747ce6b, isClient=false], evtNode=TcpDiscoveryNode 
[id=62cf0ccb-e376-4b80-8d2d-98115c3a2990, addrs=[10.5.17.19, 127.0.0.1], 
sockAddrs=[shouvdevmbe02.petrolink.net/10.5.17.19:47510, /127.0.0.1:47510], 
discPort=47510, order=1, intOrder=1, lastExchangeTime=1506445884063, loc=true, 
ver=2.2.0#20170915-sha1:5747ce6b, isClient=false], 
customEvt=ChangeGlobalStateMessage 
[id=1d0cb2fbe51-7967bd11-40aa-40fe-b0a6-c43302cd4ee7, 
reqId=f7155dea-fede-4340-b244-7a3b65f167a8, 
initiatingNodeId=62cf0ccb-e376-4b80-8d2d-98115c3a2990, activate=true]]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Start activation process 
[nodeId=62cf0ccb-e376-4b80-8d2d-98115c3a2990, client=false, 
topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1]]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.f.FilePageStoreManager - Resolved page store work directory: 
/opt/mbe1/ignite/db/mbe_MBE1
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.w.FileWriteAheadLogManager - Resolved write ahead log work 
directory: /opt/mbe1/ignite/db/wal/mbe_MBE1
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.w.FileWriteAheadLogManager - Resolved write ahead log archive 
directory: /opt/mbe1/ignite/db/wal/archive/mbe_MBE1
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] WARN  
o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - No user-defined default 
MemoryPolicy found; system default of 1GB size will be used.
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.pagemem.PageMemoryImpl - Started page memory 
[memoryAllocated=100.0 MiB, pages=48592, tableSize=2.9 MiB, 
checkpointBuffer=819.4 MiB]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.pagemem.PageMemoryImpl - Started page memory [memoryAllocated=3.1 
GiB, pages=1544064, tableSize=91.0 MiB, checkpointBuffer=819.4 MiB]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Read checkpoint status: start 
marker = 
/opt/mbe1/ignite/db/mbe_MBE1/cp/1506444061104-38b80aaa-8c3d-4572-a42e-5b7a3b472505-START.bin,
 end marker = 
/opt/mbe1/ignite/db/mbe_MBE1/cp/1506442980839-ff65a0dc-3d83-436a-8329-7b3a31fe5ffc-END.bin
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] INFO  
o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Checking memory state 
[lastValidPos=FileWALPointer [idx=139, fileOffset=31406805, len=20731, 
forceFlush=false], lastMarked=FileWALPointer [idx=0, fileOffset=0, len=0, 
forceFlush=false], lastCheckpointId=38b80aaa-8c3d-4572-a42e-5b7a3b472505]
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] WARN  
o.a.i.i.p.c.p.GridCacheDatabaseSharedManager - Ignite node stopped in the 
middle of checkpoint. Will restore memory state and finish checkpoint on node 
start.
2017-09-26 12:11:24 [exchange-worker-#34%mbe%] ERROR 
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Failed to activate node 
components [nodeId=62cf0ccb-e376-4b80-8d2d-98115c3a2990, client=false, 
topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1]]
java.lang.ArrayIndexOutOfBoundsException: -1
at 
org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:82)
at 
org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:92)
at