Re: Operation block on Cluster recovery/rebalance.

Denis Magda Wed, 12 Aug 2020 15:29:23 -0700

John,

I don't see any signs of an application-caused deadlock in the thread
dumps. Please elaborate on the following:


> 7- Restart 1st node, run operation, operation fails with
> ClientDisconnectedException but application still able to complete its
> request.


What's the IP address of the server node the client app uses to join the
cluster? If that's not the address of the 1st node, which has already been
restarted, then the client can't join the cluster, and it's expected that
it fails with the ClientDisconnectedException.

> 8- Start 2nd node, run operation, from here on all operations just block.


Are the operations unblocked and completed successfully when the third node
joins the cluster and the cluster gets activated automatically?

-
Denis


On Wed, Aug 12, 2020 at 11:08 AM John Smith <[email protected]> wrote:

> Ok Denis here they are...
>
> 3 nodes, and I captured a YourKit screenshot of what it thinks are deadlocks
> on the client app.
>
> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>
> On Wed, 12 Aug 2020 at 11:07, John Smith <[email protected]> wrote:
>
>> Hi Denis. I will ASAP, but I think you were right: it is the query that
>> blocks.
>>
>> My application first runs a select on the cache and then does a put to the
>> cache.
>>
>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <[email protected]> wrote:
>>
>>> John,
>>>
>>> It sounds like a deadlock caused by the application logic. Is there any
>>> chance that the operation you run on step 8 accesses several keys in one
>>> order while the other operations work with the same keys but in a different
>>> order? Deadlocks are possible when you use the Ignite Transaction API or
>>> simply execute bulk operations such as cache.getAll(..) or
>>> cache.putAll(..).
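
The key-ordering problem described above can be illustrated without Ignite. The sketch below is plain Java under stated assumptions: `OrderedKeyLocks`, `withKeys`, and `runOpposingWorkers` are hypothetical names, and per-key `ReentrantLock`s stand in for whatever locking the cache does internally. Two workers request the same keys in opposite order, but because the keys are always locked in sorted order, neither can hold one lock while waiting for the other.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Deque;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper, not part of the Ignite API.
final class OrderedKeyLocks {
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Locks all keys in sorted order, runs the action, then unlocks in reverse order. */
    void withKeys(Collection<String> keys, Runnable action) {
        Deque<ReentrantLock> held = new ArrayDeque<>();
        try {
            // TreeSet removes duplicates and gives every caller the same key order.
            for (String key : new TreeSet<>(keys)) {
                ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
                lock.lock();
                held.push(lock);
            }
            action.run();
        } finally {
            while (!held.isEmpty()) {
                held.pop().unlock(); // last acquired, first released
            }
        }
    }

    /** Two workers pass the same keys in opposite order; returns true if both finish. */
    static boolean runOpposingWorkers(int rounds) {
        OrderedKeyLocks guard = new OrderedKeyLocks();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        int[] counter = {0}; // only modified while both key locks are held
        try {
            Future<?> a = pool.submit(() -> {
                for (int i = 0; i < rounds; i++) {
                    guard.withKeys(Arrays.asList("k1", "k2"), () -> counter[0]++);
                }
            });
            Future<?> b = pool.submit(() -> {
                for (int i = 0; i < rounds; i++) {
                    guard.withKeys(Arrays.asList("k2", "k1"), () -> counter[0]++);
                }
            });
            a.get(5, TimeUnit.SECONDS);
            b.get(5, TimeUnit.SECONDS);
            return counter[0] == 2 * rounds;
        } catch (Exception e) {
            return false; // timeout, interruption, or a failed worker
        } finally {
            pool.shutdownNow();
        }
    }
}
```

If the sort were dropped and each worker locked keys in the order it received them, the same two workers could block each other indefinitely. For cache bulk operations the analogous precaution is to pass keys in a consistent order, for example via a sorted map.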
>>>
>>> Please take and attach thread dumps from all the cluster nodes for
>>> analysis if we need to dig deeper.
>>>
>>> -
>>> Denis
>>>
>>>
>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <[email protected]>
>>> wrote:
>>>
>>>> Hi Denis, I think you are right. It's the query that blocks; the other
>>>> k/v operations are OK.
>>>>
>>>> Any thoughts on this?
>>>>
>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <[email protected]>
>>>> wrote:
>>>>
>>>>> I tried with 2.8.1, same issue. Operations block indefinitely...
>>>>>
>>>>> 1- Start 3 node cluster
>>>>> 2- Start client application client = true with Ignition.start()
>>>>> 3- Run some cache operations, everything ok...
>>>>> 4- Shut down one node, run operation, still ok
>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>> 6- Shut down 3rd node, run operation, still ok... Operations start
>>>>> failing with ClientDisconnectedException...
>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>> ClientDisconnectedException but application still able to complete its
>>>>> request.
>>>>> 8- Start 2nd node, run operation, from here on all operations just
>>>>> block.
>>>>>
>>>>> Basically, the client application is an HTTP server that performs a
>>>>> cache operation on each HTTP request.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>
>>>>>> The only time I get an exception is if the cluster is completely off;
>>>>>> then I get ClientDisconnectedException...
>>>>>>
>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <[email protected]> wrote:
>>>>>>
>>>>>>> If I'm not mistaken, key-value operations (cache.get/put) and
>>>>>>> compute calls fail with an exception if the cluster is deactivated. Do
>>>>>>> those fail on your end?
>>>>>>>
>>>>>>> As for the async and SQL operations, let's see what other community
>>>>>>> members say.
>>>>>>>
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, any thoughts on this?
>>>>>>>>
>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>
>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>         "select * from my_table")
>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>
>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query)) {
>>>>>>>>>     // process rows
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>
>>>>>>>>> Is there a way to timeout and at least have the application
>>>>>>>>> continue and respond with an appropriate message?
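
One generic way to keep an HTTP handler responsive, independent of anything Ignite itself provides, is to run the blocking call on a helper thread and stop waiting after a deadline. The sketch below is plain Java with no Ignite dependency. `TimeBoxedCall`, its `call` method, and the sleeping lambda standing in for `cache.query(...)` are hypothetical names for illustration only. Note that this frees only the caller: the helper thread may remain blocked until the cluster recovers, so it does not address the underlying cause.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the Ignite API.
final class TimeBoxedCall {
    // Daemon threads, so a call that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "time-boxed-call");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs op on a helper thread. Returns an empty Optional if op does not
     * finish within timeoutMillis (a null result also maps to empty).
     */
    static <T> Optional<T> call(Callable<T> op, long timeoutMillis) {
        Future<T> future = POOL.submit(op);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the helper; the operation may ignore it
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("operation failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // The sleep is a stand-in for a blocking cache call (assumption).
        Optional<String> rows = TimeBoxedCall.<String>call(() -> {
            Thread.sleep(5_000);
            return "rows";
        }, 200);
        System.out.println(rows.isPresent() ? rows.get() : "503: cluster not ready");
    }
}
```

In an HTTP handler, an empty result could be mapped to an error response while the cluster is still recovering. A bounded pool may be preferable to a cached one if many requests can pile up behind blocked calls.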
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, I'm running 2.7.0.
>>>>>>>>>>
>>>>>>>>>> When I reboot a node and it begins to rejoin the cluster, or the
>>>>>>>>>> cluster is not yet activated with a baseline topology, operations
>>>>>>>>>> that are supposed to return an IgniteFuture (e.g. putAsync,
>>>>>>>>>> getAsync, etc.) seem to block forever. They just block until the
>>>>>>>>>> cluster resolves its state.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
