Re: Odg: Geode retry/acknowledge improvement

2021-05-05 Thread Alberto Gomez
Please disregard my last e-mail.

I was having a parallel conversation by e-mail with Mario on this topic and 
sent the e-mail to the list by mistake.

BR,

Alberto

Odg: Odg: Geode retry/acknowledge improvement

2021-05-05 Thread Mario Ivanac
I think that this is enough.

Re: Odg: Geode retry/acknowledge improvement

2021-05-05 Thread Alberto Gomez
You could reply to their latest e-mail to confirm that the scenario Darrel 
suspected can happen. Let's see whether, in that case, they are willing to 
collaborate.

Alberto



Odg: Odg: Geode retry/acknowledge improvement

2021-05-05 Thread Mario Ivanac
Hi,

I think we have the problem that Darrel suspected, and that some kind of 
peer-to-peer notification could be sent to acknowledge that a message has been 
received on the receiving side.
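
The acknowledgement idea above could look roughly like the following on the receiving side. This is only a sketch under stated assumptions, not Geode code: `AckingReceiver`, `onMessage`, `hasReceived` and the `ackSender` callback are hypothetical names invented for illustration. The receiver remembers each message id, processes a message only once, and acknowledges every arrival so that a resent message is not applied twice.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongConsumer;

/** Hypothetical receiver-side helper (not a Geode class). */
class AckingReceiver {
    // Ids of all messages that have reached this member.
    private final Set<Long> seenIds = ConcurrentHashMap.newKeySet();
    // Stands in for "send an acknowledgement back to the sending peer".
    private final LongConsumer ackSender;
    private int processedCount = 0;

    AckingReceiver(LongConsumer ackSender) {
        this.ackSender = ackSender;
    }

    /** Runs the work only for the first arrival of an id, but acknowledges every arrival. */
    synchronized void onMessage(long messageId, Runnable work) {
        if (seenIds.add(messageId)) {
            work.run();
            processedCount++;
        }
        ackSender.accept(messageId);
    }

    /** Lets a sender ask whether a message it believes it sent has arrived. */
    boolean hasReceived(long messageId) {
        return seenIds.contains(messageId);
    }

    synchronized int processedCount() {
        return processedCount;
    }
}
```

A real implementation would also need to expire old ids; the sketch keeps them forever for brevity.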

Regarding the test with iptables, execution gets stuck whether conserve-sockets 
is set to false or true.

BR,
Mario



Re: Odg: Geode retry/acknowledge improvement

2021-04-30 Thread Darrel Schneider
In the geode hang you describe, would the forced tcp-reset using iptables have 
caused the put send message to fail with an exception writing it to the socket? 
If so then I'd expect the geode Connection class to keep trying to send that 
message by creating a new connection to the member. It will keep doing this 
until the send is successful or the member leaves the cluster.

But if the tcp-reset allows the send to complete, without actually sending the 
request to the other member, then geode will be in trouble and will wait 
forever for a reply. Once geode successfully writes a p2p message on a socket, 
it expects it to be processed on the other side OR it expects the other side to 
leave the geode cluster. If neither of these happens then it will wait forever 
for a response. I've wondered in the past if this was a safe expectation. If 
not, do we need to send some type of message id so that, after waiting too long 
for a reply, we can check with the member to see whether it has received the 
message we think we already sent?
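
The question above could be sketched on the sending side as a bounded wait followed by a probe. This is a minimal illustration under assumptions, not how Geode works today: `ReplyWaiter`, `Outcome` and the `peerHasReceived` predicate are hypothetical names, and the probe itself (asking the other member whether it has seen a message id) is exactly the mechanism that would have to be added.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.LongPredicate;

/** Hypothetical sender-side helper (not a Geode class). */
class ReplyWaiter {
    enum Outcome { REPLIED, PEER_HAS_MESSAGE, NEEDS_RESEND }

    private final Map<Long, CompletableFuture<Void>> pending = new ConcurrentHashMap<>();

    /** Called before the message is written to the socket. */
    void register(long messageId) {
        pending.put(messageId, new CompletableFuture<>());
    }

    /** Called by whichever thread reads the peer's reply. */
    void onReply(long messageId) {
        CompletableFuture<Void> reply = pending.get(messageId);
        if (reply != null) {
            reply.complete(null);
        }
    }

    /** Waits a bounded time for the reply; on timeout, asks the peer whether it saw the id. */
    Outcome await(long messageId, long timeout, TimeUnit unit, LongPredicate peerHasReceived) {
        CompletableFuture<Void> reply = pending.get(messageId);
        if (reply == null) {
            throw new IllegalArgumentException("unknown message id " + messageId);
        }
        try {
            reply.get(timeout, unit);
            pending.remove(messageId);
            return Outcome.REPLIED;
        } catch (TimeoutException e) {
            // The peer either still has the message in progress, or never got it.
            return peerHasReceived.test(messageId) ? Outcome.PEER_HAS_MESSAGE : Outcome.NEEDS_RESEND;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reply", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

With this shape the caller decides what to do with `NEEDS_RESEND`; a resend is only safe if the receiver ignores ids it has already processed.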

You might see different behavior with your iptables test if you use 
conserve-sockets=false. In that case the socket used to write the p2p message 
is also used to read the response. But in the default conserve-sockets=true 
case, the reply comes on a different socket than the one used to send the 
message. It might be hard to get the thread doing the put for gfsh to use 
conserve-sockets=false. You could try just setting that on your server and the 
stuck thread stack should look different from what you are currently seeing.
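
For reference, the setting discussed above is the `conserve-sockets` member property; in a `gemfire.properties` file it would be the single line `conserve-sockets=false`. The sketch below only builds the equivalent `java.util.Properties` object so that it stays self-contained. The class and method names are made up for illustration, and how the properties reach the server (properties file, system property, or cache factory) depends on how the server is started.

```java
import java.util.Properties;

/** Illustrative helper only; the class and method names are hypothetical. */
class ConserveSocketsConfig {
    /** Builds member properties with conserve-sockets disabled. */
    static Properties memberProperties() {
        Properties props = new Properties();
        props.setProperty("conserve-sockets", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties props = memberProperties();
        // With Geode on the classpath, these properties would be supplied to the
        // member at startup; that step is left out to keep the example standalone.
        System.out.println("conserve-sockets=" + props.getProperty("conserve-sockets"));
    }
}
```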



Re: Odg: Geode retry/acknowledge improvement

2021-04-30 Thread Anthony Baker
Can you explain the scenario further?  Does the sidecar proxy both the sending 
and receiving socket (geode creates 2 sockets for each p2p member)?  In normal 
cases, closing these sockets should clear up any unacknowledged messages, 
freeing up the thread.

Anthony



Odg: Geode retry/acknowledge improvement

2021-04-30 Thread Mario Ivanac
Hi,

Just a reminder about this topic.

BR,
Mario



Odg: Geode retry/acknowledge improvement

2021-04-20 Thread Mario Ivanac
Hi,

After analysis, we assume that the proxy sends a TCP-level ACK when it receives 
the packets, and that the proxy is restarted after that moment.
This is why we don't see TCP retries.

A similar problem (though not packet loss) can be reproduced on Geode if a TCP 
reset is received on an existing connection after a request has been sent. In 
that case, the connection is closed when the reset is received, and the thread 
gets stuck waiting for the reply.
I will add reproduction steps to the ticket.
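
The underlying effect can be shown with plain `java.net` sockets on the loopback interface, independent of Geode and of the iptables reproduction mentioned above. In this sketch (the class name `ResetAfterWriteDemo` is made up), the client's write completes without error even though the other side never reads the request and then resets the connection, so a successful write alone says nothing about whether the request was processed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Standalone illustration with plain sockets; this is not Geode code. */
class ResetAfterWriteDemo {
    /** Returns {writeSucceeded, replyReceived}. */
    static boolean[] run() {
        boolean writeSucceeded = false;
        boolean replyReceived = false;
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            client.setSoTimeout(2000); // bound the wait so the demo cannot hang

            // The write returns once the bytes are handed to the local TCP stack.
            client.getOutputStream().write("put key=value".getBytes(StandardCharsets.UTF_8));
            client.getOutputStream().flush();
            writeSucceeded = true;

            // The peer accepts the connection but closes it without reading anything.
            try (Socket accepted = server.accept()) {
                accepted.setSoLinger(true, 0); // linger 0 makes close() send a reset
            }

            try {
                replyReceived = client.getInputStream().read() != -1;
            } catch (IOException e) {
                // Connection reset or read timeout: no reply will ever arrive.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new boolean[] {writeSucceeded, replyReceived};
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("writeSucceeded=" + result[0] + ", replyReceived=" + result[1]);
    }
}
```

Here the client at least notices the reset because it reads from the same socket it wrote to; a sender that waits for its reply on a different socket would not get even that signal.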


From: Anthony Baker 
Sent: April 19, 2021 22:54
To: dev@geode.apache.org 
Subject: Re: Geode retry/acknowledge improvement

Do you have a tcpdump that demonstrates the packet loss? How long did you wait 
for TCP to retry the failed packet delivery? (Sometimes this can be tweaked with 
tcp_retries2.) Does this manifest as a failed socket connection in geode?  
That ought to trigger some error handling IIRC.

Anthony


> On Apr 19, 2021, at 7:16 AM, Mario Ivanac  wrote:
>
> Hi all,
>
> We have deployed a Geode cluster in a Kubernetes environment, with Istio 
> sidecars injected between the cluster members.
> While running traffic, if any Istio sidecar is restarted, a thread gets 
> stuck indefinitely while waiting for a reply to a sent message.
> It seems that, because the proxy restarts, messages are lost in some cases, 
> and the sending side waits indefinitely for a reply.
>
> https://issues.apache.org/jira/browse/GEODE-9075
>
> My question is: what is your estimate of how much effort/work is needed to 
> implement message retry/acknowledge logic in Geode to solve this problem?
>
> BR,
> Mario