So far, I've been unable to replicate the behavior you describe. I'm using gRPC 1.30.2 here.
What does the test script in https://gist.github.com/lidavidm/80ea120a18cd7a4a5a881de874e2df7e do for you? It sets up a client and server where the server has a very short max connection age but a longer grace period, then has the client make a series of requests that collectively take longer than the max connection age. It also prints gRPC's channel state in between RPCs, so it would fail if the client weren't properly reconnecting after each request. What I observe is that gRPC reconnects after each RPC as expected. However, if you go over the grace period by adjusting the sleep call in the RPC, then you do get an UNAVAILABLE error with GOAWAY/"no_error", so it seems the gRPC client reports the *first* GOAWAY's error message as the error, which may be misleading.

-David

On Thu, Sep 2, 2021, at 19:31, David Li wrote:
> Got it - thanks for the clarification, good to know that Python (C++) is behaving as expected here.
>
>> BTW, I am a bit confused on the expected termination behavior you described. Shouldn't the client terminate the connection after the in-flight RPC is complete, if any, instead of immediately after seeing the first GOAWAY signal?
>
> Right, what I mean is that gRPC/Java presumably shouldn't terminate the call on the first GOAWAY, but should be routing future calls across a new connection. Yet as you're observing, it's instead erroring right away. I'll try to set aside some time soon to dig into this in the Java client.
>
> A stack trace would be helpful, just to make sure I'm looking at the same case you are, if that's not too much trouble.
>
> Best,
> David
>
> On Thu, Sep 2, 2021, at 18:13, Chen Song wrote:
>> Thanks for the reply, David.
>>
>> The use case is that we enabled autoscaling on Kubernetes and expect new pods to be launched during load spikes. As a result, we want the client to be aware of this and connect to the new pods.
>> After research and experimentation, we found that by setting MAX_CONNECTION_AGE to a lower duration, the server will send a GOAWAY signal to the client, and the client will refresh the IP list from the DNS/name resolver and update its connections to the new list of pods. We set MAX_CONNECTION_AGE_GRACE to a very long duration, like a day, to ensure any in-flight RPC can finish. This seems to work for the gRPC Python client in this setting.
>>
>>> but what you're seeing is that existing RPCs are getting interrupted (likely because the client did not terminate the connection)? Does the UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug string "max_age") or a different one? (I'm wondering if the gRPC client is not actually handling this case properly and terminating on the first GOAWAY.)
>>
>> The UNAVAILABLE error indicated the GOAWAY is the first one (with *max_age* as the reason). I can try to pull the stack trace within Flight to see how the exception is propagated up the chain, if you think that would be helpful.
>>
>> BTW, I am a bit confused on the expected termination behavior you described. Shouldn't the client terminate the connection after the in-flight RPC is complete, if any, instead of immediately after seeing the first GOAWAY signal?
>>
>> Best,
>> Chen
>>
>> On Wed, Sep 1, 2021 at 10:38 AM David Li <[email protected]> wrote:
>>> Hi Chen,
>>>
>>> Thanks for bringing this up. Flight doesn't have explicit handling for this case in any implementation. What is the use case for enabling this?
>>>
>>> From what I see, what's desired is:
>>>
>>> - Existing RPCs should be able to finish,
>>> - The connection should terminate,
>>> - gRPC's connection management should automatically reconnect (and likely get re-load-balanced) on the next call
>>>
>>> but what you're seeing is that existing RPCs are getting interrupted (likely because the client did not terminate the connection)?
>>> Does the UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug string "max_age") or a different one? (I'm wondering if the gRPC client is not actually handling this case properly and terminating on the first GOAWAY.)
>>>
>>> If this sounds right, then when I get a chance, I'll do some more investigation & report back/file a JIRA.
>>>
>>> -David
>>>
>>> On 2021/08/31 23:15:54, Chen Song <[email protected]> wrote:
>>> > Recently, we turned on MAX_CONNECTION_AGE
>>> > <https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md#proposal>
>>> > on the gRPC server. My understanding is that the server will send GOAWAY with status code NO_ERROR to the client but won't kill the connection until MAX_CONNECTION_AGE_GRACE is reached. In response, the client can properly handle this and continue with the current stream (without creating a new stream).
>>> >
>>> > We set the former to 30 minutes and the latter to something very long (like a day). However, we started seeing FlightRuntimeException with CallStatus=UNAVAILABLE thrown from the client, wrapping the underlying GOAWAY error. It seems to be thrown from FlightStream.java
>>> > <https://arrow.apache.org/docs/java/reference/org/apache/arrow/flight/FlightStream.html#next-->.
>>> >
>>> > My question is, is this how Flight currently handles this type of error? If so, can it be improved to handle it in a better way (e.g., NO_ERROR meaning the current stream can continue)?
>>> >
>>> > BTW, I haven't dug in too much, but I don't seem to see this thrown explicitly in the plain gRPC client.
>>> >
>>> > Best,
>>> > --
>>> > Chen Song
>>
>>
>> --
>> Chen Song
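[Editor's note: the scenario discussed in this thread can be sketched in Python. This is a minimal illustration, not David's actual gist: a server configured with a short `grpc.max_connection_age_ms` and a longer `grpc.max_connection_age_grace_ms`, and a client issuing a run of RPCs that collectively outlives the max connection age. The channel argument names are gRPC's standard ones; the `Echo`/`Ping` service name and the identity byte serializers are made up for illustration.]

```python
import time
from concurrent import futures

import grpc


def _ping(request, context):
    # Sleep a little so the run of RPCs collectively outlives the server's
    # max connection age and forces at least one GOAWAY mid-sequence.
    time.sleep(0.2)
    return request


def main():
    handler = grpc.method_handlers_generic_handler(
        "Echo",
        {
            "Ping": grpc.unary_unary_rpc_method_handler(
                _ping,
                request_deserializer=lambda b: b,
                response_serializer=lambda b: b,
            )
        },
    )
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=4),
        options=[
            # A short max connection age forces frequent GOAWAYs; the longer
            # grace period lets in-flight RPCs finish first.
            ("grpc.max_connection_age_ms", 500),
            ("grpc.max_connection_age_grace_ms", 10_000),
        ],
    )
    server.add_generic_rpc_handlers((handler,))
    port = server.add_insecure_port("localhost:0")
    server.start()

    channel = grpc.insecure_channel(f"localhost:{port}")
    ping = channel.unary_unary(
        "/Echo/Ping",
        request_serializer=lambda b: b,
        response_deserializer=lambda b: b,
    )
    # If the client handles GOAWAY correctly, every RPC succeeds: in-flight
    # calls finish within the grace period, and subsequent calls transparently
    # use a fresh connection instead of surfacing UNAVAILABLE.
    responses = [ping(b"hello") for _ in range(5)]

    channel.close()
    server.stop(0)
    return responses
```

Lengthening the per-RPC sleep past the grace period should instead reproduce the UNAVAILABLE/GOAWAY error discussed above.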
