Thanks for the reply, David. The use case is that we enabled autoscaling on Kubernetes and expect new pods to be launched during load spikes. We want the client to become aware of the new pods and connect to them. After some research and experimentation, we found that when MAX_CONNECTION_AGE is set to a shorter duration, the server sends a GOAWAY to the client, and the client then refreshes the IP list from the DNS/name resolver and updates its connections to the new list of pods. We set MAX_CONNECTION_AGE_GRACE to a very long duration (like a day) to ensure any in-flight RPC can finish. This seems to work for the gRPC Python client in this setting.
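
For concreteness, here is a minimal sketch of that configuration using the gRPC Python server API. The helper names (max_age_options, build_server) and the worker count are hypothetical and not our actual deployment code; the two channel-argument keys are the gRPC core names for MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE, and the 30-minute / one-day values are the ones mentioned in this thread.

```python
from datetime import timedelta


def max_age_options(age: timedelta, grace: timedelta) -> list:
    """Build gRPC channel arguments (in milliseconds) for connection max age."""
    return [
        # After this age the server sends GOAWAY so clients re-resolve and reconnect.
        ("grpc.max_connection_age_ms", int(age.total_seconds() * 1000)),
        # How long in-flight RPCs may keep running after the GOAWAY.
        ("grpc.max_connection_age_grace_ms", int(grace.total_seconds() * 1000)),
    ]


def build_server(age=timedelta(minutes=30), grace=timedelta(days=1)):
    """Hypothetical helper: create a gRPC server with the max-age settings applied."""
    # Imported lazily so max_age_options() can be used without grpcio installed.
    from concurrent import futures

    import grpc

    return grpc.server(
        futures.ThreadPoolExecutor(max_workers=4),  # worker count is an assumption
        options=max_age_options(age, grace),
    )
```

Services would still need to be registered and a port bound on the returned server before calling start().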

> but what you're seeing is that existing RPCs are getting interrupted
> (likely because the client did not terminate the connection)? Does the
> UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug
> string "max_age") or a different one? (I'm wondering if the gRPC client is
> not actually handling this case properly and terminating on the first
> GOAWAY.)

The UNAVAILABLE error indicates that the GOAWAY is the first one (with *max_age* as the reason). I can try to pull the stack trace within Flight to see how the exception propagates up the chain, if you think that would be helpful.

BTW, I am a bit confused about the expected termination behavior you described. Shouldn't the client terminate the connection after any in-flight RPC completes, instead of immediately after seeing the first GOAWAY?

Best,
Chen

On Wed, Sep 1, 2021 at 10:38 AM David Li <[email protected]> wrote:

> Hi Chen,
>
> Thanks for bringing this up. Flight doesn't have explicit handling for
> this case in any implementation. What is the use case for enabling this?
>
> From what I see, what's desired is:
>
> - Existing RPCs should be able to finish,
> - The connection should terminate,
> - gRPC's connection management should automatically reconnect (and likely
>   get re-load-balanced) on the next call
>
> but what you're seeing is that existing RPCs are getting interrupted
> (likely because the client did not terminate the connection)? Does the
> UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug
> string "max_age") or a different one? (I'm wondering if the gRPC client is
> not actually handling this case properly and terminating on the first
> GOAWAY.)
>
> If this sounds right, then when I get a chance, I'll do some more
> investigation & report back/file a JIRA.
>
> -David
>
> On 2021/08/31 23:15:54, Chen Song <[email protected]> wrote:
> > Recently, we turned on MAX_CONNECTION_AGE
> > <https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md#proposal>
> > on gRPC server. My understanding is that the server will send GOAWAY with
> > status code NO_ERROR to the client but don't kill the connection until
> > MAX_CONNECTION_AGE_GRACE is reached. In response, the client can properly
> > handle this and continue with the current stream (w/o creating new stream).
> >
> > We set the former to be 30 minutes and the latter to very long (like a
> > day). However, we started seeing FlightRuntimeException with
> > CallStatus=UNAVAILABLE thrown from the client wrapping the underlying
> > GOAWAY error. It seems to be thrown from FlightStream.java
> > <https://arrow.apache.org/docs/java/reference/org/apache/arrow/flight/FlightStream.html#next-->.
> >
> > My question is, is this how Flight currently handles this type of error? If
> > so, can it be improved to handle it a better way (e.g., NO_ERROR meaning
> > the current stream can continue)?
> >
> > BTW, I haven't dug too much but I don't seem to see this thrown explicitly
> > in plain gRPC client.
> >
> > Best,
> > --
> > Chen Song
>

--
Chen Song
