Hey David,

Thanks for looking into this. I tried your TestServer.java and I observe the same thing you do. I am using gRPC 1.33.1 and Flight 1.0.0. As the next step, I will try to dump the stack trace and share it with you.
Chen

On Fri, Sep 3, 2021 at 11:16 AM David Li wrote:

> So far, I've been unable to replicate the behavior you describe. I'm using gRPC 1.30.2 here.
>
> What does the test script in https://gist.github.com/lidavidm/80ea120a18cd7a4a5a881de874e2df7e do for you? This sets up a client and server where the server has a very short max connection age but a longer grace, then has the client make a series of requests that take longer than the grace period. It also prints gRPC's channel state in between RPCs. Hence, this would fail if the client isn't properly reconnecting after each request.
>
> What I observe is that gRPC is reconnecting after each RPC as expected. However, if you go over the grace period by adjusting the sleep call in the RPC, then you do get an UNAVAILABLE error with GOAWAY/"no_error" so it seems the gRPC client reports the *first* goaway's error message as the error, which may be misleading.
>
> -David
>
> On Thu, Sep 2, 2021, at 19:31, David Li wrote:
>
> Got it - thanks for the clarification, good to know that Python (C++) is behaving as expected here.
>
> > BTW, I am a bit confused on the expected termination behavior you described. Shouldn't the client terminate the connection after the in-flight RPC is complete if any, instead of immediately after seeing the first GOAWAY signal?
>
> Right, what I mean is that gRPC/Java presumably shouldn't terminate the call on the first GOAWAY, but should be routing future calls across a new connection. Yet as you're observing, it's instead erroring right away. I'll try to set aside some time soon to dig into this in the Java client.
>
> A stack trace would be helpful to just make sure I'm looking at the same case you are, if that's not too much trouble.
>
> Best,
> David
>
> On Thu, Sep 2, 2021, at 18:13, Chen Song wrote:
>
> Thanks for the reply, David.
>
> The use case is that we enabled autoscaling on Kubernetes and expect new pods to be launched during load spikes. As a result, we want the client to be aware of this and connect to the new pods. After research and experimentation, we found that, by setting MAX_CONNECTION_AGE to a lower duration, the server will send a GOAWAY signal to the client and the client will refresh the IP list from the DNS/name resolver and update its connections to the new list of pods. We set MAX_CONNECTION_AGE_GRACE to a very long duration, like a day, to ensure any in-flight RPC can finish. This seems to work for the gRPC Python client in this setting.
>
> > but what you're seeing is that existing RPCs are getting interrupted (likely because the client did not terminate the connection)? Does the UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug string "max_age") or a different one? (I'm wondering if the gRPC client is not actually handling this case properly and terminating on the first GOAWAY.)
>
> The UNAVAILABLE error indicated the GOAWAY is the first one (with *max_age* as the reason). I can try to pull the stack trace within Flight to see how the exception propagates up the chain if you think that is helpful.
>
> BTW, I am a bit confused on the expected termination behavior you described. Shouldn't the client terminate the connection after the in-flight RPC is complete if any, instead of immediately after seeing the first GOAWAY signal?
>
> Best,
> Chen
>
> On Wed, Sep 1, 2021 at 10:38 AM David Li wrote:
>
> Hi Chen,
>
> Thanks for bringing this up. Flight doesn't have explicit handling for this case in any implementation. What is the use case for enabling this?
>
> From what I see, what's desired is:
>
> - Existing RPCs should be able to finish,
> - The connection should terminate,
> - gRPC's connection management should automatically reconnect (and likely get re-load-balanced) on the next call
>
> but what you're seeing is that existing RPCs are getting interrupted (likely because the client did not terminate the connection)? Does the UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug string "max_age") or a different one? (I'm wondering if the gRPC client is not actually handling this case properly and terminating on the first GOAWAY.)
>
> If this sounds right, then when I get a chance, I'll do some more investigation & report back/file a JIRA.
>
> -David
>
> On 2021/08/31 23:15:54, Chen Song wrote:
> > Recently, we turned on MAX_CONNECTION_AGE <https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md#proposal> on gRPC server. My understanding is that the server will send GOAWAY with status code NO_ERROR to the client but doesn't kill the connection until MAX_CONNECTION_AGE_GRACE is reached. In response, the client can properly handle this and continue with the current stream (w/o creating new stream).
> >
> > We set the former to be 30 minutes and the latter to very long (like a day). However, we started seeing FlightRuntimeException with CallStatus=UNAVAILABLE thrown from the client wrapping the underlying GOAWAY error. It seems to be thrown from FlightStream.java <https://arrow.apache.org/docs/java/reference/org/apache/arrow/flight/FlightStream.html#next-->.
> >
> > My question is, is this how Flight currently handles this type of error? If so, can it be improved to handle it in a better way (e.g., NO_ERROR meaning the current stream can continue)?
> >
> > BTW, I haven't dug too much but I don't seem to see this thrown explicitly in plain gRPC client.
> >
> > Best,
> > --
> > Chen Song
>
> --
> Chen Song

--
Chen Song
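
For reference, the timing that both sides of this thread expect can be written down as a small toy model. This is only a sketch, not gRPC or Flight code: the class `GoAwayTimeline`, its `outcome` method, and the parameter names are made up for illustration. It assumes the server sends GOAWAY once the connection reaches the max age, forcibly closes the connection once the grace period has also elapsed, and that a well-behaved client opens a fresh connection for any RPC started after the GOAWAY. It ignores jitter and network delay.

```java
import java.time.Duration;

/** Toy timing model for one RPC under MAX_CONNECTION_AGE; not gRPC or Flight code. */
final class GoAwayTimeline {
    enum Outcome { COMPLETES, UNAVAILABLE }

    /**
     * @param maxAge               connection age at which the server sends GOAWAY
     * @param grace                extra time in-flight RPCs get after the GOAWAY
     * @param connectionAgeAtStart how old the connection is when the RPC starts
     * @param rpcDuration          how long the RPC runs
     */
    static Outcome outcome(Duration maxAge, Duration grace,
                           Duration connectionAgeAtStart, Duration rpcDuration) {
        // Assumption: once the GOAWAY has been sent (age >= maxAge), the client
        // opens a fresh connection for new RPCs, so the age restarts at zero.
        Duration ageAtStart = connectionAgeAtStart.compareTo(maxAge) >= 0
                ? Duration.ZERO
                : connectionAgeAtStart;
        Duration rpcEnd = ageAtStart.plus(rpcDuration);
        // Assumption: the server forcibly closes the connection at maxAge + grace.
        Duration forcedClose = maxAge.plus(grace);
        return rpcEnd.compareTo(forcedClose) <= 0 ? Outcome.COMPLETES : Outcome.UNAVAILABLE;
    }

    public static void main(String[] args) {
        // Settings from the original report: 30 minute max age, one day of grace.
        System.out.println(outcome(Duration.ofMinutes(30), Duration.ofDays(1),
                Duration.ofMinutes(29), Duration.ofMinutes(10)));
        // Short age and grace, with an RPC that outlives both.
        System.out.println(outcome(Duration.ofSeconds(1), Duration.ofSeconds(5),
                Duration.ZERO, Duration.ofSeconds(10)));
    }
}
```

Under this model, an RPC that starts shortly before the 30 minute mark and runs for ten minutes should still complete with a one-day grace period, which is why an UNAVAILABLE error carrying the first GOAWAY's "max_age" reason looks unexpected.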
