Thanks for the reply, David.

The use case is that we enabled autoscaling on Kubernetes and expect new
pods to be launched during a load spike. We want the client to become aware
of them and connect to the new pods. After some research and
experimentation, we found that setting MAX_CONNECTION_AGE to a lower
duration makes the server send a GOAWAY signal to the client; the client
then refreshes the IP list from the DNS/name resolver and updates its
connections to the new list of pods. We set MAX_CONNECTION_AGE_GRACE to a
very long duration, like a day, to ensure any in-flight RPC can finish. This
seems to work for the gRPC Python client in this setting.
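
For reference, here is a minimal sketch of how these two settings could be
expressed for a gRPC Python server. The channel argument names
"grpc.max_connection_age_ms" and "grpc.max_connection_age_grace_ms" are the
gRPC core ones; the helper name connection_age_options is made up for
illustration, and the 30 minute / 1 day values are the ones from my earlier
message, not a recommendation.

```python
from datetime import timedelta


def connection_age_options(max_age, grace):
    """Build gRPC server options for MAX_CONNECTION_AGE and its grace period.

    Hypothetical helper: it only converts durations to the millisecond
    integers that the gRPC channel arguments expect.
    """

    def to_ms(duration):
        return int(duration.total_seconds() * 1000)

    return [
        # After this age the server sends GOAWAY on the connection.
        ("grpc.max_connection_age_ms", to_ms(max_age)),
        # In-flight RPCs get this long to finish before a forced close.
        ("grpc.max_connection_age_grace_ms", to_ms(grace)),
    ]


options = connection_age_options(timedelta(minutes=30), timedelta(days=1))
```

The resulting list would be passed as the options argument when creating the
server, e.g. grpc.server(thread_pool, options=options).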

> but what you're seeing is that existing RPCs are getting interrupted
> (likely because the client did not terminate the connection)? Does the
> UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug
> string "max_age") or a different one? (I'm wondering if the gRPC client is
> not actually handling this case properly and terminating on the first
> GOAWAY.)

The UNAVAILABLE error indicated the GOAWAY is the first one (with *max_age*
as the reason). I can try to pull the stack trace within Flight to see how
the exception is propagated up the chain, if you think that would be
helpful.

BTW, I am a bit confused about the expected termination behavior you
described. Shouldn't the client terminate the connection after any
in-flight RPC completes, rather than immediately after seeing the first
GOAWAY signal?
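
To make the question concrete, below is a toy model of the draining behavior
I would expect. It is not gRPC or Flight code: the Connection class and its
method names are invented for illustration, and it ignores details such as
the last-stream-id carried by GOAWAY. It only encodes the general HTTP/2
idea that GOAWAY stops new streams while established ones finish.

```python
class Connection:
    """Toy model of a client connection draining after GOAWAY (hypothetical)."""

    def __init__(self):
        self.active_streams = set()
        self.draining = False
        self.closed = False
        self._next_id = 1

    def open_stream(self):
        # Once draining, new RPCs should go to a fresh connection instead.
        if self.draining or self.closed:
            raise RuntimeError("connection is draining; use a new connection")
        stream_id = self._next_id
        self._next_id += 2  # client-initiated HTTP/2 stream ids are odd
        self.active_streams.add(stream_id)
        return stream_id

    def on_goaway(self):
        # Stop accepting new streams, but leave in-flight ones alone.
        self.draining = True
        self._maybe_close()

    def finish_stream(self, stream_id):
        self.active_streams.discard(stream_id)
        self._maybe_close()

    def _maybe_close(self):
        # Close only when draining and nothing is in flight any more.
        if self.draining and not self.active_streams:
            self.closed = True


conn = Connection()
stream = conn.open_stream()
conn.on_goaway()             # in-flight stream keeps running
still_open = not conn.closed
conn.finish_stream(stream)   # last stream done, so the connection closes
```

In this model the connection stays open after GOAWAY until the in-flight
stream finishes, which is what I assumed the client would do.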

Best,
Chen

On Wed, Sep 1, 2021 at 10:38 AM David Li <[email protected]> wrote:

> Hi Chen,
>
> Thanks for bringing this up. Flight doesn't have explicit handling for
> this case in any implementation. What is the use case for enabling this?
>
> From what I see, what's desired is:
>
> - Existing RPCs should be able to finish,
> - The connection should terminate,
> - gRPC's connection management should automatically reconnect (and likely
> get re-load-balanced) on the next call
>
> but what you're seeing is that existing RPCs are getting interrupted
> (likely because the client did not terminate the connection)? Does the
> UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug
> string "max_age") or a different one? (I'm wondering if the gRPC client is
> not actually handling this case properly and terminating on the first
> GOAWAY.)
>
> If this sounds right, then when I get a chance, I'll do some more
> investigation & report back/file a JIRA.
>
> -David
>
> On 2021/08/31 23:15:54, Chen Song <[email protected]> wrote:
> > Recently, we turned on MAX_CONNECTION_AGE
> > <https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md#proposal>
> > on gRPC server. My understanding is that the server will send GOAWAY with
> > status code NO_ERROR to the client but don't kill the connection until
> > MAX_CONNECTION_AGE_GRACE is reached. In response, the client can properly
> > handle this and continue with the current stream (w/o creating new
> > stream).
> >
> > We set the former to be 30 minutes and the latter to very long (like a
> > day). However, we started seeing FlightRuntimeException with
> > CallStatus=UNAVAILABLE thrown from the client wrapping the underlying
> > GOAWAY error. It seems to be thrown from FlightStream.java
> > <https://arrow.apache.org/docs/java/reference/org/apache/arrow/flight/FlightStream.html#next-->.
> >
> > My question is, is this how Flight currently handles this type of error?
> > If so, can it be improved to handle it a better way (e.g., NO_ERROR
> > meaning the current stream can continue)?
> >
> > BTW, I haven't dug too much but I don't seem to see this thrown
> > explicitly in plain gRPC client.
> >
> > Best,
> > --
> > Chen Song
> >
>


-- 
Chen Song
