Hey David,

Thanks for looking into this. I tried your TestServer.java and I see the
same behavior. I am using gRPC 1.33.1 and Flight 1.0.0. As the next
step, I will try to dump the stack trace and share it with you.
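
To check that we read the timing the same way, here is a minimal sketch of
how I understand the age/grace interaction. It is a toy model in plain Java,
not gRPC or Flight code: the class ConnectionAgeModel, its method and enum
names, and the durations in main are all made up for illustration, and it
ignores any jitter the server may apply to the max connection age.

```java
import java.time.Duration;

// Toy model of MAX_CONNECTION_AGE / MAX_CONNECTION_AGE_GRACE timing.
// Hypothetical names throughout; this does not call gRPC or Flight.
final class ConnectionAgeModel {

    enum Outcome {
        COMPLETES_BEFORE_GOAWAY,  // RPC ends before the server sends GOAWAY
        COMPLETES_AFTER_GOAWAY,   // RPC outlives GOAWAY but ends within the grace period
        CUT_OFF_AFTER_GRACE       // RPC is still running when the grace period expires
    }

    // Classifies one RPC on a connection that was opened at time zero.
    // Assumes the RPC starts before the GOAWAY; later calls should use a new connection.
    static Outcome classify(Duration maxAge, Duration grace,
                            Duration rpcStart, Duration rpcLength) {
        if (rpcStart.compareTo(maxAge) >= 0) {
            throw new IllegalArgumentException("RPC must start before max connection age");
        }
        Duration rpcEnd = rpcStart.plus(rpcLength);
        Duration hardClose = maxAge.plus(grace);
        if (rpcEnd.compareTo(maxAge) <= 0) {
            return Outcome.COMPLETES_BEFORE_GOAWAY;
        }
        if (rpcEnd.compareTo(hardClose) <= 0) {
            return Outcome.COMPLETES_AFTER_GOAWAY;
        }
        return Outcome.CUT_OFF_AFTER_GRACE;
    }

    public static void main(String[] args) {
        Duration maxAge = Duration.ofSeconds(1);  // illustrative values only
        Duration grace = Duration.ofSeconds(5);
        for (long seconds : new long[] {1, 3, 10}) {
            Outcome outcome =
                classify(maxAge, grace, Duration.ZERO, Duration.ofSeconds(seconds));
            System.out.println(seconds + "s RPC: " + outcome);
        }
    }
}
```

With these made-up numbers, the 3 second call finishes inside the grace
period, while the 10 second call runs past it, which as far as I can tell is
the case where the test reports UNAVAILABLE.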

Chen

On Fri, Sep 3, 2021 at 11:16 AM David Li <[email protected]> wrote:

> So far, I've been unable to replicate the behavior you describe. I'm using
> gRPC 1.30.2 here.
>
> What does the test script in
> https://gist.github.com/lidavidm/80ea120a18cd7a4a5a881de874e2df7e do for
> you? This sets up a client and server where the server has a very short max
> connection age but a longer grace, then has the client make a series of
> requests that take longer than the grace period. It also prints gRPC's
> channel state in between RPCs. Hence, this would fail if the client isn't
> properly reconnecting after each request.
>
> What I observe is that gRPC is reconnecting after each RPC as expected.
> However, if you go over the grace period by adjusting the sleep call in the
> RPC, then you do get an UNAVAILABLE error with GOAWAY/"no_error". So it
> seems the gRPC client reports the *first* GOAWAY's error message as the
> error, which may be misleading.
>
> -David
>
> On Thu, Sep 2, 2021, at 19:31, David Li wrote:
>
> Got it - thanks for the clarification, good to know that Python (C++) is
> behaving as expected here.
>
> > BTW, I am a bit confused on the expected termination behavior you
> > described. Shouldn't the client terminate the connection after the
> > in-flight RPC is complete if any, instead of immediately after seeing the
> > first GOAWAY signal?
>
>
> Right, what I mean is that gRPC/Java presumably shouldn't terminate the
> call on the first GOAWAY, but should be routing future calls across a new
> connection. Yet as you're observing, it's instead erroring right away. I'll
> try to set aside some time soon to dig into this in the Java client.
>
> A stack trace would be helpful to just make sure I'm looking at the same
> case you are, if that's not too much trouble.
>
> Best,
> David
>
> On Thu, Sep 2, 2021, at 18:13, Chen Song wrote:
>
> Thanks for the reply, David.
>
> The use case is that we enabled autoscaling on Kubernetes and expect new
> pods to be launched during load spikes. As a result, we want the client to
> be aware of this and connect to the new pods. After research and
> experimentation, we found that, by setting MAX_CONNECTION_AGE to a lower
> duration, the server sends a GOAWAY signal to the client and the client
> refreshes the IP list from the DNS/name resolver and updates its connections
> to the new list of pods. We set MAX_CONNECTION_AGE_GRACE to a very long
> duration, like a day, to ensure any in-flight RPC can finish. This
> seems to work for the gRPC Python client in this setting.
>
> > but what you're seeing is that existing RPCs are getting interrupted
> > (likely because the client did not terminate the connection)? Does the
> > UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug
> > string "max_age") or a different one? (I'm wondering if the gRPC client is
> > not actually handling this case properly and terminating on the first
> > GOAWAY.)
>
> The UNAVAILABLE error indicated the GOAWAY is the first one (with
> *max_age* as the reason). I can try to pull the stack trace within Flight
> to see how the exception is propagated up the chain if you think that is
> helpful.
> BTW, I am a bit confused on the expected termination behavior you
> described. Shouldn't the client terminate the connection after the
> in-flight RPC is complete if any, instead of immediately after seeing the
> first GOAWAY signal?
>
> Best,
> Chen
>
> On Wed, Sep 1, 2021 at 10:38 AM David Li <[email protected]> wrote:
>
> Hi Chen,
>
> Thanks for bringing this up. Flight doesn't have explicit handling for
> this case in any implementation. What is the use case for enabling this?
>
> From what I see, what's desired is:
>
> - Existing RPCs should be able to finish,
> - The connection should terminate,
> - gRPC's connection management should automatically reconnect (and likely
> get re-load-balanced) on the next call
>
> but what you're seeing is that existing RPCs are getting interrupted
> (likely because the client did not terminate the connection)? Does the
> UNAVAILABLE error indicate whether the GOAWAY is the first one (with debug
> string "max_age") or a different one? (I'm wondering if the gRPC client is
> not actually handling this case properly and terminating on the first
> GOAWAY.)
>
> If this sounds right, then when I get a chance, I'll do some more
> investigation & report back/file a JIRA.
>
> -David
>
> On 2021/08/31 23:15:54, Chen Song <[email protected]> wrote:
> > Recently, we turned on MAX_CONNECTION_AGE
> > <https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md#proposal>
> > on our gRPC server. My understanding is that the server will send GOAWAY
> > with status code NO_ERROR to the client but won't kill the connection until
> > MAX_CONNECTION_AGE_GRACE is reached. In response, the client can properly
> > handle this and continue with the current stream (w/o creating a new
> > stream).
> >
> > We set the former to be 30 minutes and the latter to very long (like a
> > day). However, we started seeing FlightRuntimeException with
> > CallStatus=UNAVAILABLE thrown from the client wrapping the underlying
> > GOAWAY error. It seems to be thrown from FlightStream.java
> > <https://arrow.apache.org/docs/java/reference/org/apache/arrow/flight/FlightStream.html#next-->.
> >
> > My question is, is this how Flight currently handles this type of error?
> > If so, can it be improved to handle it in a better way (e.g., NO_ERROR
> > meaning the current stream can continue)?
> >
> > BTW, I haven't dug too much but I don't seem to see this thrown
> > explicitly in a plain gRPC client.
> >
> > Best,
> > --
> > Chen Song
> >
>
> --
> Chen Song
>

-- 
Chen Song
