Hi Seye,

It seems that you have conducted an in-depth analysis of this issue. If you think it is a bug or needs improvement, please feel free to create a JIRA issue to track its status.

Thanks, vino.

Seye Jin <seyej...@gmail.com> wrote on Sun, Oct 14, 2018 at 12:02 AM:

> I recently upgraded to Flink 1.4 from 1.3 and use the Queryable State
> client in my application. I have 1 JM and 5 TMs, all served behind
> Kubernetes. A large state is built and distributed evenly across the task
> managers, and the client can query the state for a specified key.
>
> Issue: if a task manager dies, a new one gets spun up (automatically),
> and the QS states successfully recover in the new nodes/task slots, I start
> to get timeout exceptions when the client tries to query for a key, even if
> I try to reset or re-deploy the client jobs.
>
> I have been trying to triage this and figure out a way to remediate the
> issue. I found that in KvStateClientProxyHandler, which is not exposed in
> code, there is a forceUpdate flag that can help reset KvStateLocations (plus
> inetAddresses), but the default is false and it can't be overridden.
>
> I was wondering if anyone knows how to remediate this kind of issue, or if
> there is a way to let the JobManager know that the task manager location
> in the cache is no longer valid.
>
> Any tip to resolve this will be appreciated (I can't downgrade back to 1.3
> or upgrade from 1.4).