[ https://issues.apache.org/jira/browse/STORM-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Li resolved STORM-3763. ----------------------------- Fix Version/s: 2.2.1 2.3.0 Resolution: Fixed Merged to master: 4a1eea700766da2f175ac7eaba6064f0d7f0ff03 Cherry-picked to 2.2.x-branch: 7543a13e021570b6b4d7e583e00823c0b3106a3b > Backpressure message ignored by the receiver caused the topology to not > progress > -------------------------------------------------------------------------------- > > Key: STORM-3763 > URL: https://issues.apache.org/jira/browse/STORM-3763 > Project: Apache Storm > Issue Type: Bug > Affects Versions: 2.0.0, 2.1.0, 2.2.0 > Reporter: Ethan Li > Assignee: Ethan Li > Priority: Major > Fix For: 2.3.0, 2.2.1 > > Time Spent: 20m > Remaining Estimate: 0h > > We have noticed a case where topology is stuck due to the mis-interpretation > of backpressure messge: > At beginning, the topology ran fine but a downstream component had > backpressure, so it sent backpressure signal to its upstream component, and > the upstream component paused sending data to the downstream bolt. > Then the downstream component restarted (due to any reason, for example, > killed by supervisor due to heartbeat timeout). When it came back up, it > sends backpressure message to the upstream bolt. However, the upstream > component didn't know how to interpret the backpressure message so it logs > the below error and ignores the message. > {code:java} > 2021-01-28 19:41:37.175 o.a.s.m.n.SaslStormClientHandler client-worker-1 > [ERROR] Unexpected message from server: > {worker=4c38160a-3c66-4eff-8572-2d0c493bd6c1, bpStatusId=254, bpTasks=[], > nonBpTasks=[546, 790, 863]} > {code} > Then the downstream component will not receive any data from the upstream > component, so it won't have any backpressure (since no data is sent to it), > hence it won't send any backpressure update message to the upstream > component. This leads to a dead situation that the upstream component thinks > the downstream has backpressure so it paused sending data to it, while the > downstream doesn't have backpressure but can't receive any data from > upstream. The topology is stuck because of it. > Let's look at the code: > When the connection between the downstream (server) and upstream (client) is > established, > server invokes > https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerHandler.java#L39-L41 > https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/daemon/worker/WorkerState.java#L237 > which sends backpressure messages to the client. > This is because in this pipeline, "StormServerHandler" is the only one with > that implemented "channelActive()" method. > https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerPipelineFactory.java#L56 > However, the Client side expects authentication messages. > https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormClientHandler.java#L70-L75 > so the client can't interpret the backpressure message at the beginning, > hence "unexpected message". > This can be supported with an example. I have a wordcount topology running. > At the startup, the client tries to connect to the server. Once connected, it > sends a "SASL_TOKEN_MESSAGE_REQUEST". > client log > {code:java} > 021-01-29 19:03:21.355 o.a.s.m.n.SaslStormClientHandler client-worker-1 > [DEBUG] SASL credentials for storm topology wc is > -8603731884381183101:-9091319821854384981 > 2021-01-29 19:03:21.359 o.a.s.m.n.Client client-worker-1 [DEBUG] successfully > connected to openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702, [id: > 0x29da2e9c, L:/10.215.73.209:45870 - > R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702] [attempt 12] > 2021-01-29 19:03:21.359 o.a.s.m.n.SaslStormClientHandler client-worker-1 > [INFO] Connection established from /10.215.73.209:45870 to > openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702 > ... > 2021-01-29 19:03:21.362 o.a.s.m.n.SaslStormClientHandler client-worker-1 > [DEBUG] Creating saslNettyClient now for channel: [id: 0x29da2e9c, > L:/10.215.73.209:45870 - > R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702] > 2021-01-29 19:03:21.363 o.a.s.m.n.SaslNettyClient client-worker-1 [DEBUG] > SaslNettyClient: Creating SASL DIGEST-MD5 client to authenticate to server > 2021-01-29 19:03:21.368 o.a.s.m.n.SaslStormClientHandler client-worker-1 > [DEBUG] Sending SASL_TOKEN_MESSAGE_REQUEST > ... > 2021-01-29 19:03:21.632 o.a.s.m.n.SaslStormClientHandler client-worker-1 > [DEBUG] send/recv time (ms): 277 > 2021-01-29 19:03:21.633 o.a.s.m.n.SaslStormClientHandler client-worker-1 > [ERROR] Unexpected message from server: > {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, bpStatusId=1, bpTasks=[], > nonBpTasks=[17, 1, 18, 3, 4, 22, 7, 8, 9, 12, > 13]} > {code} > But the server sends the backpressure message first, before it deals with the > SASL_TOKEN_MESSAGE_REQUEST message > server log > {code:java} > 2021-01-29 19:03:21.473 o.a.s.m.n.SaslStormServerHandler > Netty-server-localhost-6702-worker-1 [DEBUG] SASL credentials for storm > topology wc is -8603731884381183101:-9091319821854384981 > 2021-01-29 19:03:21.482 o.a.s.u.Utils main [DEBUG] Using storm.yaml from > resources > 2021-01-29 19:03:21.490 o.a.s.d.w.WorkerState > Netty-server-localhost-6702-worker-1 [INFO] Sending BackPressure status to > new client. BPStatus: {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, > bpStatusId=1, bpTasks=[], nonBpTasks=[17, > 1, 18, 3, 4, 22, 7, 8, 9, 12, 13]} > 2021-01-29 19:03:21.510 o.a.s.m.n.SaslStormClientHandler client-worker-1 > [DEBUG] SASL credentials for storm topology wc is > -8603731884381183101:-9091319821854384981 > 2021-01-29 19:03:21.572 o.a.s.s.i.n.u.Recycler > Netty-server-localhost-6702-worker-1 [DEBUG] > -Dio.netty.recycler.maxCapacityPerThread: 4096 > 2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler > Netty-server-localhost-6702-worker-1 [DEBUG] > -Dio.netty.recycler.maxSharedCapacityFactor: 2 > 2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler > Netty-server-localhost-6702-worker-1 [DEBUG] > -Dio.netty.recycler.linkCapacity: 16 > 2021-01-29 19:03:21.574 o.a.s.s.i.n.u.Recycler > Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.ratio: 8 > 2021-01-29 19:03:21.575 o.a.s.v.ConfigValidation main [WARN] > topology.backpressure.enable is a deprecated config please see class > org.apache.storm.Config.TOPOLOGY_BACKPRESSURE_ENABLE for more information. > 2021-01-29 19:03:21.593 o.a.s.s.i.n.b.AbstractByteBuf > Netty-server-localhost-6702-worker-1 [DEBUG] > -Dorg.apache.storm.shade.io.netty.buffer.checkAccessible: true > 2021-01-29 19:03:21.594 o.a.s.s.i.n.b.AbstractByteBuf > Netty-server-localhost-6702-worker-1 [DEBUG] > -Dorg.apache.storm.shade.io.netty.buffer.checkBounds: true > 2021-01-29 19:03:21.594 o.a.s.s.i.n.u.ResourceLeakDetectorFactory > Netty-server-localhost-6702-worker-1 [DEBUG] Loaded default > ResourceLeakDetector: > org.apache.storm.shade.io.netty.util.ResourceLeakDetector@524a134b > .... > 2021-01-29 19:03:21.695 o.a.s.m.n.SaslStormServerHandler > Netty-server-localhost-6702-worker-1 [DEBUG] No saslNettyServer for [id: > 0x6fa65bc5, L:/10.215.73.209:6702 - R:/10.215.73.209:45870] yet; creating > now, with topology token: wc > 2021-01-29 19:03:21.697 o.a.s.m.n.SaslNettyServer > Netty-server-localhost-6702-worker-1 [DEBUG] SaslNettyServer: Topology token > is: wc with authmethod DIGEST-MD5 > 2021-01-29 19:03:21.698 o.a.s.m.n.SaslNettyServer > Netty-server-localhost-6702-worker-1 [DEBUG] SaslDigestCallback: Creating > SaslDigestCallback handler with topology token: wc > {code} > https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormServerHandler.java#L46-L52 > This is a bug likely introduced in STORM-2306 > (https://github.com/apache/storm/pull/2502). This willl happen on every > topology when "storm.messaging.netty.authentication" is set true (It is false > by default) -- This message was sent by Atlassian Jira (v8.3.4#803005)