Thanks for your reply. We have solved the problem. I think the main cause was GC. We changed the worker.childopts params and the problem has not happened since.
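
In case it helps someone else, this is roughly how the setting goes into storm.yaml on the supervisor nodes. worker.childopts takes a string of JVM options; the sketch below is a trimmed subset of our full flag list (included further down), so treat the exact values as illustrative rather than a recommendation:

worker.childopts: "-Xmx2600m -Xms2600m -XX:NewSize=1000m -XX:MaxNewSize=1000m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Xloggc:logs/gc-worker-%ID%.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

The supervisors need a restart before newly launched workers pick up the change, and a single topology can also be given its own options via topology.worker.childopts.
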
------------------------------------------------
before config:
-Xmx1024m
------------------------------------------------
after config:
-Xmx2600m -Xms2600m -Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m
-XX:NewSize=1000m -XX:MaxNewSize=1000m -XX:MaxTenuringThreshold=1
-XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -server -XX:+AggressiveOpts
-XX:+UseCompressedOops -Djava.awt.headless=true
-Djava.net.preferIPv4Stack=true -Xloggc:logs/gc-worker-%ID%.log -verbose:gc
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10m
-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps
-XX:+PrintClassHistogram -XX:+PrintTenuringDistribution
-XX:-PrintGCApplicationStoppedTime -XX:-PrintGCApplicationConcurrentTime
-XX:+PrintCommandLineFlags -XX:+PrintFlagsFinal

On Thu, Aug 21, 2014 at 6:16 AM, Derek Dagit <[email protected]> wrote:

>> 1) is it related to a GC problem?
>
> This is usually the cause. As the worker consumes more and more of the
> heap, GC takes longer and longer each time. Eventually it takes so long
> that the heartbeats to the supervisor do not happen.
>
> There could be a spike or skew in your data such that one or more workers
> cannot handle it with their heap settings.
>
> --
> Derek
>
>
> On 8/20/14, 5:26, DashengJu wrote:
>
>> hi, all
>>
>> In our production environment, we have a topology named
>> logparser_mobile_nginx. It has 50 workers, the spout has 48 executors,
>> bolt_parser has 1000 executors and bolt_saver has 50 executors.
>>
>> The topology runs normally most of the time, but 1~5 workers restart
>> every 1~2 hours. When we check the logs of the supervisor and the
>> worker, we find that 1) the worker has no error or exception; 2) the
>> supervisor says the worker did not heartbeat and timed out.
>>
>> Because the worker has no log, I do not know why the worker did not
>> heartbeat. Does anyone have ideas on how to investigate?
>> 0) is it caused by the worker process exiting?
>> 1) is it related to a GC problem?
>> 2) is it related to a memory problem? If so, I think the JVM would
>> report a memory exception in the worker log.
>>
>> By the way, some small topologies work well in the same environment.
>>
>> Below is the supervisor log:
>> ----------------------------------------------------------------------------------------------
>> 2014-08-20 15:51:33 b.s.d.supervisor [INFO] 90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started
>>
>> 2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing state for id c7e8d375-db76-4e29-8019-e783ab3cd6de. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[4 4] [104 104] [204 204] [54 54] [154 154] [-1 -1]}, :port 9714}
>> 2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>> 2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901. Process is probably already dead.
>> 2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921. Process is probably already dead.
>> 2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>>
>> 2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing state for id d5a8d578-89ff-4a50-a906-75e847ac63a1. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_nginx-265-1408521077", :executors #{[50 50] [114 114] [178 178] [-1 -1]}, :port 9700}
>> 2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>> 2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068. Process is probably already dead.
>> 2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>>
>> 2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing state for id 5154f643-cd79-4119-9368-153f1bede757. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[98 98] [198 198] [48 48] [148 148] [248 248] [-1 -1]}, :port 9720}
>> 2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>> 2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976. Process is probably already dead.
>> 2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986. Process is probably already dead.
>> 2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>>
>> 2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "app_upload_urls-218-1408503096", :executors #{[8 8] [40 40] [24 24] [-1 -1]}, :port 9713}
>> 2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>> 2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177. Process is probably already dead.
>> 2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>>
>> --
>> dashengju
>> +86 13810875910
>> [email protected]
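
P.S. For anyone who wants to confirm GC is the cause before touching heap settings, a rough sketch of what can be checked (jstat ships with the JDK; <worker-pid> is a placeholder for the pid of a worker that is close to timing out, and the gc log name assumes the %ID% in -Xloggc expands to the worker port, e.g. 9714 from the log above):

# old-gen occupancy plus GC counts/times, sampled every second
jstat -gcutil <worker-pid> 1000

# with the gc logging flags above, long collections show up in the worker gc log
grep "Full GC" logs/gc-worker-9714.log | tail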
