Can you try this: remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so that young GCs (YGC) happen once every 2-3 seconds? If that fixes the issue, then I think GC is the cause of your problem.
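For example, a trimmed set of options along those lines might look like the following. This is only a sketch: dropping the flag is the concrete change, but the -Xmx and -XX:NewSize values here are guesses, and you would need to watch gc.log and adjust NewSize until young GCs actually land in the 2-3 second range for your allocation rate.

supervisor.childopts: "-Xms1G -Xmx1G -XX:NewSize=256M -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
  -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
  -Xloggc:/usr/local/storm/logs/gc.log -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M"

The remaining diagnostic and JMX flags from your current setting can stay as they are; the point is only to drop CMSScavengeBeforeRemark and shrink the heap/young generation.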
On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen <[email protected]> wrote:

> We use storm bare bones, not trident, as it's too expensive for our use
> cases. The JVM options for the supervisor are listed below, but they might
> not be optimal in any sense.
>
> supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
> -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000
> -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions
> -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
> -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC
> -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
> -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false"
>
> On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <[email protected]> wrote:
>
>> Just to be sure, are you using Storm or Storm Trident?
>> Also, can you share the current setting of your supervisor.child_opts?
>>
>> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <[email protected]> wrote:
>>
>>> I did enable GC logging for both worker and supervisor and found nothing
>>> abnormal (pause is minimal and frequency is normal too). I tried max
>>> spout pending of both 1000 and 500.
>>>
>>> Fang
>>>
>>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <[email protected]> wrote:
>>>
>>>> Hi Fang,
>>>>
>>>> Did you check your GC log? Do you see anything abnormal?
>>>> What is your current max spout pending setting?
>>>>
>>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:
>>>>
>>>>> I also did this and found no success.
>>>>>
>>>>> Thanks,
>>>>> Fang
>>>>>
>>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>>>>>
>>>>>> After I wrote that, I realized you tried an empty topology anyway. This
>>>>>> should reduce any GC- or worker-initialization-related failures, though
>>>>>> they are still possible. As Erik mentioned, check ZK. Also, I'm not sure
>>>>>> if this is still required, but it used to be helpful to make sure your
>>>>>> storm nodes have each other listed in /etc/hosts.
>>>>>>
>>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>>>>>
>>>>>>> Make sure your topology is starting up in the allotted time, and if
>>>>>>> not, try increasing the startup timeout.
>>>>>>>
>>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Erik,
>>>>>>>>
>>>>>>>> Thanks for your reply! It's great to hear about real production
>>>>>>>> usages. For our use case, we are really puzzled by the outcome so far.
>>>>>>>> The initial investigation seems to indicate that workers don't die by
>>>>>>>> themselves (I actually tried killing the supervisor, and the worker
>>>>>>>> would continue running beyond 30 minutes).
>>>>>>>>
>>>>>>>> The sequence of events is like this: the supervisor immediately
>>>>>>>> complains that the worker "still has not started" for a few seconds
>>>>>>>> right after launching the worker process, then goes silent --> after
>>>>>>>> 26 minutes, nimbus complains that executors (related to the worker)
>>>>>>>> are "not alive" and starts to reassign the topology --> after another
>>>>>>>> ~500 milliseconds, the supervisor shuts down its worker --> other peer
>>>>>>>> workers complain about netty issues. And the loop goes on.
>>>>>>>>
>>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>>> 0.9.4? And how many nodes are in the zookeeper cluster?
>>>>>>>>
>>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> Fang
>>>>>>>>
>>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey Fang,
>>>>>>>>>
>>>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of
>>>>>>>>> 30+ nodes.
>>>>>>>>>
>>>>>>>>> One of the challenges with storm is figuring out what the root
>>>>>>>>> cause is when things go haywire. You'll wanna examine why the nimbus
>>>>>>>>> decided to restart your worker processes. It would happen when
>>>>>>>>> workers die and the nimbus notices that storm executors aren't alive.
>>>>>>>>> (There are logs in nimbus for this.) Then you'll wanna dig into why
>>>>>>>>> the workers died by looking at logs on the worker hosts.
>>>>>>>>>
>>>>>>>>> - Erik
>>>>>>>>>
>>>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>>>> tried 0.9.5 yet, but I don't see any significant differences there),
>>>>>>>>>> and unfortunately we could not even get a clean run of over 30
>>>>>>>>>> minutes on a cluster of 5 high-end nodes. Zookeeper is also set up
>>>>>>>>>> on these nodes, but on different disks.
>>>>>>>>>>
>>>>>>>>>> I have had huge trouble giving my data analytics topology a stable
>>>>>>>>>> run. So I tried the simplest topology I can think of: just an empty
>>>>>>>>>> bolt, no I/O except for reading from a kafka queue.
>>>>>>>>>>
>>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>>> (kafka topic partition=1, spout task #=1, bolt #=20 with field
>>>>>>>>>> grouping, msg size=1k): after 26 minutes, nimbus orders a kill of
>>>>>>>>>> the topology as it believes the topology is dead, then after another
>>>>>>>>>> 2 minutes another kill, then another after another 4 minutes, and on
>>>>>>>>>> and on.
>>>>>>>>>>
>>>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any
>>>>>>>>>> doable workarounds? I wish there were, as so many of you are using
>>>>>>>>>> it in production :-)
>>>>>>>>>>
>>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>>> topology work!
>>>>>>>>>>
>>>>>>>>>> Fang
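For what it's worth, the startup and liveness timeouts Nathan and Erik mention in the thread above correspond to a handful of storm.yaml settings. The key names are standard Storm configuration options, but the values below are only illustrative starting points for experimentation, not recommendations:

supervisor.worker.start.timeout.secs: 300   # how long the supervisor waits for a worker to finish starting
nimbus.task.timeout.secs: 60                # heartbeat timeout before nimbus declares executors "not alive"
nimbus.supervisor.timeout.secs: 120         # heartbeat timeout before nimbus declares a supervisor dead
storm.zookeeper.session.timeout: 30000      # ZooKeeper session timeout, in milliseconds
topology.max.spout.pending: 500             # cap on un-acked tuples in flight per spout task

Raising the nimbus timeouts only masks the underlying problem, but it can buy enough headroom to capture the worker and ZooKeeper logs before the next reassignment kicks in.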
