Can you try this: remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so that young GCs (YGC) happen once every 2-3 seconds? If that fixes the issue, then I think GC is the cause of your problem.
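For example, a trimmed set of options along those lines might look like the following. This is only a sketch: dropping the flag is the concrete change, but the -Xmx and -XX:NewSize values here are guesses, and you would need to watch gc.log and adjust NewSize until young GCs actually land in the 2-3 second range for your allocation rate.

supervisor.childopts: "-Xms1G -Xmx1G -XX:NewSize=256M -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
  -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
  -Xloggc:/usr/local/storm/logs/gc.log -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M"

The remaining diagnostic and JMX flags from your current setting can stay as they are; the point is only to drop CMSScavengeBeforeRemark and shrink the heap/young generation.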
On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen <[email protected]> wrote:

> We use storm bare bones, not trident, as it's too expensive for our use
> cases. The JVM options for the supervisor are listed below, but they might
> not be optimal in any sense.
>
> supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
> -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000
> -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions
> -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
> -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC
> -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
> -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false"
>
> On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <[email protected]> wrote:
>
>> Just to be sure, are you using Storm or Storm Trident?
>> Also, can you share the current setting of your supervisor.child_opts?
>>
>> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <[email protected]> wrote:
>>
>>> I did enable GC logging for both worker and supervisor and found nothing
>>> abnormal (pause is minimal and frequency is normal too). I tried max
>>> spout pending of both 1000 and 500.
>>>
>>> Fang
>>>
>>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <[email protected]> wrote:
>>>
>>>> Hi Fang,
>>>>
>>>> Did you check your GC log? Do you see anything abnormal?
>>>> What is your current max spout pending setting?
>>>>
>>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:
>>>>
>>>>> I also did this and found no success.
>>>>>
>>>>> Thanks,
>>>>> Fang
>>>>>
>>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>>>>>
>>>>>> After I wrote that, I realized you tried an empty topology anyway. This
>>>>>> should reduce any GC- or worker-initialization-related failures, though
>>>>>> they are still possible. As Erik mentioned, check ZK. Also, I'm not sure
>>>>>> if this is still required, but it used to be helpful to make sure your
>>>>>> storm nodes have each other listed in /etc/hosts.
>>>>>>
>>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>>>>>
>>>>>>> Make sure your topology is starting up in the allotted time, and if
>>>>>>> not, try increasing the startup timeout.
>>>>>>>
>>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Erik,
>>>>>>>>
>>>>>>>> Thanks for your reply! It's great to hear about real production
>>>>>>>> usages. For our use case, we are really puzzled by the outcome so far.
>>>>>>>> The initial investigation seems to indicate that workers don't die by
>>>>>>>> themselves (I actually tried killing the supervisor, and the worker
>>>>>>>> would continue running beyond 30 minutes).
>>>>>>>>
>>>>>>>> The sequence of events is like this: the supervisor immediately
>>>>>>>> complains that the worker "still has not started" for a few seconds
>>>>>>>> right after launching the worker process, then goes silent --> after
>>>>>>>> 26 minutes, nimbus complains that executors (related to the worker)
>>>>>>>> are "not alive" and starts to reassign the topology --> after another
>>>>>>>> ~500 milliseconds, the supervisor shuts down its worker --> other peer
>>>>>>>> workers complain about netty issues. And the loop goes on.
>>>>>>>>
>>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>>> 0.9.4? And how many nodes are in the zookeeper cluster?
>>>>>>>>
>>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> Fang
>>>>>>>>
>>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey Fang,
>>>>>>>>>
>>>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of
>>>>>>>>> 30+ nodes.
>>>>>>>>>
>>>>>>>>> One of the challenges with storm is figuring out what the root
>>>>>>>>> cause is when things go haywire. You'll wanna examine why the nimbus
>>>>>>>>> decided to restart your worker processes. It would happen when
>>>>>>>>> workers die and the nimbus notices that storm executors aren't alive.
>>>>>>>>> (There are logs in nimbus for this.) Then you'll wanna dig into why
>>>>>>>>> the workers died by looking at logs on the worker hosts.
>>>>>>>>>
>>>>>>>>> - Erik
>>>>>>>>>
>>>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>>>> tried 0.9.5 yet, but I don't see any significant differences there),
>>>>>>>>>> and unfortunately we could not even get a clean run of over 30
>>>>>>>>>> minutes on a cluster of 5 high-end nodes. Zookeeper is also set up
>>>>>>>>>> on these nodes, but on different disks.
>>>>>>>>>>
>>>>>>>>>> I have had huge trouble giving my data analytics topology a stable
>>>>>>>>>> run. So I tried the simplest topology I can think of: just an empty
>>>>>>>>>> bolt, no I/O except for reading from a kafka queue.
>>>>>>>>>>
>>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>>> (kafka topic partition=1, spout task #=1, bolt #=20 with field
>>>>>>>>>> grouping, msg size=1k): after 26 minutes, nimbus orders a kill of
>>>>>>>>>> the topology as it believes the topology is dead, then after another
>>>>>>>>>> 2 minutes another kill, then another after another 4 minutes, and on
>>>>>>>>>> and on.
>>>>>>>>>>
>>>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any
>>>>>>>>>> doable workarounds? I wish there were, as so many of you are using
>>>>>>>>>> it in production :-)
>>>>>>>>>>
>>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>>> topology work!
>>>>>>>>>>
>>>>>>>>>> Fang
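For what it's worth, the startup and liveness timeouts Nathan and Erik mention in the thread above correspond to a handful of storm.yaml settings. The key names are standard Storm configuration options, but the values below are only illustrative starting points for experimentation, not recommendations:

supervisor.worker.start.timeout.secs: 300   # how long the supervisor waits for a worker to finish starting
nimbus.task.timeout.secs: 60                # heartbeat timeout before nimbus declares executors "not alive"
nimbus.supervisor.timeout.secs: 120         # heartbeat timeout before nimbus declares a supervisor dead
storm.zookeeper.session.timeout: 30000      # ZooKeeper session timeout, in milliseconds
topology.max.spout.pending: 500             # cap on un-acked tuples in flight per spout task

Raising the nimbus timeouts only masks the underlying problem, but it can buy enough headroom to capture the worker and ZooKeeper logs before the next reassignment kicks in.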
