The hard drive was at 18%. If it's not disk space related, it must be some kind of memory overflow? Hard to say, as nothing was running yet.
After killing the nimbus process and restarting, it's calmed down. I'll follow up in the morning or if it happens again. I'm starting to wonder if I should move away from m1.smalls, as I can't have these random spikes in prod.

Thanks a bunch Otis and Michael!
S

On Mon, Mar 3, 2014 at 8:18 PM, Otis Gospodnetic <[email protected]> wrote:

> Another possibility: sudo grep -i kill /var/log/messages*
> See http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
> On Mon, Mar 3, 2014 at 8:54 PM, Michael Rose <[email protected]> wrote:
>
>> Otis,
>>
>> I'm a fan of SPM for Storm, but there's other debugging that needs to be done here if the process quits constantly.
>>
>> Sean,
>>
>> Since you're using storm-deploy, I assume the processes are running under supervisor. It might be worth killing the supervisor by hand, then running it yourself (ssh as storm, cd storm/daemon, supervise .) and seeing what kind of errors show up.
>>
>> Are your disks perhaps filled?
>>
>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>> [email protected]
>>
>> On Mon, Mar 3, 2014 at 6:49 PM, Otis Gospodnetic <[email protected]> wrote:
>>
>>> Hi Sean,
>>>
>>> I don't think you can see the metrics you need to see with AWS CloudWatch. Have a look at SPM for Storm. You can share graphs from SPM directly if you want, so you don't have to grab and attach screenshots manually. See:
>>>
>>> http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/
>>> http://sematext.com/spm/
>>>
>>> My bet is that you'll see GC metrics spikes...
>>>
>>> Otis
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>> On Mon, Mar 3, 2014 at 8:21 PM, Sean Solbak <[email protected]> wrote:
>>>
>>>> I just created a brand new cluster with the storm-deploy command:
>>>>
>>>> lein deploy-storm --start --name storm-dev --commit 1bcc169f5096e03a4ae117efc65c0f9bcfa2fa22
>>>>
>>>> I had a meeting, did nothing to the box, and no topologies were run. I came back 2 hours later and Nimbus was at 100% CPU.
>>>>
>>>> I'm running on an m1.small on the following AMI: ami-58a3cf68. I'm unable to get a thread dump as the process is getting killed and restarted too fast. I did attach a 3-hour snapshot of the EC2 monitors. Any guidance would be much appreciated.
>>>>
>>>> Thanks,
>>>> S
>>>>
>>>> On Sun, Mar 2, 2014 at 9:11 PM, Sean Solbak <[email protected]> wrote:
>>>>
>>>>> The only error in the logs, which happened over 10 days ago, was:
>>>>>
>>>>> 2014-02-22 01:41:27 b.s.d.nimbus [ERROR] Error when processing event
>>>>> java.io.IOException: Unable to delete directory /mnt/storm/nimbus/stormdist/test-25-1393022928.
>>>>>         at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:981) ~[commons-io-1.4.jar:1.4]
>>>>>         at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381) ~[commons-io-1.4.jar:1.4]
>>>>>         at backtype.storm.util$rmr.invoke(util.clj:442) ~[storm-core-0.9.0.1.jar:na]
>>>>>         at backtype.storm.daemon.nimbus$do_cleanup.invoke(nimbus.clj:819) ~[storm-core-0.9.0.1.jar:na]
>>>>>         at backtype.storm.daemon.nimbus$fn__5528$exec_fn__1229__auto____5529$fn__5534.invoke(nimbus.clj:896) ~[storm-core-0.9.0.1.jar:na]
>>>>>         at backtype.storm.timer$schedule_recurring$this__3019.invoke(timer.clj:77) ~[storm-core-0.9.0.1.jar:na]
>>>>>         at backtype.storm.timer$mk_timer$fn__3002$fn__3003.invoke(timer.clj:33) ~[storm-core-0.9.0.1.jar:na]
>>>>>         at backtype.storm.timer$mk_timer$fn__3002.invoke(timer.clj:26) ~[storm-core-0.9.0.1.jar:na]
>>>>>         at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
>>>>>         at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_27]
>>>>>
>>>>> It's fine. I can rebuild a new cluster. storm-deploy makes it pretty easy.
>>>>>
>>>>> Thanks for your help on this!
>>>>>
>>>>> As for my other question:
>>>>>
>>>>> If my trident batch interval is 500ms and I keep the spout pending and batch size small enough, will I be able to get real-time results (i.e. sub 2 seconds)? I've played with the various metrics (I literally have a spreadsheet of parameters to results) and haven't been able to get it. Am I just doing it wrong? What would the key parameters be? The complete latency is 500 ms, but Trident seems to be way behind despite none of my bolts having a capacity > 0.6. This may have to do with Nimbus being throttled, so I will report back. But if there are people out there who have done this kind of thing, I'd like to know if I'm missing an obvious parameter or something.
>>>>>
>>>>> Thanks,
>>>>> S
>>>>>
>>>>> On Sun, Mar 2, 2014 at 8:09 PM, Michael Rose <[email protected]> wrote:
>>>>>
>>>>>> The fact that the process is being killed constantly is a red flag. Also, why are you running it as a client VM?
>>>>>>
>>>>>> Check your nimbus.log to see why it's restarting.
>>>>>>
>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>> [email protected]
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 7:50 PM, Sean Solbak <[email protected]> wrote:
>>>>>>
>>>>>>> uintx ErgoHeapSizeLimit             = 0            {product}
>>>>>>> uintx InitialHeapSize              := 27080896     {product}
>>>>>>> uintx LargePageHeapSizeThreshold    = 134217728    {product}
>>>>>>> uintx MaxHeapSize                  := 698351616    {product}
>>>>>>>
>>>>>>> So an initial size of ~25 MB and a max of ~666 MB.
>>>>>>>
>>>>>>> It's a client process (not server, i.e. the command is "java -client -Dstorm.options..."). The process gets killed and restarted continuously with a new PID (which makes the PID tough to get stats on). I don't have VisualVM, but if I run jstat -gc PID, I get:
>>>>>>>
>>>>>>> S0C    S1C    S0U  S1U    EC      EU      OC       OU      PC       PU       YGC  YGCT   FGC  FGCT   GCT
>>>>>>> 832.0  832.0  0.0  352.9  7168.0  1115.9  17664.0  1796.0  21248.0  16029.6  5    0.268  0    0.000  0.268
>>>>>>>
>>>>>>> At this point I'll likely just rebuild the cluster. It's not in prod yet, as I still need to tune it.
>>>>>>> I should have written 2 separate emails :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> S
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 7:10 PM, Michael Rose <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'm not seeing too much to substantiate that. What size heap are you running, and is it near filled? Perhaps attach VisualVM and check for GC activity.
>>>>>>>>
>>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>> On Sun, Mar 2, 2014 at 6:54 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Here it is. It appears to be some kind of race condition.
>>>>>>>>>
>>>>>>>>> http://pastebin.com/dANT8SQR
>>>>>>>>>
>>>>>>>>> On Sun, Mar 2, 2014 at 6:42 PM, Michael Rose <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Can you do a thread dump and pastebin it? It's a nice first step to figuring this out.
>>>>>>>>>>
>>>>>>>>>> I just checked on our Nimbus, and while it's on a larger machine, it's using <1% CPU. Also look in your logs for any clues.
>>>>>>>>>>
>>>>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>>>>> [email protected]
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 2, 2014 at 6:31 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> No, they are on separate machines. It's a 4-machine cluster: 2 workers, 1 Nimbus, and 1 Zookeeper.
>>>>>>>>>>>
>>>>>>>>>>> I suppose I could just create a new cluster, but I'd like to know why this is occurring to avoid future production outages.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> S
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 2, 2014 at 6:19 PM, Michael Rose <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Are you running Zookeeper on the same machine as the Nimbus box?
>>>>>>>>>>>>
>>>>>>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 2, 2014 at 6:16 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> This is the first step of 4. When I save to the db I'm actually saving to a queue (just using the db for now). In the 2nd step we index the data, and in the 3rd we do aggregation/counts for reporting. The last is a search that I'm planning on using DRPC for. Within step 2 we pipe certain datasets in real time to the clients they apply to. I'd like this and the DRPC to be sub 2s, which should be reasonable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're right that I could speed up step 1 by not using Trident, but our requirements seem like a good use case for it in the other 3 steps. With many results per second, batching shouldn't affect performance a ton if the batch size is small enough.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would cause Nimbus to be at 100% CPU with the topologies killed?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 2, 2014, at 5:46 PM, Sean Allen <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a reason you are using trident?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you don't need to handle the events as a batch, you are probably going to get better performance without it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm writing a fairly basic Trident topology as follows:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - 4 spouts of events
>>>>>>>>>>>>>> - merges into one stream
>>>>>>>>>>>>>> - serializes the object as an event in a string
>>>>>>>>>>>>>> - saves to db
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I split the serialization task away from the spout, as it was CPU intensive, to speed it up.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem I have is that after 10 minutes there are over 910k tuples emitted/transferred but only 193k records saved.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The overall load of the topology seems fine:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - 536.404 ms complete latency at the topology level
>>>>>>>>>>>>>> - the highest capacity of any bolt is 0.3, which is the serialization one
>>>>>>>>>>>>>> - each bolt task has sub-20 ms execute latency and sub-40 ms process latency
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So it seems Trident has all the records internally, but I need these events as close to real time as possible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does anyone have any guidance as to how to increase the throughput? Is it simply a matter of tweaking max spout pending and the batch size?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm running it on 2 m1.smalls for now. I don't see the need to upgrade until the demand on the boxes is higher, although CPU usage on the Nimbus box is pinned at about 99%. Why would that be? It's at 99% even when all the topologies are killed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are currently targeting processing 200 million records per day, which seems like it should be quite easy based on what I've read that other people have achieved. I realize that hardware should be able to boost this as well, but my first goal is to get Trident to push the records to the db more quickly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>> Sean
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ce n'est pas une signature
>>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks, Sean Solbak, BsC, MCSD, Solbak Technologies Inc., 780.893.7326 (m)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks, Sean Solbak, BsC, MCSD, Solbak Technologies Inc., 780.893.7326 (m)
>>>>>>>
>>>>>>> --
>>>>>>> Thanks, Sean Solbak, BsC, MCSD, Solbak Technologies Inc., 780.893.7326 (m)
>>>>>
>>>>> --
>>>>> Thanks, Sean Solbak, BsC, MCSD, Solbak Technologies Inc., 780.893.7326 (m)
>>>>
>>>> --
>>>> Thanks, Sean Solbak, BsC, MCSD, Solbak Technologies Inc., 780.893.7326 (m)

--
Thanks,
Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)
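
[For reference, a minimal sketch of the kind of Trident wiring and tuning knobs discussed in this thread: four spouts merged into one stream, a CPU-heavy serialization function as its own step, and a persist to a DB-backed state, with the two parameters mentioned above (max spout pending and the Trident batch emit interval) set explicitly. EventSpout, DbStateFactory, and DbUpdater are hypothetical placeholders for the poster's own spouts and state; the 500 ms value matches the batch interval mentioned in the thread, while the max-spout-pending value of 2 is only an illustrative small setting, not a recommendation.]

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.Stream;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.tuple.TridentTuple;

public class EventPersistTopology {

    // CPU-heavy serialization kept as its own function, as described in the thread.
    public static class SerializeEvent extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            Object event = tuple.getValue(0);
            // Stand-in serializer; replace toString() with the real one.
            collector.emit(new Values(event.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        TridentTopology topology = new TridentTopology();

        // EventSpout is a hypothetical placeholder for the four event spouts,
        // each assumed to emit a single field named "event".
        Stream merged = topology.merge(
                topology.newStream("events-1", new EventSpout("a")),
                topology.newStream("events-2", new EventSpout("b")),
                topology.newStream("events-3", new EventSpout("c")),
                topology.newStream("events-4", new EventSpout("d")));

        // Serialize, then persist each batch to the DB-backed state
        // (DbStateFactory/DbUpdater are hypothetical placeholders).
        merged.each(new Fields("event"), new SerializeEvent(), new Fields("json"))
              .partitionPersist(new DbStateFactory(), new Fields("json"), new DbUpdater());

        Config conf = new Config();
        // Max number of batches in flight at once (Trident counts batches, not tuples).
        conf.setMaxSpoutPending(2);
        // How often the master batch coordinator may emit a new batch.
        conf.put(Config.TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS, 500);

        StormSubmitter.submitTopology("event-persist", conf, topology.build());
    }
}

[Keeping max spout pending low bounds how many batches are outstanding, which trades throughput for lower end-to-end latency, while the emit interval caps how frequently new batches start; those two settings plus the per-batch size are the main knobs behind the sub-2-second question above.]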
