Another possibility:

    sudo grep -i kill /var/log/messages*

See http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Mar 3, 2014 at 8:54 PM, Michael Rose <[email protected]> wrote:

Otis,

I'm a fan of SPM for Storm, but there's other debugging that needs to be done here if the process quits constantly.

Sean,

Since you're using storm-deploy, I assume the processes are running under supervisor. It might be worth killing the supervisor by hand, then running it yourself (ssh in as storm, cd storm/daemon, supervise .) and seeing what kind of errors you see.

Are your disks perhaps filled?

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Mon, Mar 3, 2014 at 6:49 PM, Otis Gospodnetic <[email protected]> wrote:

Hi Sean,

I don't think you can see the metrics you need to see with AWS CloudWatch. Have a look at SPM for Storm. You can share graphs from SPM directly if you want, so you don't have to grab and attach screenshots manually. See:

http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/
http://sematext.com/spm/

My bet is that you'll see GC metric spikes....

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Mar 3, 2014 at 8:21 PM, Sean Solbak <[email protected]> wrote:

I just created a brand new cluster with the storm-deploy command:

    lein deploy-storm --start --name storm-dev --commit 1bcc169f5096e03a4ae117efc65c0f9bcfa2fa22

I had a meeting and did nothing to the box; no topologies were run. I came back 2 hours later and nimbus was at 100% CPU.

I'm running on an m1-small on the following AMI: ami-58a3cf68. I'm unable to get a thread dump as the process is getting killed and restarted too fast. I did attach a 3-hour snapshot of the EC2 monitors. Any guidance would be much appreciated.

Thanks,
S


On Sun, Mar 2, 2014 at 9:11 PM, Sean Solbak <[email protected]> wrote:

The only error in the logs, which happened over 10 days ago, was:

    2014-02-22 01:41:27 b.s.d.nimbus [ERROR] Error when processing event
    java.io.IOException: Unable to delete directory /mnt/storm/nimbus/stormdist/test-25-1393022928.
            at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:981) ~[commons-io-1.4.jar:1.4]
            at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381) ~[commons-io-1.4.jar:1.4]
            at backtype.storm.util$rmr.invoke(util.clj:442) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.daemon.nimbus$do_cleanup.invoke(nimbus.clj:819) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.daemon.nimbus$fn__5528$exec_fn__1229__auto____5529$fn__5534.invoke(nimbus.clj:896) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.timer$schedule_recurring$this__3019.invoke(timer.clj:77) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.timer$mk_timer$fn__3002$fn__3003.invoke(timer.clj:33) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.timer$mk_timer$fn__3002.invoke(timer.clj:26) ~[storm-core-0.9.0.1.jar:na]
            at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
            at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_27]

It's fine. I can rebuild a new cluster. storm-deploy makes it pretty easy.
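(As an aside on the disk-full suspicion raised earlier in the thread: the usual check is df on the nimbus box, but a rough JVM-side equivalent is sketched below. The /mnt/storm path is an assumption and should be whatever storm.local.dir points at in your storm.yaml; this is illustrative, not code from the thread.)

    import java.io.File;

    public class DiskCheck {
        public static void main(String[] args) {
            // Assumed storm.local.dir location on a storm-deploy nimbus box.
            File stormLocal = new File("/mnt/storm");
            long freeMb = stormLocal.getUsableSpace() / (1024 * 1024);
            long totalMb = stormLocal.getTotalSpace() / (1024 * 1024);
            System.out.println("storm.local.dir free: " + freeMb + " MB of " + totalMb + " MB");
        }
    }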
Thanks for your help on this!

As for my other question: if my trident batch interval is 500 ms and I keep the spout pending and batch size small enough, will I be able to get real-time results (i.e. sub 2 seconds)? I've played with the various parameters (I literally have a spreadsheet of parameters to results) and haven't been able to get there. Am I just doing it wrong? What would the key parameters be? The complete latency is 500 ms, but trident seems to be way behind despite none of my bolts having a capacity > 0.6. This may have to do with nimbus being throttled, so I will report back. But if there are people out there who have done this kind of thing, I'd like to know if I'm missing an obvious parameter or something.

Thanks,
S


On Sun, Mar 2, 2014 at 8:09 PM, Michael Rose <[email protected]> wrote:

The fact that the process is being killed constantly is a red flag. Also, why are you running it as a client VM?

Check your nimbus.log to see why it's restarting.

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Sun, Mar 2, 2014 at 7:50 PM, Sean Solbak <[email protected]> wrote:

    uintx ErgoHeapSizeLimit            = 0            {product}
    uintx InitialHeapSize             := 27080896     {product}
    uintx LargePageHeapSizeThreshold   = 134217728    {product}
    uintx MaxHeapSize                 := 698351616    {product}

So: an initial heap of ~25 MB and a max of ~666 MB.

It's a client process (not server; i.e. the command is "java -client -Dstorm.options..."). The process gets killed and restarted continuously with a new PID, which makes it tough to get a PID to pull stats on. I don't have VisualVM, but if I run jstat -gc PID, I get:

    S0C    S1C    S0U   S1U    EC      EU      OC       OU      PC       PU       YGC  YGCT   FGC  FGCT   GCT
    832.0  832.0  0.0   352.9  7168.0  1115.9  17664.0  1796.0  21248.0  16029.6  5    0.268  0    0.000  0.268

At this point I'll likely just rebuild the cluster. It's not in prod yet, as I still need to tune it. I should have written 2 separate emails :)

Thanks,
S


On Sun, Mar 2, 2014 at 7:10 PM, Michael Rose <[email protected]> wrote:

I'm not seeing too much to substantiate that. What size heap are you running, and is it near filled? Perhaps attach VisualVM and check for GC activity.

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Sun, Mar 2, 2014 at 6:54 PM, Sean Solbak <[email protected]> wrote:

Here it is. Appears to be some kind of race condition.

http://pastebin.com/dANT8SQR


On Sun, Mar 2, 2014 at 6:42 PM, Michael Rose <[email protected]> wrote:

Can you do a thread dump and pastebin it? It's a nice first step to figure this out.

I just checked on our Nimbus and, while it's on a larger machine, it's using <1% CPU. Also look in your logs for any clues.
Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Sun, Mar 2, 2014 at 6:31 PM, Sean Solbak <[email protected]> wrote:

No, they are on separate machines. It's a 4-machine cluster: 2 workers, 1 nimbus and 1 zookeeper.

I suppose I could just create a new cluster, but I'd like to know why this is occurring to avoid future production outages.

Thanks,
S


On Sun, Mar 2, 2014 at 6:19 PM, Michael Rose <[email protected]> wrote:

Are you running Zookeeper on the same machine as the Nimbus box?

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Sun, Mar 2, 2014 at 6:16 PM, Sean Solbak <[email protected]> wrote:

This is the first step of 4. When I save to the db I'm actually saving to a queue (just using the db for now). In the 2nd step we index the data, and in the 3rd we do aggregation/counts for reporting. The last is a search that I'm planning on using drpc for. Within step 2 we pipe certain datasets in real time to the clients they apply to. I'd like this and the drpc to be sub 2s, which should be reasonable.

You're right that I could speed up step 1 by not using trident, but our requirements seem like a good use case for the other 3 steps. With many results per second, batching shouldn't affect performance a ton if the batch size is small enough.

What would cause nimbus to be at 100% CPU with the topologies killed?

Sent from my iPhone

On Mar 2, 2014, at 5:46 PM, Sean Allen <[email protected]> wrote:

Is there a reason you are using trident?

If you don't need to handle the events as a batch, you are probably going to get better performance w/o it.


On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <[email protected]> wrote:

I'm writing a fairly basic trident topology as follows:

- 4 spouts of events
- merges into one stream
- serializes the object as an event in a string
- saves to db

I split the serialization task away from the spout, as it was CPU intensive, to speed it up.

The problem I have is that after 10 minutes there are over 910k tuples emitted/transferred but only 193k records saved.

The overall load of the topology seems fine:

- 536.404 ms complete latency at the topology level
- the highest capacity of any bolt is 0.3, which is the serialization one
- each bolt task has sub-20 ms execute latency and sub-40 ms process latency
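(For reference, a minimal sketch of a topology shaped like the one described above, against the Storm 0.9.x Trident API. This is not Sean's actual code: the FixedBatchSpout test spouts and the Debug() filter are stand-ins for the real event spouts and the db/queue persist step, and the field names are assumed.)

    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.Stream;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Debug;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.tuple.TridentTuple;

    public class EventPipelineSketch {

        // The CPU-heavy serialization step, split out of the spout as described above.
        public static class SerializeEvent extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                Object event = tuple.getValue(0);
                collector.emit(new Values(event.toString())); // stand-in for the real serializer
            }
        }

        // Small test spouts stand in for the four real event spouts.
        private static FixedBatchSpout eventSpout() {
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("event"), 5,
                    new Values("event-a"), new Values("event-b"), new Values("event-c"));
            spout.setCycle(true);
            return spout;
        }

        public static TridentTopology build() {
            TridentTopology topology = new TridentTopology();

            // Four spouts merged into a single stream.
            Stream events = topology.merge(
                    topology.newStream("events-1", eventSpout()),
                    topology.newStream("events-2", eventSpout()),
                    topology.newStream("events-3", eventSpout()),
                    topology.newStream("events-4", eventSpout()));

            // Serialize, then hand off to persistence. Debug() just prints here; the real
            // topology would end in a partitionPersist(...) that writes to the db/queue.
            events.each(new Fields("event"), new SerializeEvent(), new Fields("payload"))
                  .each(new Fields("payload"), new Debug());

            return topology;
        }
    }

Calling topology.build() on the result gives the StormTopology to submit; a parallelismHint() after the each() is one way to scale the serialization step independently of the spouts.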
So it seems trident has all the records internally, but I need these events as close to realtime as possible.

Does anyone have any guidance as to how to increase the throughput? Is it simply a matter of tweaking max spout pending and the batch size?

I'm running it on 2 m1-smalls for now. I don't see the need to upgrade until the demand on the boxes is higher, although CPU usage on the nimbus box is pinned at about 99%. Why would that be? It's at 99% even when all the topologies are killed.

We are currently targeting processing 200 million records per day, which seems like it should be quite easy based on what I've read other people have achieved. I realize that better hardware should be able to boost this as well, but my first goal is to get trident to push the records to the db quicker.

Thanks in advance,
Sean


--
Ce n'est pas une signature

--
Thanks,

Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)
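(On the "max spout pending and batch size" question above: the two knobs usually discussed for Trident latency are max spout pending, which for Trident counts batches in flight rather than individual tuples, and the batch emit interval. A rough sketch of setting them on a 0.9.x topology config follows; the values shown are illustrative, not recommendations from the thread.)

    import backtype.storm.Config;

    public class LowLatencyConf {
        public static Config build() {
            Config conf = new Config();

            // For Trident this caps the number of *batches* in flight per spout,
            // so a backlog cannot build up ahead of the persist step.
            conf.setMaxSpoutPending(2);

            // How often Trident emits a new batch; the storm.yaml default is 500 ms.
            // (Key name as defined in defaults.yaml for Storm 0.9.x.)
            conf.put("topology.trident.batch.emit.interval.millis", 200);

            return conf;
        }
    }

Whether this gets sub-2-second results end to end still depends mostly on how much work the persist step does per batch; it does not explain a nimbus process pinned at 99% CPU with no topologies running, which is the separate issue chased earlier in the thread.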
