Another possibility:

    sudo grep -i kill /var/log/messages*

See http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Mar 3, 2014 at 8:54 PM, Michael Rose <[email protected]> wrote:

Otis,

I'm a fan of SPM for Storm, but there's other debugging that needs to be done here if the process quits constantly.

Sean,

Since you're using storm-deploy, I assume the processes are running under supervisor. It might be worth killing the supervisor by hand, then running it yourself (ssh in as storm, cd storm/daemon, supervise .) and seeing what kind of errors you see.

Are your disks perhaps filled?

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Mon, Mar 3, 2014 at 6:49 PM, Otis Gospodnetic <[email protected]> wrote:

Hi Sean,

I don't think you can see the metrics you need to see with AWS CloudWatch. Have a look at SPM for Storm. You can share graphs from SPM directly if you want, so you don't have to grab and attach screenshots manually. See:

http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/
http://sematext.com/spm/

My bet is that you'll see GC metric spikes....

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Mar 3, 2014 at 8:21 PM, Sean Solbak <[email protected]> wrote:

I just created a brand new cluster with the storm-deploy command:

    lein deploy-storm --start --name storm-dev --commit 1bcc169f5096e03a4ae117efc65c0f9bcfa2fa22

I had a meeting and did nothing to the box; no topologies were run. I came back 2 hours later and nimbus was at 100% CPU.

I'm running on an m1-small on the following AMI: ami-58a3cf68. I'm unable to get a thread dump as the process is getting killed and restarted too fast. I did attach a 3-hour snapshot of the EC2 monitors. Any guidance would be much appreciated.

Thanks,
S


On Sun, Mar 2, 2014 at 9:11 PM, Sean Solbak <[email protected]> wrote:

The only error in the logs, which happened over 10 days ago, was:

    2014-02-22 01:41:27 b.s.d.nimbus [ERROR] Error when processing event
    java.io.IOException: Unable to delete directory /mnt/storm/nimbus/stormdist/test-25-1393022928.
            at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:981) ~[commons-io-1.4.jar:1.4]
            at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381) ~[commons-io-1.4.jar:1.4]
            at backtype.storm.util$rmr.invoke(util.clj:442) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.daemon.nimbus$do_cleanup.invoke(nimbus.clj:819) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.daemon.nimbus$fn__5528$exec_fn__1229__auto____5529$fn__5534.invoke(nimbus.clj:896) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.timer$schedule_recurring$this__3019.invoke(timer.clj:77) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.timer$mk_timer$fn__3002$fn__3003.invoke(timer.clj:33) ~[storm-core-0.9.0.1.jar:na]
            at backtype.storm.timer$mk_timer$fn__3002.invoke(timer.clj:26) ~[storm-core-0.9.0.1.jar:na]
            at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
            at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_27]

It's fine. I can rebuild a new cluster. storm-deploy makes it pretty easy.
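(As an aside on the disk-full suspicion raised earlier in the thread: the usual check is df on the nimbus box, but a rough JVM-side equivalent is sketched below. The /mnt/storm path is an assumption and should be whatever storm.local.dir points at in your storm.yaml; this is illustrative, not code from the thread.)

    import java.io.File;

    public class DiskCheck {
        public static void main(String[] args) {
            // Assumed storm.local.dir location on a storm-deploy nimbus box.
            File stormLocal = new File("/mnt/storm");
            long freeMb = stormLocal.getUsableSpace() / (1024 * 1024);
            long totalMb = stormLocal.getTotalSpace() / (1024 * 1024);
            System.out.println("storm.local.dir free: " + freeMb + " MB of " + totalMb + " MB");
        }
    }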
Thanks for your help on this!

As for my other question: if my trident batch interval is 500 ms and I keep the spout pending and batch size small enough, will I be able to get real-time results (i.e. sub 2 seconds)? I've played with the various parameters (I literally have a spreadsheet of parameters to results) and haven't been able to get there. Am I just doing it wrong? What would the key parameters be? The complete latency is 500 ms, but trident seems to be way behind despite none of my bolts having a capacity > 0.6. This may have to do with nimbus being throttled, so I will report back. But if there are people out there who have done this kind of thing, I'd like to know if I'm missing an obvious parameter or something.

Thanks,
S


On Sun, Mar 2, 2014 at 8:09 PM, Michael Rose <[email protected]> wrote:

The fact that the process is being killed constantly is a red flag. Also, why are you running it as a client VM?

Check your nimbus.log to see why it's restarting.

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Sun, Mar 2, 2014 at 7:50 PM, Sean Solbak <[email protected]> wrote:

    uintx ErgoHeapSizeLimit            = 0            {product}
    uintx InitialHeapSize             := 27080896     {product}
    uintx LargePageHeapSizeThreshold   = 134217728    {product}
    uintx MaxHeapSize                 := 698351616    {product}

So: an initial heap of ~25 MB and a max of ~666 MB.

It's a client process (not server; i.e. the command is "java -client -Dstorm.options..."). The process gets killed and restarted continuously with a new PID, which makes it tough to get a PID to pull stats on. I don't have VisualVM, but if I run jstat -gc PID, I get:

    S0C    S1C    S0U   S1U    EC      EU      OC       OU      PC       PU       YGC  YGCT   FGC  FGCT   GCT
    832.0  832.0  0.0   352.9  7168.0  1115.9  17664.0  1796.0  21248.0  16029.6  5    0.268  0    0.000  0.268

At this point I'll likely just rebuild the cluster. It's not in prod yet, as I still need to tune it. I should have written 2 separate emails :)

Thanks,
S


On Sun, Mar 2, 2014 at 7:10 PM, Michael Rose <[email protected]> wrote:

I'm not seeing too much to substantiate that. What size heap are you running, and is it near filled? Perhaps attach VisualVM and check for GC activity.

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Sun, Mar 2, 2014 at 6:54 PM, Sean Solbak <[email protected]> wrote:

Here it is. Appears to be some kind of race condition.

http://pastebin.com/dANT8SQR


On Sun, Mar 2, 2014 at 6:42 PM, Michael Rose <[email protected]> wrote:

Can you do a thread dump and pastebin it? It's a nice first step to figure this out.

I just checked on our Nimbus and, while it's on a larger machine, it's using <1% CPU. Also look in your logs for any clues.
Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Sun, Mar 2, 2014 at 6:31 PM, Sean Solbak <[email protected]> wrote:

No, they are on separate machines. It's a 4-machine cluster: 2 workers, 1 nimbus and 1 zookeeper.

I suppose I could just create a new cluster, but I'd like to know why this is occurring to avoid future production outages.

Thanks,
S


On Sun, Mar 2, 2014 at 6:19 PM, Michael Rose <[email protected]> wrote:

Are you running Zookeeper on the same machine as the Nimbus box?

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Sun, Mar 2, 2014 at 6:16 PM, Sean Solbak <[email protected]> wrote:

This is the first step of 4. When I save to the db I'm actually saving to a queue (just using the db for now). In the 2nd step we index the data, and in the 3rd we do aggregation/counts for reporting. The last is a search that I'm planning on using drpc for. Within step 2 we pipe certain datasets in real time to the clients they apply to. I'd like this and the drpc to be sub 2s, which should be reasonable.

You're right that I could speed up step 1 by not using trident, but our requirements seem like a good use case for the other 3 steps. With many results per second, batching shouldn't affect performance a ton if the batch size is small enough.

What would cause nimbus to be at 100% CPU with the topologies killed?

Sent from my iPhone

On Mar 2, 2014, at 5:46 PM, Sean Allen <[email protected]> wrote:

Is there a reason you are using trident?

If you don't need to handle the events as a batch, you are probably going to get better performance w/o it.


On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <[email protected]> wrote:

I'm writing a fairly basic trident topology as follows:

- 4 spouts of events
- merges into one stream
- serializes the object as an event in a string
- saves to db

I split the serialization task away from the spout, as it was CPU intensive, to speed it up.

The problem I have is that after 10 minutes there are over 910k tuples emitted/transferred but only 193k records saved.

The overall load of the topology seems fine:

- 536.404 ms complete latency at the topology level
- the highest capacity of any bolt is 0.3, which is the serialization one
- each bolt task has sub-20 ms execute latency and sub-40 ms process latency
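(For reference, a minimal sketch of a topology shaped like the one described above, against the Storm 0.9.x Trident API. This is not Sean's actual code: the FixedBatchSpout test spouts and the Debug() filter are stand-ins for the real event spouts and the db/queue persist step, and the field names are assumed.)

    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.Stream;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Debug;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.tuple.TridentTuple;

    public class EventPipelineSketch {

        // The CPU-heavy serialization step, split out of the spout as described above.
        public static class SerializeEvent extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                Object event = tuple.getValue(0);
                collector.emit(new Values(event.toString())); // stand-in for the real serializer
            }
        }

        // Small test spouts stand in for the four real event spouts.
        private static FixedBatchSpout eventSpout() {
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("event"), 5,
                    new Values("event-a"), new Values("event-b"), new Values("event-c"));
            spout.setCycle(true);
            return spout;
        }

        public static TridentTopology build() {
            TridentTopology topology = new TridentTopology();

            // Four spouts merged into a single stream.
            Stream events = topology.merge(
                    topology.newStream("events-1", eventSpout()),
                    topology.newStream("events-2", eventSpout()),
                    topology.newStream("events-3", eventSpout()),
                    topology.newStream("events-4", eventSpout()));

            // Serialize, then hand off to persistence. Debug() just prints here; the real
            // topology would end in a partitionPersist(...) that writes to the db/queue.
            events.each(new Fields("event"), new SerializeEvent(), new Fields("payload"))
                  .each(new Fields("payload"), new Debug());

            return topology;
        }
    }

Calling topology.build() on the result gives the StormTopology to submit; a parallelismHint() after the each() is one way to scale the serialization step independently of the spouts.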
So it seems trident has all the records internally, but I need these events as close to realtime as possible.

Does anyone have any guidance as to how to increase the throughput? Is it simply a matter of tweaking max spout pending and the batch size?

I'm running it on 2 m1-smalls for now. I don't see the need to upgrade until the demand on the boxes is higher, although CPU usage on the nimbus box is pinned at about 99%. Why would that be? It's at 99% even when all the topologies are killed.

We are currently targeting processing 200 million records per day, which seems like it should be quite easy based on what I've read other people have achieved. I realize that better hardware should be able to boost this as well, but my first goal is to get trident to push the records to the db quicker.

Thanks in advance,
Sean


--
Ce n'est pas une signature

--
Thanks,

Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)
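(On the "max spout pending and batch size" question above: the two knobs usually discussed for Trident latency are max spout pending, which for Trident counts batches in flight rather than individual tuples, and the batch emit interval. A rough sketch of setting them on a 0.9.x topology config follows; the values shown are illustrative, not recommendations from the thread.)

    import backtype.storm.Config;

    public class LowLatencyConf {
        public static Config build() {
            Config conf = new Config();

            // For Trident this caps the number of *batches* in flight per spout,
            // so a backlog cannot build up ahead of the persist step.
            conf.setMaxSpoutPending(2);

            // How often Trident emits a new batch; the storm.yaml default is 500 ms.
            // (Key name as defined in defaults.yaml for Storm 0.9.x.)
            conf.put("topology.trident.batch.emit.interval.millis", 200);

            return conf;
        }
    }

Whether this gets sub-2-second results end to end still depends mostly on how much work the persist step does per batch; it does not explain a nimbus process pinned at 99% CPU with no topologies running, which is the separate issue chased earlier in the thread.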
