Otis, I'm a fan of SPM for Storm, but there's other debugging that needs to be done here if the process quits constantly.
Sean, since you're using storm-deploy, I assume the processes are running under supervisor. It might be worth killing the supervisor by hand, then running the daemon yourself (ssh as storm, cd storm/daemon, supervise .) and seeing what kind of errors you get. Are your disks perhaps full?

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Mon, Mar 3, 2014 at 6:49 PM, Otis Gospodnetic <[email protected]> wrote:

> Hi Sean,
>
> I don't think you can see the metrics you need to see with AWS CloudWatch.
> Have a look at SPM for Storm. You can share graphs from SPM directly if
> you want, so you don't have to grab and attach screenshots manually. See:
>
> http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/
> http://sematext.com/spm/
>
> My bet is that you'll see GC metric spikes...
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Mon, Mar 3, 2014 at 8:21 PM, Sean Solbak <[email protected]> wrote:
>
>> I just created a brand new cluster with the storm-deploy command:
>>
>> lein deploy-storm --start --name storm-dev --commit 1bcc169f5096e03a4ae117efc65c0f9bcfa2fa22
>>
>> I had a meeting and did nothing to the box; no topologies were run. I came
>> back 2 hours later and nimbus was at 100% CPU.
>>
>> I'm running on an m1.small on the following AMI - ami-58a3cf68. I'm
>> unable to get a thread dump because the process is getting killed and
>> restarted too fast. I did attach a 3-hour snapshot of the EC2 monitors.
>> Any guidance would be much appreciated.
>>
>> Thanks,
>> S
>>
>>
>> On Sun, Mar 2, 2014 at 9:11 PM, Sean Solbak <[email protected]> wrote:
>>
>>> The only error in the logs, which happened over 10 days ago, was:
>>>
>>> 2014-02-22 01:41:27 b.s.d.nimbus [ERROR] Error when processing event
>>> java.io.IOException: Unable to delete directory /mnt/storm/nimbus/stormdist/test-25-1393022928.
>>>     at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:981) ~[commons-io-1.4.jar:1.4]
>>>     at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381) ~[commons-io-1.4.jar:1.4]
>>>     at backtype.storm.util$rmr.invoke(util.clj:442) ~[storm-core-0.9.0.1.jar:na]
>>>     at backtype.storm.daemon.nimbus$do_cleanup.invoke(nimbus.clj:819) ~[storm-core-0.9.0.1.jar:na]
>>>     at backtype.storm.daemon.nimbus$fn__5528$exec_fn__1229__auto____5529$fn__5534.invoke(nimbus.clj:896) ~[storm-core-0.9.0.1.jar:na]
>>>     at backtype.storm.timer$schedule_recurring$this__3019.invoke(timer.clj:77) ~[storm-core-0.9.0.1.jar:na]
>>>     at backtype.storm.timer$mk_timer$fn__3002$fn__3003.invoke(timer.clj:33) ~[storm-core-0.9.0.1.jar:na]
>>>     at backtype.storm.timer$mk_timer$fn__3002.invoke(timer.clj:26) ~[storm-core-0.9.0.1.jar:na]
>>>     at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
>>>     at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_27]
>>>
>>> It's fine. I can rebuild a new cluster; storm-deploy makes it pretty easy.
>>>
>>> Thanks for your help on this!
>>>
>>> As for my other question:
>>>
>>> If my Trident batch interval is 500 ms and I keep max spout pending and
>>> the batch size small enough, will I be able to get real-time results
>>> (i.e. sub 2 seconds)? I've played with the various settings (I literally
>>> have a spreadsheet of parameters to results) and haven't been able to get
>>> there. Am I just doing it wrong? What would the key parameters be? The
>>> complete latency is 500 ms, but Trident seems to be way behind, despite
>>> none of my bolts having a capacity > 0.6. This may have to do with nimbus
>>> being throttled, so I will report back. But if there are people out there
>>> who have done this kind of thing, I'd like to know if I'm missing an
>>> obvious parameter or something.
>>>
>>> Thanks,
>>> S
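For reference, the knobs being asked about here are topology-level settings on Storm's Config. A minimal sketch, assuming Storm 0.9.x's backtype.storm.Config; the values are illustrative placeholders, not recommendations:

    import backtype.storm.Config;

    public class LowLatencyConfigSketch {
        public static Config build() {
            Config conf = new Config();
            // How often Trident's master batch coordinator emits a new batch (ms).
            // Lower values mean fresher results but more per-batch coordination overhead.
            conf.put(Config.TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS, 500);
            // Maximum number of batches in flight at once. Small values bound latency;
            // too small and throughput drops because the pipeline sits idle.
            conf.setMaxSpoutPending(2);
            conf.setNumWorkers(2);
            return conf;
        }
    }

The batch size itself is spout-specific (e.g. a max-batch-size option on the particular spout implementation), so it is tuned on the spout rather than through Config.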
>>>
>>>
>>> On Sun, Mar 2, 2014 at 8:09 PM, Michael Rose <[email protected]> wrote:
>>>
>>>> The fact that the process is being killed constantly is a red flag.
>>>> Also, why are you running it as a client VM?
>>>>
>>>> Check your nimbus.log to see why it's restarting.
>>>>
>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>> [email protected]
>>>>
>>>>
>>>> On Sun, Mar 2, 2014 at 7:50 PM, Sean Solbak <[email protected]> wrote:
>>>>
>>>>> uintx ErgoHeapSizeLimit             = 0          {product}
>>>>> uintx InitialHeapSize              := 27080896   {product}
>>>>> uintx LargePageHeapSizeThreshold    = 134217728  {product}
>>>>> uintx MaxHeapSize                  := 698351616  {product}
>>>>>
>>>>> So: an initial heap size of ~25 MB and a max of ~666 MB.
>>>>>
>>>>> It's a client process (not server, i.e. the command is "java -client
>>>>> -Dstorm.options..."). The process gets killed and restarted continuously
>>>>> with a new PID, which makes it tough to grab the PID and get stats on it.
>>>>> I don't have VisualVM, but if I run jstat -gc PID, I get:
>>>>>
>>>>>  S0C    S1C   S0U    S1U     EC      EU      OC      OU      PC      PU    YGC   YGCT  FGC   FGCT   GCT
>>>>> 832.0  832.0  0.0  352.9  7168.0  1115.9  17664.0  1796.0  21248.0  16029.6   5  0.268    0  0.000  0.268
>>>>>
>>>>> At this point I'll likely just rebuild the cluster. It's not in prod
>>>>> yet as I still need to tune it. I should have written 2 separate emails :)
>>>>>
>>>>> Thanks,
>>>>> S
>>>>>
>>>>>
>>>>> On Sun, Mar 2, 2014 at 7:10 PM, Michael Rose <[email protected]> wrote:
>>>>>
>>>>>> I'm not seeing too much to substantiate that. What size heap are you
>>>>>> running, and is it nearly full? Perhaps attach VisualVM and check for GC
>>>>>> activity.
>>>>>>
>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>> [email protected]
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 6:54 PM, Sean Solbak <[email protected]> wrote:
>>>>>>
>>>>>>> Here it is. It appears to be some kind of race condition.
>>>>>>>
>>>>>>> http://pastebin.com/dANT8SQR
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 6:42 PM, Michael Rose <[email protected]> wrote:
>>>>>>>
>>>>>>>> Can you do a thread dump and pastebin it? It's a nice first step to
>>>>>>>> figure this out.
>>>>>>>>
>>>>>>>> I just checked on our Nimbus, and while it's on a larger machine,
>>>>>>>> it's using <1% CPU. Also look in your logs for any clues.
>>>>>>>>
>>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Mar 2, 2014 at 6:31 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> No, they are on separate machines. It's a 4-machine cluster - 2
>>>>>>>>> workers, 1 nimbus and 1 zookeeper.
>>>>>>>>>
>>>>>>>>> I suppose I could just create a new cluster, but I'd like to know
>>>>>>>>> why this is occurring to avoid future production outages.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> S
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Mar 2, 2014 at 6:19 PM, Michael Rose <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Are you running Zookeeper on the same machine as the Nimbus box?
>>>>>>>>>>
>>>>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>>>>> [email protected]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 2, 2014 at 6:16 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is the first step of 4. When I save to db, I'm actually
>>>>>>>>>>> saving to a queue (just using the db for now). In the 2nd step we
>>>>>>>>>>> index the data, and in the 3rd we do aggregation/counts for
>>>>>>>>>>> reporting. The last is a search that I'm planning on using DRPC
>>>>>>>>>>> for. Within step 2 we pipe certain datasets in real time to the
>>>>>>>>>>> clients they apply to. I'd like this and the DRPC to be sub-2s,
>>>>>>>>>>> which should be reasonable.
>>>>>>>>>>>
>>>>>>>>>>> You're right that I could speed up step 1 by not using Trident,
>>>>>>>>>>> but our requirements seem like a good use case for the other 3
>>>>>>>>>>> steps. With many results per second, batching shouldn't affect
>>>>>>>>>>> performance a ton if the batch size is small enough.
>>>>>>>>>>>
>>>>>>>>>>> What would cause nimbus to be at 100% CPU with the topologies
>>>>>>>>>>> killed?
>>>>>>>>>>>
>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>
>>>>>>>>>>> On Mar 2, 2014, at 5:46 PM, Sean Allen <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Is there a reason you are using Trident?
>>>>>>>>>>>
>>>>>>>>>>> If you don't need to handle the events as a batch, you are
>>>>>>>>>>> probably going to get better performance without it.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm writing a fairly basic Trident topology as follows:
>>>>>>>>>>>>
>>>>>>>>>>>> - 4 spouts of events
>>>>>>>>>>>> - merges into one stream
>>>>>>>>>>>> - serializes the object as an event in a string
>>>>>>>>>>>> - saves to db
>>>>>>>>>>>>
>>>>>>>>>>>> I split the serialization task away from the spout, as it was
>>>>>>>>>>>> CPU-intensive, to speed it up.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem I have is that after 10 minutes there are over 910k
>>>>>>>>>>>> tuples emitted/transferred but only 193k records are saved.
>>>>>>>>>>>>
>>>>>>>>>>>> The overall load of the topology seems fine:
>>>>>>>>>>>>
>>>>>>>>>>>> - 536.404 ms complete latency at the topology level
>>>>>>>>>>>> - The highest capacity of any bolt is 0.3, which is the
>>>>>>>>>>>> serialization one.
>>>>>>>>>>>> - Each bolt task has sub-20 ms execute latency and sub-40 ms
>>>>>>>>>>>> process latency.
>>>>>>>>>>>>
>>>>>>>>>>>> So it seems Trident has all the records internally, but I need
>>>>>>>>>>>> these events as close to real time as possible.
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have any guidance as to how to increase the
>>>>>>>>>>>> throughput? Is it simply a matter of tweaking max spout pending
>>>>>>>>>>>> and the batch size?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm running it on 2 m1.smalls for now. I don't see the need to
>>>>>>>>>>>> upgrade until the demand on the boxes is higher, although CPU
>>>>>>>>>>>> usage on the nimbus box is pinned at 99%. Why would that be? It's
>>>>>>>>>>>> at 99% even when all the topologies are killed.
>>>>>>>>>>>>
>>>>>>>>>>>> We are currently targeting processing 200 million records per
>>>>>>>>>>>> day, which seems like it should be quite easy based on what I've
>>>>>>>>>>>> read other people have achieved. I realize that hardware should
>>>>>>>>>>>> be able to boost this as well, but my first goal is to get
>>>>>>>>>>>> Trident to push the records to the db quicker.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>> Sean
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ce n'est pas une signature
>>
>>
>> --
>> Thanks,
>>
>> Sean Solbak, BsC, MCSD
>> Solbak Technologies Inc.
>> 780.893.7326 (m)
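For reference, a rough sketch of the topology shape described in that last message (four spouts merged into one stream, a CPU-heavy serialization step, then a persist to the database). The Trident calls (newStream, merge, each, partitionPersist) are from the storm.trident API in 0.9.x, but SerializeEvent, DbStateFactory and DbUpdater are hypothetical placeholder classes, and the field names are made up:

    import backtype.storm.tuple.Fields;
    import storm.trident.Stream;
    import storm.trident.TridentTopology;
    import storm.trident.spout.IBatchSpout;

    public class EventTopologySketch {
        // Spouts are assumed to emit a single "event" field (placeholder name).
        public static TridentTopology build(IBatchSpout spout1, IBatchSpout spout2,
                                            IBatchSpout spout3, IBatchSpout spout4) {
            TridentTopology topology = new TridentTopology();

            Stream s1 = topology.newStream("events-1", spout1);
            Stream s2 = topology.newStream("events-2", spout2);
            Stream s3 = topology.newStream("events-3", spout3);
            Stream s4 = topology.newStream("events-4", spout4);

            topology.merge(s1, s2, s3, s4)
                    // SerializeEvent: a placeholder Function doing the CPU-heavy
                    // serialization, kept out of the spout and given its own parallelism.
                    .each(new Fields("event"), new SerializeEvent(), new Fields("json"))
                    .parallelismHint(4)
                    // DbStateFactory/DbUpdater: placeholder state classes that write each
                    // partition's batch to the database in one call instead of per tuple.
                    .partitionPersist(new DbStateFactory(), new Fields("json"), new DbUpdater());

            return topology;
        }
    }

With a setup like this, the persist step sees one write per partition per batch, so how quickly results reach the database mostly comes down to the batch emit interval and max spout pending discussed earlier in the thread.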
