Hi Sean,

I don't think you can see the metrics you need with AWS CloudWatch. Have a look at SPM for Storm. You can share graphs from SPM directly if you want, so you don't have to grab and attach screenshots manually. See:

http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/
http://sematext.com/spm/

My bet is that you'll see GC metrics spikes.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
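If you want to check that bet without extra tooling, the JVM already exposes GC counters over JMX. The sketch below is illustrative only (a standalone sampler for whatever JVM it runs in, or one you attach to remotely); it is not part of Storm or SPM.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative only: prints GC collection counts/times for the JVM it runs in.
// To watch nimbus itself you would attach over JMX (jconsole/VisualVM) instead.
public class GcSampler {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d, time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(5000); // sample every 5 seconds; a GC problem shows up as rapidly growing time
        }
    }
}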
On Mon, Mar 3, 2014 at 8:21 PM, Sean Solbak <[email protected]> wrote:

> I just created a brand new cluster with the storm-deploy command:
>
> lein deploy-storm --start --name storm-dev --commit 1bcc169f5096e03a4ae117efc65c0f9bcfa2fa22
>
> I had a meeting and did nothing to the box; no topologies were run. I came back 2 hours later and nimbus was at 100% CPU.
>
> I'm running on an m1-small on the following ami - ami-58a3cf68. I'm unable to get a thread dump as the process is getting killed and restarted too fast. I did attach a 3-hour snapshot of the EC2 monitors. Any guidance would be much appreciated.
>
> Thanks,
> S
>
> On Sun, Mar 2, 2014 at 9:11 PM, Sean Solbak <[email protected]> wrote:
>
>> The only error in the logs, and it happened over 10 days ago, was:
>>
>> 2014-02-22 01:41:27 b.s.d.nimbus [ERROR] Error when processing event
>> java.io.IOException: Unable to delete directory /mnt/storm/nimbus/stormdist/test-25-1393022928.
>>         at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:981) ~[commons-io-1.4.jar:1.4]
>>         at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381) ~[commons-io-1.4.jar:1.4]
>>         at backtype.storm.util$rmr.invoke(util.clj:442) ~[storm-core-0.9.0.1.jar:na]
>>         at backtype.storm.daemon.nimbus$do_cleanup.invoke(nimbus.clj:819) ~[storm-core-0.9.0.1.jar:na]
>>         at backtype.storm.daemon.nimbus$fn__5528$exec_fn__1229__auto____5529$fn__5534.invoke(nimbus.clj:896) ~[storm-core-0.9.0.1.jar:na]
>>         at backtype.storm.timer$schedule_recurring$this__3019.invoke(timer.clj:77) ~[storm-core-0.9.0.1.jar:na]
>>         at backtype.storm.timer$mk_timer$fn__3002$fn__3003.invoke(timer.clj:33) ~[storm-core-0.9.0.1.jar:na]
>>         at backtype.storm.timer$mk_timer$fn__3002.invoke(timer.clj:26) ~[storm-core-0.9.0.1.jar:na]
>>         at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
>>         at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_27]
>>
>> It's fine - I can rebuild a new cluster. storm-deploy makes it pretty easy.
>>
>> Thanks for your help on this!
>>
>> As for my other question: if my trident batch interval is 500 ms and I keep the spout pending and batch size small enough, will I be able to get real-time results (i.e. sub 2 seconds)? I've played with the various parameters (I literally have a spreadsheet mapping parameters to results) and haven't been able to get it. Am I just doing it wrong? What would the key parameters be? The complete latency is 500 ms, but trident seems to be way behind despite none of my bolts having a capacity > 0.6. This may have to do with nimbus being throttled, so I will report back. But if there are people out there who have done this kind of thing, I'd like to know if I'm missing an obvious parameter or something.
>>
>> Thanks,
>> S
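For reference, the knobs asked about above map onto topology configuration roughly as follows. This is only a sketch assuming Storm 0.9.x with Trident; the class name and the specific values are placeholders to experiment with, not recommendations from this thread.

import backtype.storm.Config;

public class LowLatencyTridentConfig {

    // Sketch only: values are placeholders, not recommendations.
    public static Config build() {
        Config conf = new Config();

        // How often Trident starts a new batch (ms). Lower values give
        // fresher results but add per-batch coordination overhead.
        conf.put(Config.TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS, 500);

        // For Trident, max spout pending counts batches in flight, not
        // individual tuples. Keeping it small bounds end-to-end latency.
        conf.setMaxSpoutPending(3);

        // Give slow batches time to complete before they are replayed.
        conf.setMessageTimeoutSecs(30);

        // The batch size itself is set on the spout and is spout-specific
        // (e.g. maxBatchSize on a FixedBatchSpout, fetch size on a Kafka spout).
        return conf;
    }
}

The trade-off is the usual one: smaller, more frequent batches keep latency low at the cost of throughput, so sweeping the emit interval and max spout pending together (the spreadsheet approach) is the right idea.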
>> On Sun, Mar 2, 2014 at 8:09 PM, Michael Rose <[email protected]> wrote:
>>
>>> The fact that the process is being killed constantly is a red flag. Also, why are you running it as a client VM?
>>>
>>> Check your nimbus.log to see why it's restarting.
>>>
>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>> [email protected]
>>>
>>> On Sun, Mar 2, 2014 at 7:50 PM, Sean Solbak <[email protected]> wrote:
>>>
>>>> uintx ErgoHeapSizeLimit           =   0          {product}
>>>> uintx InitialHeapSize            :=   27080896   {product}
>>>> uintx LargePageHeapSizeThreshold  =   134217728  {product}
>>>> uintx MaxHeapSize                :=   698351616  {product}
>>>>
>>>> So an initial heap of ~25 MB and a max of ~666 MB.
>>>>
>>>> It's a client process, not server (i.e. the command is "java -client -Dstorm.options..."). The process gets killed and restarted continuously with a new PID, which makes it tough to grab the PID and get stats on it. I don't have VisualVM, but if I run jstat -gc PID I get:
>>>>
>>>>  S0C    S1C    S0U    S1U     EC       EU       OC        OU       PC       PU     YGC   YGCT   FGC   FGCT    GCT
>>>> 832.0  832.0   0.0   352.9  7168.0   1115.9  17664.0   1796.0   21248.0  16029.6    5   0.268    0   0.000   0.268
>>>>
>>>> At this point I'll likely just rebuild the cluster. It's not in prod yet as I still need to tune it. I should have written 2 separate emails :)
>>>>
>>>> Thanks,
>>>> S
>>>>
>>>> On Sun, Mar 2, 2014 at 7:10 PM, Michael Rose <[email protected]> wrote:
>>>>
>>>>> I'm not seeing too much to substantiate that. What size heap are you running, and is it near filled? Perhaps attach VisualVM and check for GC activity.
>>>>>
>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>> [email protected]
>>>>>
>>>>> On Sun, Mar 2, 2014 at 6:54 PM, Sean Solbak <[email protected]> wrote:
>>>>>
>>>>>> Here it is. It appears to be some kind of race condition.
>>>>>>
>>>>>> http://pastebin.com/dANT8SQR
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 6:42 PM, Michael Rose <[email protected]> wrote:
>>>>>>
>>>>>>> Can you do a thread dump and pastebin it? It's a nice first step to figuring this out.
>>>>>>>
>>>>>>> I just checked on our Nimbus, and while it's on a larger machine, it's using <1% CPU. Also look in your logs for any clues.
>>>>>>>
>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>> [email protected]
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 6:31 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>
>>>>>>>> No, they are on separate machines. It's a 4-machine cluster - 2 workers, 1 nimbus and 1 zookeeper.
>>>>>>>>
>>>>>>>> I suppose I could just create a new cluster, but I'd like to know why this is occurring to avoid future production outages.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> S
>>>>>>>>
>>>>>>>> On Sun, Mar 2, 2014 at 6:19 PM, Michael Rose <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Are you running Zookeeper on the same machine as the Nimbus box?
>>>>>>>>>
>>>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>>>> [email protected]
>>>>>>>>>
>>>>>>>>> On Sun, Mar 2, 2014 at 6:16 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> This is the first step of 4. When I save to db I'm actually saving to a queue (just using the db for now). In the 2nd step we index the data, and in the 3rd we do aggregation/counts for reporting. The last is a search that I'm planning on using drpc for.
>>>>>>>>>> Within step 2 we pipe certain datasets in real time to the clients they apply to. I'd like this and the drpc to be sub-2s, which should be reasonable.
>>>>>>>>>>
>>>>>>>>>> You're right that I could speed up step 1 by not using trident, but our requirements seem like a good use case for the other 3 steps. With many results per second, batching shouldn't affect performance a ton if the batch size is small enough.
>>>>>>>>>>
>>>>>>>>>> What would cause nimbus to be at 100% CPU with the topologies killed?
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>> On Mar 2, 2014, at 5:46 PM, Sean Allen <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Is there a reason you are using trident?
>>>>>>>>>>
>>>>>>>>>> If you don't need to handle the events as a batch, you are probably going to get better performance without it.
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm writing a fairly basic trident topology as follows:
>>>>>>>>>>>
>>>>>>>>>>> - 4 spouts of events
>>>>>>>>>>> - merges into one stream
>>>>>>>>>>> - serializes the object as an event in a string
>>>>>>>>>>> - saves to db
>>>>>>>>>>>
>>>>>>>>>>> I split the serialization task away from the spout, since it was CPU intensive, to speed it up.
>>>>>>>>>>>
>>>>>>>>>>> The problem I have is that after 10 minutes there are over 910k tuples emitted/transferred but only 193k records saved.
>>>>>>>>>>>
>>>>>>>>>>> The overall load of the topology seems fine:
>>>>>>>>>>>
>>>>>>>>>>> - 536.404 ms complete latency at the topology level
>>>>>>>>>>> - the highest capacity of any bolt is 0.3, and that's the serialization one
>>>>>>>>>>> - each bolt task has sub-20 ms execute latency and sub-40 ms process latency
>>>>>>>>>>>
>>>>>>>>>>> So it seems trident has all the records internally, but I need these events as close to real time as possible.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have any guidance as to how to increase the throughput? Is it simply a matter of tweaking max spout pending and the batch size?
>>>>>>>>>>>
>>>>>>>>>>> I'm running it on 2 m1-smalls for now. I don't see the need to upgrade until the demand on the boxes is higher, although CPU usage on the nimbus box is pinned - it's at like 99%. Why would that be? It's at 99% even when all the topologies are killed.
>>>>>>>>>>>
>>>>>>>>>>> We are currently targeting processing 200 million records per day, which seems like it should be quite easy based on what I've read other people have achieved. I realize that hardware should be able to boost this as well, but my first goal is to get trident to push the records to the db quicker.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Sean
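For concreteness, the topology shape described above (four spouts merged into one stream, a separate CPU-heavy serialization step, then a persist to the db/queue) would look roughly like the sketch below. The spouts, state factory, updater, and the "event"/"json" field names are assumptions for illustration, not code from this thread.

import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.Stream;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.spout.IBatchSpout;
import storm.trident.state.StateFactory;
import storm.trident.state.StateUpdater;
import storm.trident.tuple.TridentTuple;

public class EventPipeline {

    // Hypothetical serialization step, kept out of the spout because it is CPU heavy.
    public static class SerializeEvent extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            // Placeholder: real code would JSON-serialize the event object.
            collector.emit(new Values(String.valueOf(tuple.getValue(0))));
        }
    }

    // Assumes each spout emits a single "event" field and that stateFactory/updater
    // write the serialized string to the db/queue.
    public static StormTopology build(IBatchSpout[] spouts,
                                      StateFactory stateFactory,
                                      StateUpdater updater) {
        TridentTopology topology = new TridentTopology();

        // Four spouts merged into one stream.
        Stream merged = topology.merge(
                topology.newStream("events-0", spouts[0]),
                topology.newStream("events-1", spouts[1]),
                topology.newStream("events-2", spouts[2]),
                topology.newStream("events-3", spouts[3]));

        // Serialize in its own step, then persist each batch.
        merged.each(new Fields("event"), new SerializeEvent(), new Fields("json"))
              .parallelismHint(4) // spread the CPU-heavy serialization across executors
              .partitionPersist(stateFactory, new Fields("json"), updater);

        return topology.build();
    }
}

Combined with the config sketch earlier in the thread, the main lever for "as close to real time as possible" is smaller, more frequent batches, accepting some throughput cost.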
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ce n'est pas une signature
>>>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Sean Solbak, BsC, MCSD
>>>>>>>> Solbak Technologies Inc.
>>>>>>>> 780.893.7326 (m)
