Hello again, I'd like to chip in @Nikolaos, though you will probably not
like what you'll have to do... I really think this is not Storm's fault;
that would be really weird. Also, is the JVM you use for local execution
the same as the ones in your cluster? Now, the best way to find out what is
wrong would be to get profiled data from the workers that die on the
cluster, and to do that you will need to debug/profile the topology on the
cluster -- not a trivial task, and it can be time consuming. I'll tell you
a kind of "simple" way of getting the data, which is what I did when I
needed to debug a cluster topology: pass debug/profiler agent flags to the
worker JVMs via childopts and attach to them remotely (sketches just below,
details after them).
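For the attach part, the worker JVM options could look something like the
following in storm.yaml. This is only a sketch: the ports (8787, 10001) and
the YourKit agent path are placeholders you'd adapt to your cluster.

    worker.childopts: "-Xmx6g -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8787 -agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=port=10001"

The -agentlib:jdwp part opens a remote-debug port; the -agentpath part
loads the YourKit agent so your profiler can attach. If you run more than
one worker per node, a fixed port will collide -- I believe Storm
substitutes %ID% with the worker port inside worker.childopts, which helps
here, but double-check that for your version.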
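And since Yury suggests GC logs and heap dumps below, the same childopts is
the natural place for those flags too. These are the standard HotSpot
options (Java 7/8 era); the log and dump paths are again placeholders:

    worker.childopts: "-Xmx6g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/worker-%ID%-gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"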
Now, since you don't know which nodes of the cluster will be assigned to
your topology, you'd have to hook up with all of them... but thankfully you
will get responses only from the ones that actually do the work: the JVM
listening on your specified port is spawned only by the nodes executing
your topology. You will have to use the childopts argument to pass the port
that the debug JVM will listen on (details on hooking up JVMs remotely are
here <http://docs.oracle.com/javase/8/docs/technotes/guides/jpda/conninv.html>
and on attaching the YourKit agent here
<https://www.yourkit.com/docs/java/help/attach_agent.jsp>), so make sure
it's one that's not already in use -- you can use nmap to scan the nodes
for available ports. Next, depending on your version of Java, start the
worker JVMs with the agent and attach them to your profiler server --
preferably YourKit, even in trial mode, as it makes the process much easier
than other profilers. Then review the dump/profiled data to find your
issue... and if it turns out there is an issue with Storm itself, please
include as many details as you can! That will make it easier to find and
patch the issue, if any. Let us know if this helped or you need anything
else.

On Mon, Jan 18, 2016 at 11:54 PM, Yury Ruchin <[email protected]> wrote:

> Yes, I suggest you try to spot the problem by looking at the dump of a
> worker that throws the exception. That way you could at least be certain
> about what consumes the workers' memory.
>
> Oracle HotSpot has a number of options controlling GC logging; setting
> them for worker JVMs may help in troubleshooting. Plumbr's handbook seems
> to be a decent read on that matter: https://plumbr.eu/handbook.
>
> Since you are using a custom spout, could you provide its code, at least
> the part that emits tuples?
>
> 2016-01-15 23:57 GMT+03:00 Nikolaos Pavlakis <[email protected]>:
>
>> Hi Yury,
>>
>> 1. I am using Storm 0.9.5.
>>
>> 2. It is a BaseRichSpout. Yes, it has acking enabled, and I ack each
>> tuple at the end of the "execute" method of the bolt. I see tuples
>> being acked in Storm UI.
>>
>> 3. Yes, I observe memory usage increasing (which eventually leads to
>> the topology hanging) even in my dummy setup, which does not save
>> anything in memory; it merely reproduces the message-passing of my
>> algorithm. I do not get OOM errors when I execute the topology on the
>> cluster, but I get the most common exception in Storm:
>> java.lang.RuntimeException: java.lang.NullPointerException at
>> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
>> and some tasks die and the Storm UI statistics get lost/restarted. I
>> have never profiled a topology that is being executed on the cluster,
>> so I am not sure if this is what you mean. If I understand correctly,
>> you are suggesting to take a heap dump using VisualVM at some node
>> while the topology is running and analyze this heap dump.
>>
>> 4. I haven't seen any GC logs (not sure how to collect GC logs from
>> the cluster).
>>
>> Thanks again for your help.
>>
>> On Fri, Jan 15, 2016 at 9:57 PM, Yury Ruchin <[email protected]> wrote:
>>
>>> Hi Nick,
>>>
>>> Some questions:
>>>
>>> 1. Well, what version of Storm are you using? :)
>>>
>>> 2. What is the spout you are using? Is this spout reliable, i.e. does
>>> it use message ids to have messages acked/failed by downstream bolts?
>>> Do you have acker enabled for your topology?
>>> If it is unreliable or does not have acker enabled, then
>>> topology.max.spout.pending has no effect, and if your bolts don't
>>> keep up with your spout, you will likely end up with the overflow
>>> buffer growing larger and larger.
>>>
>>> 3. Not sure if I get it right: after you stopped saving anything in
>>> memory, do you still experience memory usage increasing? Have you
>>> observed OutOfMemoryErrors? If yes, you might want to launch your
>>> worker processes with -XX:+HeapDumpOnOutOfMemoryError. If no, you can
>>> take an on-demand heap dump using e.g. VisualVM and feed it to a
>>> memory analyzer, such as MAT, then take a look at what eats up the
>>> heap.
>>>
>>> 4. Why do you think it's a memory issue? Have you looked at the GC
>>> graphs shown by e.g. VisualVM? Did you collect any GC logs to see how
>>> long GC took?
>>>
>>> Regards,
>>> Yury
>>>
>>> 2016-01-15 20:15 GMT+03:00 Nikolaos Pavlakis <[email protected]>:
>>>
>>>> Thanks for all the replies so far. I am profiling the topology in
>>>> local mode with VisualVM and I do not see this problem. I still run
>>>> into this problem when the topology is deployed on the cluster, even
>>>> with max.spout.pending = 1.
>>>>
>>>> On Wed, Jan 13, 2016 at 10:38 PM, John Yost <[email protected]> wrote:
>>>>
>>>>> +1 for Andrew -- profiling with jvisualvm (or similar) is
>>>>> definitely something to do if you have not done so already.
>>>>>
>>>>> On Wed, Jan 13, 2016 at 3:30 PM, Andrew Xor <[email protected]> wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> Care to give the version of Storm/JVM you use? Does this happen on
>>>>>> cluster execution only, or also when running the topology in local
>>>>>> mode? Unfortunately, probably the best way to find out what's
>>>>>> really going on is to profile your topology... if you can run the
>>>>>> topology locally, this will make things quite a bit easier, as
>>>>>> profiling Storm topologies on a live cluster can be quite time
>>>>>> consuming.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> On Wed, Jan 13, 2016 at 10:06 PM, Nikolaos Pavlakis
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am implementing a distributed algorithm for PageRank estimation
>>>>>>> using Storm. I have been having memory problems, so I decided to
>>>>>>> create a dummy implementation that does not explicitly save
>>>>>>> anything in memory, to determine whether the problem lies in my
>>>>>>> algorithm or in my Storm structure.
>>>>>>>
>>>>>>> Indeed, while the only thing the dummy implementation does is
>>>>>>> message-passing (a lot of it), the memory of each worker process
>>>>>>> keeps rising until the pipeline is clogged. I do not understand
>>>>>>> why this might be happening.
>>>>>>>
>>>>>>> My cluster has 18 machines (some with 8g, some 16g, and some 32g
>>>>>>> of memory). I have set the worker heap size to 6g (-Xmx6g).
>>>>>>>
>>>>>>> My topology is very simple:
>>>>>>> one spout;
>>>>>>> one bolt (with parallelism).
>>>>>>>
>>>>>>> The bolt receives data from the spout (fieldsGrouping) and also
>>>>>>> from other tasks of itself.
>>>>>>>
>>>>>>> My message-passing pattern is based on random walks with a
>>>>>>> certain stopping probability. More specifically:
>>>>>>> the spout generates a tuple;
>>>>>>> one specific task of the bolt receives this tuple;
>>>>>>> based on a certain probability, this task generates another tuple
>>>>>>> and emits it to another task of the same bolt.
>>>>>>>
>>>>>>> I have been stuck on this problem for quite a while, so it would
>>>>>>> be very helpful if someone could assist.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Nick
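P.S. Nick, since the dummy topology is described above but not shown, here
is roughly how I picture the bolt -- a hedged sketch only; the class/field
names and the stop probability are my guesses, not your code:

    import java.util.Map;
    import java.util.Random;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class RandomWalkBolt extends BaseRichBolt {
        private OutputCollector collector;
        private Random rng;
        private static final double STOP_PROBABILITY = 0.15; // placeholder

        @Override
        public void prepare(Map conf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
            this.rng = new Random();
        }

        @Override
        public void execute(Tuple tuple) {
            // Continue the walk with probability 1 - STOP_PROBABILITY,
            // anchoring the new tuple to the incoming one so acking works.
            if (rng.nextDouble() > STOP_PROBABILITY) {
                collector.emit(tuple, new Values(tuple.getValue(0)));
            }
            collector.ack(tuple); // ack every tuple, as you describe
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("node"));
        }
    }

If that matches your code, keep in mind that topology.max.spout.pending
only throttles the spout; each spout tuple can still fan out into a long
chain of bolt-to-bolt tuples that sit in the internal queues until the walk
dies. Worth checking against your memory graphs.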
