Yes, I suggest you try to spot the problem by looking at a heap dump of one of the workers that throws the exception. That way you could at least be certain about what consumes the workers' memory.
Oracle HotSpot has a number of options controlling GC logging; setting them for the worker JVMs may help in troubleshooting. Plumbr's Handbook seems to be a decent read on the matter: https://plumbr.eu/handbook. Since you are using a custom spout, could you provide its code, or at least the part that emits tuples?

2016-01-15 23:57 GMT+03:00 Nikolaos Pavlakis <[email protected]>:

> Hi Yury.
>
> 1. I am using Storm 0.9.5.
> 2. It is a BaseRichSpout. Yes, it has acking enabled and I ack each tuple
> at the end of the "execute" method of the bolt. I see tuples being acked
> in the Storm UI.
> 3. Yes, I observe memory usage increasing (which eventually leads to the
> topology hanging) even in my dummy setup, which does not save anything in
> memory; it merely reproduces the message passing of my algorithm. I do
> not get OOM errors when I execute the topology on the cluster, but I get
> the most common exception in Storm: *java.lang.RuntimeException:
> java.lang.NullPointerException at
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)*
> and some tasks die and the Storm UI statistics get lost/restarted. I have
> never profiled a topology while it is being executed on the cluster, so I
> am not certain this is what you mean. If I understand correctly, you are
> suggesting that I take a heap dump with VisualVM on some node while the
> topology is running and analyze that heap dump.
> 4. I haven't seen any GC logs (I am not sure how to collect GC logs from
> the cluster).
>
> Thanks again for your help.
>
> On Fri, Jan 15, 2016 at 9:57 PM, Yury Ruchin <[email protected]>
> wrote:
>
>> Hi Nick,
>>
>> Some questions:
>>
>> 1. Well, what version of Storm are you using? :)
>>
>> 2. What spout are you using? Is this spout reliable, i.e. does it use
>> message ids so that messages get acked/failed by downstream bolts? Do
>> you have the acker enabled for your topology?
>> If it is unreliable, or the acker is disabled, then
>> topology.max.spout.pending has no effect, and if your bolts don't keep
>> up with your spout, you will likely end up with the overflow buffer
>> growing larger and larger.
>>
>> 3. Not sure if I get it right: after you stopped saving anything in
>> memory, do you still experience memory usage increasing? Have you
>> observed OutOfMemoryErrors? If yes, you might want to launch your worker
>> processes with -XX:+HeapDumpOnOutOfMemoryError. If no, you can take an
>> on-demand heap dump using e.g. VisualVM and feed it to a memory
>> analyzer, such as MAT, then take a look at what eats up the heap.
>>
>> 4. Why do you think it's a memory issue? Have you looked at the GC
>> graphs shown by e.g. VisualVM? Did you collect any GC logs to see how
>> long collections took?
>>
>> Regards,
>> Yury
>>
>> 2016-01-15 20:15 GMT+03:00 Nikolaos Pavlakis <[email protected]>:
>>
>>> Thanks for all the replies so far. I am profiling the topology in
>>> local mode with VisualVM and I do not see this problem there. I am
>>> still running into this problem when the topology is deployed on the
>>> cluster, even with max.spout.pending = 1.
>>>
>>> On Wed, Jan 13, 2016 at 10:38 PM, John Yost <[email protected]>
>>> wrote:
>>>
>>>> +1 for Andrew; definitely agree that profiling with jvisualvm or
>>>> whatever is something to do if you have not done so already.
>>>>
>>>> On Wed, Jan 13, 2016 at 3:30 PM, Andrew Xor <
>>>> [email protected]> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Care to give the Storm/JVM versions? Does this happen on cluster
>>>>> execution only, or also when running the topology in local mode?
>>>>> Unfortunately, probably the best way to find out what's really going
>>>>> on is to profile your topology... if you can run the topology
>>>>> locally, this will make things quite a bit easier, as profiling
>>>>> Storm topologies on a live cluster can be quite time consuming.
>>>>>
>>>>> Regards.
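For reference, the worker JVM flags discussed above (GC logging and on-OOM heap dumps) are typically set per worker via worker.childopts in storm.yaml. This is only a sketch: the log paths are examples, and the %ID% placeholder, which Storm 0.9.x is assumed to substitute with the worker port, keeps per-worker files apart.

```yaml
# Sketch of a storm.yaml fragment (not from the thread): enable HotSpot GC
# logging and on-OOM heap dumps for worker JVMs. Paths are examples; %ID%
# is assumed to be replaced by Storm with the worker port.
worker.childopts: >
  -Xmx6g
  -verbose:gc
  -XX:+PrintGCDetails
  -XX:+PrintGCTimeStamps
  -Xloggc:/var/log/storm/gc-worker-%ID%.log
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/storm/heapdump-worker-%ID%.hprof
```

Supervisors need a restart to pick up the change, and the resulting GC logs can then be read alongside the VisualVM graphs mentioned above.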
>>>>>
>>>>> On Wed, Jan 13, 2016 at 10:06 PM, Nikolaos Pavlakis <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am implementing a distributed algorithm for PageRank estimation
>>>>>> using Storm. I have been having memory problems, so I decided to
>>>>>> create a dummy implementation that does not explicitly save
>>>>>> anything in memory, to determine whether the problem lies in my
>>>>>> algorithm or in my Storm structure.
>>>>>>
>>>>>> Indeed, while the only thing the dummy implementation does is
>>>>>> message passing (a lot of it), the memory of each worker process
>>>>>> keeps rising until the pipeline is clogged. I do not understand why
>>>>>> this might be happening.
>>>>>>
>>>>>> My cluster has 18 machines (some with 8 GB, some with 16 GB, and
>>>>>> some with 32 GB of memory). I have set the worker heap size to 6 GB
>>>>>> (-Xmx6g).
>>>>>>
>>>>>> My topology is very simple:
>>>>>> One spout.
>>>>>> One bolt (with parallelism).
>>>>>>
>>>>>> The bolt receives data from the spout (fieldsGrouping) and also
>>>>>> from other tasks of itself.
>>>>>>
>>>>>> My message-passing pattern is based on random walks with a certain
>>>>>> stopping probability. More specifically:
>>>>>> The spout generates a tuple.
>>>>>> One specific task of the bolt receives this tuple.
>>>>>> With a certain probability, this task generates another tuple and
>>>>>> emits it to another task of the same bolt.
>>>>>>
>>>>>> I have been stuck on this problem for quite a while, so it would be
>>>>>> very helpful if someone could help.
>>>>>>
>>>>>> Best Regards,
>>>>>> Nick
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
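A side note on the message-passing volume implied by the random-walk pattern described in the original post: with a per-hop stopping probability p, the number of tuples processed per walk (including the initial spout tuple) is geometrically distributed with mean 1/p, so in steady state the in-flight tuple count grows quickly as p shrinks. The following is a small self-contained sketch, not code from the thread; the class name and the 0.2 stop probability are made up for illustration.

```java
import java.util.Random;

// Hypothetical sketch (not from the thread): estimate the average number of
// tuples a single walk generates, given a per-hop stopping probability. With
// stop probability p, the count is geometric with mean 1/p, which is the
// average number of tuple deliveries each spout tuple causes in total.
public class WalkLengthEstimate {
    static double averageHops(double stopProb, int walks, long seed) {
        Random rnd = new Random(seed);
        long totalHops = 0;
        for (int i = 0; i < walks; i++) {
            int hops = 1; // the bolt task always processes the spout tuple once
            while (rnd.nextDouble() >= stopProb) {
                hops++; // walk continues: one more tuple emitted to a peer task
            }
            totalHops += hops;
        }
        return (double) totalHops / walks;
    }

    public static void main(String[] args) {
        // With a 20% stop probability, walks average about 1/0.2 = 5 hops.
        System.out.println(averageHops(0.2, 200000, 42L));
    }
}
```

This is why max.spout.pending = 1 alone does not bound the load: the spout-side throttle counts pending spout tuples, while the bolt-to-bolt emissions of an already-admitted walk still fan out by roughly 1/p.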
