Yes, I suggest you try to spot the problem by looking at the heap dump of a
worker that throws the exception. That way you can at least be certain
about what consumes the workers' memory.
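If you don't already have a dump, one can be taken from a live worker with
the standard JDK tools (a sketch - the pid below is a placeholder, and
`jmap` must run as the same user as the worker):

```
# List JVM processes to find the worker's pid (look for the "worker" main class)
jps -l

# Take a heap dump of live objects from the worker (12345 is a placeholder pid)
jmap -dump:live,format=b,file=worker-heap.hprof 12345
```

The resulting .hprof file can then be opened in MAT or VisualVM.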

Oracle HotSpot has a number of options controlling GC logging; setting them
for the worker JVMs may help in troubleshooting. Plumbr's Handbook seems to
be decent reading on that matter: https://plumbr.eu/handbook.
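As a sketch (option names are for HotSpot 7/8; the heap size, log path, and
the %ID% placeholder - which Storm substitutes with the worker port - are
assumptions to adapt to your setup), GC logging can be enabled for workers
via worker.childopts in storm.yaml:

```yaml
# storm.yaml -- sketch only; adjust -Xmx and the log path to your cluster
worker.childopts: "-Xmx6g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/storm/worker-%ID%-gc.log -XX:+HeapDumpOnOutOfMemoryError"
```

Workers pick up the new options after the topology is redeployed.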

Since you are using a custom spout, could you provide its code, or at least
the part that emits tuples?
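For reference, the emit path of a reliable spout in Storm 0.9.x typically
looks like the sketch below (names like nextMessageId/nextPayload are
hypothetical; declareOutputFields is omitted). The important part is passing
a message id to emit and overriding ack/fail, since that is what makes
topology.max.spout.pending effective:

```java
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Values;

// Sketch of a reliable spout's emit path (hypothetical helper names).
public class MySpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        Object msgId = nextMessageId(); // hypothetical: unique id per tuple
        // Emitting with a message id makes the tuple tracked by the acker,
        // which is what topology.max.spout.pending throttles on.
        collector.emit(new Values(nextPayload()), msgId);
    }

    @Override
    public void ack(Object msgId) { /* mark msgId as done */ }

    @Override
    public void fail(Object msgId) { /* re-queue msgId for replay */ }

    private Object nextMessageId() { return null; } // placeholder
    private Object nextPayload()   { return null; } // placeholder
}
```

If your spout emits without a message id, the tuples are untracked and
max.spout.pending is silently ignored.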

2016-01-15 23:57 GMT+03:00 Nikolaos Pavlakis <[email protected]>:

> Hi Yury.
>
> 1. I am using Storm 0.9.5
> 2. It is a BaseRichSpout. Yes, it has acking enabled and I ack each tuple
> at the end of the "execute" method of the bolt. I see tuples being acked in
> Storm UI.
> 3. Yes I observe memory usage increasing (which eventually leads to the
> topology hanging) even in my dummy setup which is not saving anything in
> memory, it merely reproduces the message-passing of my algorithm. I do not
> get OOM errors when I execute the topology on the cluster, but I get the
> most common exception in Storm: *java.lang.RuntimeException:
> java.lang.NullPointerException at
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)*
> and some tasks die and the Storm UI statistics get lost/restarted. I have
> never profiled a topology that is being executed on the cluster, so I am not
> very certain if this is what you mean. If I understand correctly, what you
> are suggesting is to take a heap dump using VisualVM on some node while the
> topology is running and analyze this heap dump.
> 4. I haven't seen any GC logs (not sure how to collect GC logs from the
> cluster).
>
> Thanks again for your help.
>
> On Fri, Jan 15, 2016 at 9:57 PM, Yury Ruchin <[email protected]>
> wrote:
>
>> Hi Nick,
>>
>> Some questions:
>>
>> 1. Well, what version of Storm are you using? :)
>>
>> 2. What spout are you using? Is it reliable, i.e. does it
>> use message ids so that messages are acked/failed by downstream bolts? Do you
>> have ackers enabled for your topology? If the spout is unreliable or there are
>> no ackers, then topology.max.spout.pending has no effect, and if your bolts
>> don't keep up with your spout, you will likely end up with the overflow buffer
>> growing larger and larger.
>>
>> 3. Not sure if I get it right: after you stopped saving anything in
>> memory - do you still see memory usage increasing? Have you observed
>> OutOfMemoryErrors? If yes, you might want to launch your worker processes
>> with -XX:+HeapDumpOnOutOfMemoryError. If not, you can take an on-demand heap
>> dump using e.g. VisualVM and feed it to a memory analyzer such as MAT,
>> then look at what eats up the heap.
>>
>> 4. Why do you think it's a memory issue? Have you looked at the GC graphs
>> shown by e.g. VisualVM? Did you collect any GC logs to see how long
>> collections took?
>>
>> Regards
>> Yury
>>
>> 2016-01-15 20:15 GMT+03:00 Nikolaos Pavlakis <[email protected]>:
>>
>>> Thanks for all the replies so far. I am profiling the topology in local
>>> mode with VisualVM and I do not see this problem. I am still running into
>>> this problem when the topology is deployed on the cluster, even with
>>> max.spout.pending = 1.
>>>
>>> On Wed, Jan 13, 2016 at 10:38 PM, John Yost <[email protected]>
>>> wrote:
>>>
>>>> +1 for Andrew, profiling with jvisualvm or a similar tool is
>>>> definitely something to do if you have not done so already
>>>>
>>>> On Wed, Jan 13, 2016 at 3:30 PM, Andrew Xor <
>>>> [email protected]> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Care to give the Storm/JVM versions? Does this happen on cluster
>>>>> execution only, or also when running the topology in local mode?
>>>>> Unfortunately, probably the best way to find out what's really going on is to
>>>>> profile your topology... if you can run the topology locally, this will make
>>>>> things quite a bit easier, as profiling Storm topologies on a live cluster
>>>>> can be quite time consuming.
>>>>>
>>>>> Regards.
>>>>>
>>>>> On Wed, Jan 13, 2016 at 10:06 PM, Nikolaos Pavlakis <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am implementing a distributed algorithm for PageRank estimation
>>>>>> using Storm. I have been having memory problems, so I decided to create a
>>>>>> dummy implementation that does not explicitly save anything in memory, to
>>>>>> determine whether the problem lies in my algorithm or my Storm structure.
>>>>>>
>>>>>> Indeed, while the only thing the dummy implementation does is
>>>>>> message-passing (a lot of it), the memory of each worker process keeps
>>>>>> rising until the pipeline is clogged. I do not understand why this might 
>>>>>> be
>>>>>> happening.
>>>>>>
>>>>>> My cluster has 18 machines (some with 8g, some 16g and some 32g of
>>>>>> memory). I have set the worker heap size to 6g (-Xmx6g).
>>>>>>
>>>>>> My topology is very very simple:
>>>>>> One spout
>>>>>> One bolt (with parallelism).
>>>>>>
>>>>>> The bolt receives data from the spout (fieldsGrouping) and also from
>>>>>> other tasks of itself.
>>>>>>
>>>>>> My message-passing pattern is based on random walks with a certain
>>>>>> stopping probability. More specifically:
>>>>>> The spout generates a tuple.
>>>>>> One specific task from the bolt receives this tuple.
>>>>>> Based on a certain probability, this task generates another tuple and
>>>>>> emits it again to another task of the same bolt.
>>>>>>
>>>>>>
>>>>>> I have been stuck on this problem for quite a while, so it would be
>>>>>> very helpful if someone could help.
>>>>>>
>>>>>> Best Regards,
>>>>>> Nick
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
