This is a known issue with how storm stores its metrics. The metrics that appear on the UI for each worker are stored in zookeeper in a per-worker location, so if a worker crashes for some reason and then comes back up, the metrics for that worker are reset to 0. That is why the metrics appear to go backwards. We have plans to fix this, but they are still a work in progress.
The issue you are seeing from your exception is that you are running out of heap space in your worker. Java throws an OutOfMemoryError either when there is not enough heap to satisfy the allocation you are making, or when the JVM has spent over 98% of its time doing GC while trying to find enough memory to make progress. The exception you hit is the latter. This may mean a number of things.

You have 768MB as the default heap size in storm.yaml, so unless you override it when you launch your topology, that is the heap size. I don't know if you are using RAS for scheduling or not, but if not then you don't need to set supervisor.memory.capacity.mb or worker.heap.memory.mb.

Common issues I have seen that make a worker run out of memory:

1) Too many executors in a worker. If you are not using the RAS scheduler the heap size of your worker is hard coded, and if you have too many bolts and spouts in a single worker it can easily run out of memory. To fix this, either launch more workers or increase the heap size by setting topology.worker.childopts (see the sketch below).

2) Your topology cannot keep up with the incoming data and you haven't set topology.max.spout.pending. Max spout pending allows for flow control in your topology when acking is enabled. Without it, if the input data arrives faster than your topology can process it, that data is stored in memory. This can quickly get out of hand and your workers can crash. You can typically see this with the logging metrics consumer: if the receive queue population metric is large, close to the capacity metric, then you know the buffer is filling up. The best way to fix this is to either enable flow control with max spout pending or to increase the parallelism of the components that are slow, but the latter can put you back in situation 1 if you don't also increase the number of workers or the heap size (again, see the sketch below).

3) You might have a memory leak in your code. If that is the case you need to take a heap dump of one of the workers that is in trouble. You can also have the JVM produce a heap dump automatically on an out-of-memory error by changing the worker options (see the storm.yaml example below). If you see lots of things inside of a disruptor queue then it is probably issue 2.

I hope this helps. As for the NPEs, you really need to look at the "Caused by" for anything coming from the disruptor queue, as it is just wrapping the exception that was most likely thrown by something in your topology.
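In case it is useful, here is a minimal sketch of how the settings from points 1 and 2 can be applied when submitting a topology. It assumes the pre-1.0 backtype.storm API that your stack traces show; the class name, the 4 workers, the -Xmx2g value, and the max spout pending of 1000 are placeholders to illustrate the idea, not tuned recommendations:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.metric.LoggingMetricsConsumer;
    import backtype.storm.topology.TopologyBuilder;

    public class SubmitWithMemorySettings {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... wire up your spouts and bolts here ...

            Config conf = new Config();

            // Point 1: give each worker more heap than the 768MB default and
            // spread the executors over more workers. -Xmx2g is only an example.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                     "-Xmx2g -Djava.net.preferIPv4Stack=true");
            conf.setNumWorkers(4);

            // Point 2: enable flow control so pending tuples are not buffered
            // without bound. This only takes effect when acking is enabled.
            conf.setMaxSpoutPending(1000);

            // Log per-component metrics (capacity, receive queue population, ...)
            // so you can see which queues are filling up.
            conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);

            StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
        }
    }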
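For the automatic heap dump in point 3, one option is to add the standard HotSpot JVM flags to worker.childopts in storm.yaml (these are plain JVM options, not Storm settings, and the dump path below is just an example directory):

    worker.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/storm/data/heapdumps"

You can then open the resulting .hprof file in a heap analyzer such as Eclipse MAT; if most of the retained memory is tuples sitting in a disruptor queue, that points back to issue 2 rather than a leak.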
- Bobby

On Thu, Oct 26, 2017 at 5:04 AM Subarna Chatterjee <[email protected]> wrote:

> Hello,
>
> I am deploying a storm topology on a clustered environment with one master
> and two supervisor nodes. The topology works fine for some time, then the
> number of tuples emitted decreases and I am getting two different errors:
>
> java.lang.RuntimeException: java.lang.NullPointerException at
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:135)
> at backtype.storm.utils.DisruptorQueue.consumeBatchWhe
>
> and
>
> apache storm java.lang.OutOfMemoryError: GC overhead limit exceeded at
> java.lang.reflect.Method.copy(Method.java:151) at
> java.lang.reflect.ReflectAccess.copyMethod(ReflectAccess.java:136) at
> sun.reflect.Reflect
>
> In my storm.yaml, the content is as follows:
>
> storm.zookeeper.servers:
>     - "zoo01"
> storm.zookeeper.port: 2181
>
> nimbus.host: "nimbus01"
> nimbus.thrift.port: 6627
>
> nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
>
> ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
>
> supervisor.childopts: "-Djava.net.preferIPv4Stack=true"
>
> worker.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
>
> storm.local.dir: "/usr/local/storm/data"
>
> java.library.path: "/usr/lib/jvm/java-7-openjdk-amd64/"
>
> supervisor.memory.capacity.mb: 32000
>
> worker.heap.memory.mb: 2000
>
> supervisor.slots.ports:
>     - 6700
>     - 6701
>     - 6702
>     - 6703
>
> Can anyone help?
>
> Thanks a lot! :)
>
> Thanking you,
>
> Dr. Subarna Chatterjee, Ph.D.
> Post-Doctoral Researcher
> Inria, Rennes
> Website: http://chatterjeesubarna.wix.com/subarna
