It could be related to the ulimit on your machines, specifically the open-files limit (ulimit -n) for the user that runs the Storm workers. A good value to start with is around 65000.
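The limit that matters is the one the worker process actually runs with, which may differ from what an interactive shell reports. As a minimal sketch (Linux only, and not something shipped with Storm), the Python snippet below prints the limits and the number of open descriptors for the process it runs in. The function names `nofile_limits` and `open_fd_count` are made up for illustration. To inspect a running worker instead, pass its pid to `open_fd_count` (this needs permission to read that process's /proc entry) and read /proc/<pid>/limits for its limits.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open file descriptors of a process by listing /proc/<pid>/fd.

    Linux only. Reading another user's process usually requires root.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    print(f"open descriptors: {open_fd_count()}")
```

If the count for a worker keeps climbing toward its soft limit, raising the limit only buys time and something is probably leaking descriptors. If the limit itself is low (1024 is a common default), raising it is the first thing to try.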
On Tue, Jun 10, 2014, at 10:40 AM, Sean Allen wrote:

We are running a 0.9.0.1 cluster. Everything was fine until last week. No changes were made, and we now regularly have nodes dying with the following exception. Note that the number of open files is really low; we aren't out of file handles.

Has anyone else encountered this?

2014-06-10 13:34:04 b.s.d.worker [ERROR] Error when processing event
java.io.FileNotFoundException: /opt/storm/var/storm/workers/b9ec5518-9430-4275-9844-e2f6e203e3ce/heartbeats/1402421644201 (Too many open files)
    at java.io.FileOutputStream.open(Native Method) ~[na:1.7.0_17]
    at java.io.FileOutputStream.<init>(FileOutputStream.java:212) ~[na:1.7.0_17]
    at java.io.FileOutputStream.<init>(FileOutputStream.java:165) ~[na:1.7.0_17]
    at org.apache.commons.io.FileUtils.openOutputStream(FileUtils.java:179) ~[commons-io-1.4.jar:1.4]
    at org.apache.commons.io.FileUtils.writeByteArrayToFile(FileUtils.java:1282) ~[commons-io-1.4.jar:1.4]
    at backtype.storm.utils.LocalState.persist(LocalState.java:69) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.utils.LocalState.put(LocalState.java:49) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.daemon.worker$do_heartbeat.invoke(worker.clj:51) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.daemon.worker$fn__5882$exec_fn__1229__auto____5883$heartbeat_fn__5884.invoke(worker.clj:339) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.timer$schedule_recurring$this__3019.invoke(timer.clj:77) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.timer$mk_timer$fn__3002$fn__3003.invoke(timer.clj:33) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.timer$mk_timer$fn__3002.invoke(timer.clj:26) [storm-core-0.9.0.1.jar:na]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
    at java.lang.Thread.run(Thread.java:722) [na:1.7.0_17]

--
This is not a signature
