I upped the ulimit to 128k files on all nodes. Job crashed again with
DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275.
Couldn't get the logs because I killed the job, and it looks like YARN
wiped the container logs (not sure why it removes the logs under
/var/log/hadoop-yarn/container).
Do you see this error right at the beginning, or after running for some time?
The root cause seems to be that your Spark executors somehow got killed,
which killed the receivers and caused the further errors. Please take a
look at the logs of the lost executor to find the root cause.
Appeared after running for a while. I re-ran the job and this time, it
crashed with:
14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for
stream 0: Error in block pushing thread - java.net.SocketException: Too
many open files
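When chasing a "Too many open files" error like the one above, it can help to see how close a process is to its descriptor limit before it dies. A minimal diagnostic sketch (not from the thread itself; it inspects the current shell, so substitute the Spark executor JVM's pid in practice):

```shell
#!/bin/sh
# Count the open file descriptors of a process and compare against
# the per-process limit. Here we inspect this shell ($$) as a stand-in;
# replace pid with the executor JVM's pid on the worker node.
pid=$$
count=$(ls "/proc/$pid/fd" | wc -l)
limit=$(ulimit -n)
echo "open fds: $count / limit: $limit"
```

If the count climbs steadily toward the limit while the job runs, raising the limit only delays the crash; a descriptor leak (unclosed sockets or files) is the more likely culprit.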
Shouldn't the failed receiver get re-spawned on a
It did. It failed and was respawned 4 times.
In this case, "too many open files" is a sign that you need to increase
the limit on open files.
Try adding ulimit -n 16000 to your conf/spark-env.sh.
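For reference, a hedged sketch of what that spark-env.sh change might look like (the 16000 value is just the suggestion above; the OS hard limit must already be at least that high, as set via limits.conf or the 128k node-wide change mentioned earlier):

```shell
# conf/spark-env.sh -- sourced before Spark daemons/executors start.
# Raise the per-process open-file limit for the JVMs launched from
# this environment. This only works up to the hard limit configured
# on the node (e.g. in /etc/security/limits.conf).
ulimit -n 16000
```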
TD
On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith secs...@gmail.com wrote: