Hi,
This is a 4-node Hadoop cluster running on CentOS 6.3 with Oracle JDK (64-bit)
1.6.0_43. Each node has 32G of memory, with a maximum of 8 mapper tasks and 4
reducer tasks configured. The Hadoop version is 1.0.4.
It is set up on DataStax DSE 3.0.2, which uses Cassandra CFS as the underlying
DFS instead of HDFS with a NameNode. I understand this kind of setup is not
really tested with Hadoop MR, but my guess is that the above MR errors are not
related to it.
I am running a simple MR job that partitions 700G of data in 600 files by DATE.
The MR logic is very straightforward, but in the staging environment described
above, many reducers fail with the above error. I want to find the reason and
fix it.
1) There is nothing related to this error in the reducer task attempt log in
the user log directory. The only related entry is in system.log, which is
generated by the Cassandra process:
INFO [JVM Runner jvm_201308141528_0003_r_625176200 spawned.] 2013-08-15 07:28:59,326 JvmManager.java (line 510) JVM : jvm_201308141528_0003_r_625176200 exited with exit code -1. Number of tasks it ran: 0
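To see how many reducer JVMs die this way, the system.log entries can be scanned for the same message. The following is a minimal sketch: the regular expression is inferred only from the single sample line quoted above, not from the Hadoop source, and the function names are hypothetical.

```python
import re

# Pattern inferred from the one sample line quoted above (an assumption,
# not taken from the Hadoop source).
JVM_EXIT = re.compile(
    r"JVM : (?P<jvm_id>jvm_\d+_\d+_[mr]_-?\d+) exited with exit code "
    r"(?P<exit_code>-?\d+)\. Number of tasks it ran: (?P<tasks>\d+)"
)

def parse_jvm_exit(line):
    """Return (jvm_id, exit_code, tasks_run), or None if the line does not match."""
    m = JVM_EXIT.search(line)
    if m is None:
        return None
    return m.group("jvm_id"), int(m.group("exit_code")), int(m.group("tasks"))

def failed_before_first_task(lines):
    """Collect ids of JVMs that exited non-zero without running any task."""
    hits = []
    for line in lines:
        parsed = parse_jvm_exit(line)
        if parsed and parsed[1] != 0 and parsed[2] == 0:
            hits.append(parsed[0])
    return hits
```

Passing an open system.log file object to failed_before_first_task gives the list of affected JVM ids, which can then be matched against task attempts per node.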
2) I believe this error is related to system resources, but I could not find
anything online that points to the root cause. From the log, I believe the JVM
for the reducer task terminated or crashed, but I don't know why.
3) I checked the limits of the user the process is running as. Here is the
output; I didn't spot any obvious problems.
-bash-4.1$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256589
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 400000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32768
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
4) Since this is a new cluster, very few Hadoop settings have been changed
from their default values. I did run the reducers with '-mx2048m' to set the
JVM heap size to 2G, because the first time the reducers failed with an OOM
error. From searching around, it looks like people recommend setting
"mapred.child.ulimit" to 3x the heap size, which would be around 6G in this
case. I can give that a try, but on the nodes the virtual memory limit is
already unlimited for the user the process runs as, so I am not sure this
will really fix it.
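For reference, the arithmetic behind the "3x heap" suggestion can be sketched as follows. It assumes mapred.child.ulimit is expressed in kilobytes, as the Hadoop 1.x default configuration describes it; the helper name is hypothetical.

```python
def child_ulimit_kb(heap_mb, factor=3):
    """Convert a JVM heap size in MB to a mapred.child.ulimit value in KB.

    Assumes the property is expressed in kilobytes; `factor` is the
    rule-of-thumb multiplier mentioned above, not an official recommendation.
    """
    return heap_mb * factor * 1024

# 2048 MB heap * 3 = 6144 MB = 6291456 KB
print(child_ulimit_kb(2048))
```

So for a 2G heap, the value to try would be 6291456.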
5) Another possibility I found online is that the child process returns -1
when it fails to write to the user logs, because Linux EXT3 limits how many
files/directories can be created under one folder (32k?). But my system uses
EXT4, and not many MR jobs have run so far.
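One way to rule this out is to count the subdirectories under the task user log directory on each node. This is a minimal sketch: the 31998 figure is the ext3 subdirectory limit (32000 links per directory), the function names are hypothetical, and the directory to check is whatever path the TaskTracker writes user logs to on the node.

```python
import os

# ext3 allows 32000 links per directory inode, which leaves room for
# 31998 subdirectories ("." and ".." use two links).
EXT3_MAX_SUBDIRS = 31998

def count_subdirs(path):
    """Count the immediate subdirectories of `path`."""
    return sum(
        1 for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )

def near_subdir_limit(path, limit=EXT3_MAX_SUBDIRS, margin=0.9):
    """Return True if `path` holds at least `margin` * `limit` subdirectories."""
    return count_subdirs(path) >= limit * margin
```

If count_subdirs on the user log directory returns a small number, this explanation is unlikely to apply.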
6) I am really not sure what the root cause is, as exit code -1 could mean
many things. Can anyone here give me more hints, or any help with debugging
this issue in my environment? Is there any Hadoop or JVM setting I can use to
dump more info/logs about why the JVM terminated at runtime with exit code -1?
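As a starting point, the child JVM options could be extended with standard HotSpot diagnostic flags. The sketch below only assembles a candidate value for mapred.child.java.opts. The dump directory is a hypothetical placeholder, and it is an assumption that these flags would capture anything for a JVM that exits before running a task.

```python
def child_java_opts(heap_mb, dump_dir):
    """Build a candidate mapred.child.java.opts string with diagnostic flags.

    `dump_dir` is a placeholder; it must exist and be writable by the task user.
    """
    opts = [
        "-Xmx{}m".format(heap_mb),
        "-XX:+HeapDumpOnOutOfMemoryError",                       # heap dump on OOM
        "-XX:HeapDumpPath={}".format(dump_dir),
        "-XX:ErrorFile={}/hs_err_pid%p.log".format(dump_dir),    # fatal error log, %p = pid
        "-verbose:gc",                                           # GC activity on stdout
    ]
    return " ".join(opts)

print(child_java_opts(2048, "/tmp/mr_dumps"))
```

Note that the ulimit output above shows a core file size of 0, so no core file would be written if the JVM crashed natively.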
Thanks
Yong