Did you try something like -Dmapred.child.java.opts="-Xss64k"? (see GIRAPH-12)
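
For a rough sense of why the thread stack size could matter here: the error in your log is "unable to create new native thread", which usually points at native memory or thread limits rather than the Java heap. Below is a back-of-the-envelope sketch in Java. It assumes a default stack size of 1024k (common on 64-bit Linux JVMs, but please check yours) and, as a guess on my part, roughly one client thread per peer worker on top of the 100 RPC handlers shown in your log. The class and method names are made up for illustration.

```java
// Rough estimate only: stack space is reserved address space, not
// necessarily committed memory, and the real thread count may differ.
class ThreadStackEstimate {

    // Stack reservation in MiB for `threads` threads with a stack of
    // `stackKiB` KiB each (integer division, so the result is rounded down).
    static long stackReservationMiB(int threads, int stackKiB) {
        return (long) threads * stackKiB / 1024;
    }

    public static void main(String[] args) {
        int handlers = 100; // from the log: "with 100 handlers"
        int peers = 500;    // assumption: about one client thread per peer worker
        int threads = handlers + peers;

        System.out.println("threads per worker (assumed): " + threads);
        System.out.println("at -Xss1024k: "
                + stackReservationMiB(threads, 1024) + " MiB of stack");
        System.out.println("at -Xss64k:   "
                + stackReservationMiB(threads, 64) + " MiB of stack");
    }
}
```

If the JVM rejects 64k as too small for your platform, a somewhat larger value may still be worth trying.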

Christian

On Oct 10, 2011, at 11:08 AM, Zhiwei Gu wrote:

> Hi all,
> In my Giraph job, when I set the number of workers to 200 it runs fine,
> but when I set it to 500 it fails due to an early-stage OOM exception in
> one (or more) workers. Once that worker fails, the other workers that want
> to talk to it keep waiting until they have retried 5 times, and then they
> fail as well.
>
> Have you ever faced such an issue?
>
> Best,
> -z
>
>
> Here is the exception,
> 2011-10-08 09:26:59,108 INFO org.apache.giraph.comm.RPCCommunications: getRPCServer: Added jobToken Ident: 17 6a 6f 62 5f 32 30 31 31 30 38 32 36 30 39 31 31 5f 36 36 37 30 39 30, Pass: 12 26 1a f1 d2 51 e1 bf 2d 36 63 11 26 18 17 3d 53 b3 15 f6, Kind: mapreduce.job, Service: job_201108260911_667090
> 2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
> 2011-10-08 09:26:59,120 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcDetailedActivityForPort31250 registered.
> 2011-10-08 09:26:59,121 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcActivityForPort31250 registered.
> 2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
> 2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 31250: starting
> 2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 31250: starting
> 2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 31250: starting
> 2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 31250: starting
> 2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 31250: starting
> 2011-10-08 09:26:59,137 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 31250: starting
> 2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 11 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 12 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 13 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 14 on 31250: starting
> 2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 15 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 16 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 17 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 18 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 19 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 20 on 31250: starting
> 2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 21 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 22 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 24 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 25 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 26 on 31250: starting
> 2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 27 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 28 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 29 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 30 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 31 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 32 on 31250: starting
> 2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 33 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 34 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 35 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 36 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 37 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 38 on 31250: starting
> 2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 39 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 40 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 41 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 42 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 43 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 44 on 31250: starting
> 2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 45 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 46 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 47 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 48 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 49 on 31250: starting
> 2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 50 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 51 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 52 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on 31250: starting
> 2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 54 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 55 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 56 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 57 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 58 on 31250: starting
> 2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 59 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 60 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 61 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 62 on 31250: starting
> 2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 63 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 64 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 65 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 66 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 67 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 68 on 31250: starting
> 2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 69 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 70 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 71 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 73 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 74 on 31250: starting
> 2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 75 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 76 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 77 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 78 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 79 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 80 on 31250: starting
> 2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 81 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 82 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 83 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 84 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 85 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 86 on 31250: starting
> 2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 88 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 89 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 90 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 91 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 92 on 31250: starting
> 2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 93 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 94 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 95 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 96 on 31250: starting
> 2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 97 on 31250: starting
> 2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 98 on 31250: starting
> 2011-10-08 09:26:59,161 INFO org.apache.giraph.comm.BasicRPCCommunications: BasicRPCCommunications: Started RPC communication server: gsta33033.tan.ygrid.yahoo.com/10.216.176.59:31250 with 100 handlers
> 2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 99 on 31250: starting
> 2011-10-08 09:27:05,234 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-08 09:27:05,236 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:597)
>     at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
>     at org.apache.hadoop.util.Shell.run(Shell.java:182)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>     at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
>     at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
>     at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
>     at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
>     at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
>     at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:255)
>
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source ugi(org.apache.hadoop.security.UgiInstrumentation)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source jvm(org.apache.hadoop.metrics2.source.JvmMetricsSource)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source RpcDetailedActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation$Detailed)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source RpcActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation)
> 2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
>
> --
> Best Regards
> Zhiwei Gu