[ https://issues.apache.org/jira/browse/SPARK-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411159#comment-15411159 ]
Saisai Shao edited comment on SPARK-16914 at 8/8/16 1:48 AM:
-------------------------------------------------------------

So from your description, is this exception mainly due to a problem with disk1 that causes leveldb to fail to write data to it? Maybe SPARK-14963 could address your problem: it uses the NM's recovery dir to store aux-service data. And I guess the NM will handle this disk-failure problem if you configure multiple disks for the NM local dir.

was (Author: jerryshao):
So from your description, is this exception mainly due to a problem with disk1 that causes leveldb to fail to write data to it? Maybe SPARK-16917 could address your problem: it uses the NM's recovery dir to store aux-service data. And I guess the NM will handle this disk-failure problem if you configure multiple disks for the NM local dir.

> NodeManager crash when Spark is registering executor information into leveldb
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-16914
>                 URL: https://issues.apache.org/jira/browse/SPARK-16914
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.6.2
>           Reporter: cen yuhai
>
> {noformat}
> Stack: [0x00007fb5b53de000,0x00007fb5b54df000], sp=0x00007fb5b54dcba8, free space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C  [libc.so.6+0x896b1]  memcpy+0x11
>
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> j  org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Put(JLorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)J+0
> j  org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)V+11
> j  org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeBuffer;Lorg/fusesource/leveldbjni/internal/NativeBuffer;)V+18
> j  org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;[B[B)V+36
> j  org.fusesource.leveldbjni.internal.JniDB.put([B[BLorg/iq80/leveldb/WriteOptions;)Lorg/iq80/leveldb/Snapshot;+28
> j  org.fusesource.leveldbjni.internal.JniDB.put([B[B)V+10
> j  org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(Ljava/lang/String;Ljava/lang/String;Lorg/apache/spark/network/shuffle/protocol/ExecutorShuffleInfo;)V+61
> J 8429 C2 org.apache.spark.network.server.TransportRequestHandler.handle(Lorg/apache/spark/network/protocol/RequestMessage;)V (100 bytes) @ 0x00007fb5f27ff6cc [0x00007fb5f27fdde0+0x18ec]
> J 8371 C2 org.apache.spark.network.server.TransportChannelHandler.channelRead0(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (10 bytes) @ 0x00007fb5f242df20 [0x00007fb5f242de80+0xa0]
> J 6853 C2 io.netty.channel.SimpleChannelInboundHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (74 bytes) @ 0x00007fb5f215587c [0x00007fb5f21557e0+0x9c]
> J 5872 C2 io.netty.handler.timeout.IdleStateHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (42 bytes) @ 0x00007fb5f2183268 [0x00007fb5f2183100+0x168]
> J 5849 C2 io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (158 bytes) @ 0x00007fb5f2191524 [0x00007fb5f218f5a0+0x1f84]
> J 5941 C2 org.apache.spark.network.util.TransportFrameDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (170 bytes) @ 0x00007fb5f220a230 [0x00007fb5f2209fc0+0x270]
> J 7747 C2 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read()V (363 bytes) @ 0x00007fb5f264465c [0x00007fb5f2644140+0x51c]
> J 8008% C2 io.netty.channel.nio.NioEventLoop.run()V (162 bytes) @ 0x00007fb5f26f6764 [0x00007fb5f26f63c0+0x3a4]
> j  io.netty.util.concurrent.SingleThreadEventExecutor$2.run()V+13
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> {noformat}
>
> The target code in Spark is in ExternalShuffleBlockResolver:
> {code}
> /** Registers a new Executor with all the configuration we need to find its shuffle files. */
> public void registerExecutor(
>     String appId,
>     String execId,
>     ExecutorShuffleInfo executorInfo) {
>   AppExecId fullId = new AppExecId(appId, execId);
>   logger.info("Registered executor {} with {}", fullId, executorInfo);
>   try {
>     if (db != null) {
>       byte[] key = dbAppExecKey(fullId);
>       byte[] value = mapper.writeValueAsString(executorInfo).getBytes(Charsets.UTF_8);
>       db.put(key, value);
>     }
>   } catch (Exception e) {
>     logger.error("Error saving registered executors", e);
>   }
>   executors.put(fullId, executorInfo);
> }
> {code}
>
> There is a problem with disk1.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
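The hs_err trace above also explains why the existing `catch (Exception e)` in `registerExecutor` does not save the NodeManager: the fault occurs inside native code (`memcpy` reached through leveldbjni), which kills the whole JVM before any Java exception can be thrown. The mitigation discussed in the comment, letting the NM keep aux-service data off failed disks when several local dirs are configured, amounts to probing candidate directories with a real write before ever handing one to leveldb. A minimal stand-alone sketch of that selection idea, assuming nothing from the Spark/YARN codebase (the `pickHealthyDir` helper and the directory names are illustrative only):

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class LocalDirSelector {

  /**
   * Returns the first candidate dir that exists (or can be created) and
   * accepts a real write. A directory on a failed disk is skipped entirely,
   * so the leveldb store is never opened on it in the first place.
   */
  public static File pickHealthyDir(List<File> candidates) throws IOException {
    for (File dir : candidates) {
      if (!dir.isDirectory() && !dir.mkdirs()) {
        continue; // cannot even create the directory: treat this disk as bad
      }
      try {
        // Probe with an actual write: isDirectory() alone does not detect
        // a read-only mount or a failing device.
        File probe = File.createTempFile("probe", ".tmp", dir);
        if (!probe.delete()) {
          probe.deleteOnExit();
        }
        return dir;
      } catch (IOException e) {
        // The probe write failed: skip this disk and try the next one.
      }
    }
    throw new IOException("No healthy local dir among " + candidates);
  }

  public static void main(String[] args) throws IOException {
    // "/nonexistent-root/disk1" stands in for the failed disk1 from the report.
    File bad = new File("/nonexistent-root/disk1");
    File good = new File(System.getProperty("java.io.tmpdir"), "nm-local-demo");
    File chosen = pickHealthyDir(Arrays.asList(bad, good));
    System.out.println("recovery dir = " + chosen.getName());
  }
}
```

The key design point is that the write failure is detected at the Java level, where it can be handled, rather than deep inside a native `put` where it cannot.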