[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139488#comment-16139488 ]

lishuming commented on SPARK-17321:
-----------------------------------

[~jerryshao] I agree with what you said, but I still have a few questions:

1. The current selection strategy is somewhat puzzling, because both `yarn.nodemanager.local-dirs` and the NM recovery path are available locations for storing the leveldb, so we could always pick a healthy disk and avoid the broken-disk problem (https://github.com/apache/spark/pull/18905#issuecomment-323287272). A sketch of such a selection follows this message.
2. If I understand [~jerryshao] correctly, `yarn.nodemanager.local-dirs` should not be used regardless of whether NM recovery is enabled. Is that right?
3. Can someone confirm whether, if we don't use leveldb, a `Map`-based ShuffleService would affect the NM's memory or anything else?

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-17321
>                 URL: https://issues.apache.org/jira/browse/SPARK-17321
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 2.0.0, 2.1.1
>            Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed some Spark applications failing randomly because of the YarnShuffleService.
> From the log I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
> java.lang.NullPointerException
>         at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
>         at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
>         at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
>         at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
>         at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
>         at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
>         at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
>         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> {quote}
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use another good disk if the first one is broken?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
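For question 1 above, a minimal sketch of the idea: iterate over `yarn.nodemanager.local-dirs` and take the first directory that accepts a write probe, instead of blindly taking the first entry. `firstWritableDir` is a hypothetical helper, not the actual YarnShuffleService code:

{code:scala}
import java.io.File

object DiskPicker {
  // Hypothetical helper: probe each configured local dir with a temp file
  // and return the first one that is actually writable.
  def firstWritableDir(localDirs: Seq[String]): Option[File] =
    localDirs.iterator.map(new File(_)).find { dir =>
      try {
        val probe = File.createTempFile("shuffle-probe", ".tmp", dir)
        probe.delete() // clean up; reaching this line means the dir accepted a write
        true
      } catch {
        case _: java.io.IOException => false // missing, read-only, or broken disk
      }
    }
}
{code}

A probe write catches read-only remounts that a plain `File.canWrite` permission check can miss on some platforms.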
[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139482#comment-16139482 ]

lishuming commented on SPARK-21733:
-----------------------------------

[~1028344...@qq.com] Maybe you should check the executor's log for exceptions...

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> ------------------------------------------------------------------
>
>                 Key: SPARK-21733
>                 URL: https://issues.apache.org/jira/browse/SPARK-21733
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.1.1
>         Environment: Apache Spark 2.1.1
> CDH 5.12.0 YARN
>            Reporter: Jepson
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 (TID 64186). 1740 bytes result sent to driver
> 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21660) Yarn ShuffleService failed to start when the chosen directory becomes read-only
[ https://issues.apache.org/jira/browse/SPARK-21660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139471#comment-16139471 ]

lishuming commented on SPARK-21660:
-----------------------------------

Sorry, this is a duplicate of [SPARK-17321|https://issues.apache.org/jira/browse/SPARK-17321]; I will comment there.

> Yarn ShuffleService failed to start when the chosen directory becomes read-only
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-21660
>                 URL: https://issues.apache.org/jira/browse/SPARK-21660
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, YARN
>    Affects Versions: 2.1.1
>            Reporter: lishuming
>
> h3. Background
>
> In our production environment, disks become read-only almost once a month. The current strategy of the Yarn ShuffleService for choosing an available directory (disk) to store the shuffle DB is as follows (https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L340):
>
> 1. If the NodeManager's recoveryPath is not empty and the shuffle DB exists in the recoveryPath, return the recoveryPath;
> 2. If the recoveryPath is empty and the shuffle DB exists under `yarn.nodemanager.local-dirs`, set the recoveryPath to the existing DB path and return it;
> 3. If the recoveryPath is not empty (but the shuffle DB does not exist there) and the shuffle DB exists under `yarn.nodemanager.local-dirs`, move the existing shuffle DB to the recoveryPath and return it;
> 4. If none of the above applies, choose the first disk of `yarn.nodemanager.local-dirs` as the recoveryPath.
>
> None of these steps checks whether the chosen disk (directory) is writable, so in our environment we hit this exception:
> {code:java}
> 2017-06-25 07:15:43,512 ERROR org.apache.spark.network.util.LevelDBProvider: error opening leveldb file /mnt/dfs/12/yarn/local/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications
>         at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48)
>         at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
>         at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
>         at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
>         at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
> 2017-06-25 07:15:43,514 WARN org.apache.spark.network.util.LevelDBProvider: error deleting /mnt/dfs/12/yarn/local/registeredExecutors.ldb
> 2017-06-25 07:15:43,515 INFO org.apache.hadoop.service.AbstractService: Service spark_shuffle failed in state INITED; cause: java.io.IOException: Unable to create state store
>         at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:77)
>         at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
>         at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
>         at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
>         at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
>         at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:75)
> {code}
>
> h3. Consideration
>
> 1. In many production environments, `yarn.nodemanager.local-dirs` has more than one disk, so we could use a better selection strategy to avoid the problem above;
> 2. Can we add a check that the chosen DB directory is writable, to avoid the problem above?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21761) [Core] Add the application's final state for SparkListenerApplicationEnd event
lishuming created SPARK-21761:
---------------------------------

             Summary: [Core] Add the application's final state for SparkListenerApplicationEnd event
                 Key: SPARK-21761
                 URL: https://issues.apache.org/jira/browse/SPARK-21761
             Project: Spark
          Issue Type: Wish
          Components: Spark Core
    Affects Versions: 2.1.1
            Reporter: lishuming
            Priority: Minor


When adding an extra `SparkListener`, we want to get the application's final state, which I think is worth recording. Maybe we can change `SparkListenerApplicationEnd` as below (a usage sketch follows this message):

{code:java}
// current
case class SparkListenerApplicationEnd(time: Long, sparkUser: String) extends SparkListenerEvent

// proposed
import org.apache.spark.launcher.SparkAppHandle.State
case class SparkListenerApplicationEnd(time: Long, sparkUser: String, status: State) extends SparkListenerEvent
{code}

Of course, we would also need to implement this change for each deploy mode. Can someone give me some advice?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
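For illustration, this is how a listener-side consumer could branch on the proposed field. The case class is restated locally under a stand-in name (`ProposedApplicationEnd`), because the real event does not carry `status` today; only `SparkAppHandle.State` (from spark-launcher) is an existing API:

{code:scala}
import org.apache.spark.launcher.SparkAppHandle.State

// Stand-in for the proposed three-field event; not the real
// org.apache.spark.scheduler.SparkListenerApplicationEnd.
case class ProposedApplicationEnd(time: Long, sparkUser: String, status: State)

object FinalStateConsumer {
  // With the final state on the event, a listener can react directly
  // instead of inferring the outcome from other signals.
  def onApplicationEnd(end: ProposedApplicationEnd): Unit =
    if (end.status == State.FAILED) {
      println(s"application by ${end.sparkUser} failed at ${end.time}")
    }
}
{code}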
[jira] [Created] (SPARK-21660) Yarn ShuffleService failed to start when the chosen directory becomes read-only
lishuming created SPARK-21660:
---------------------------------

             Summary: Yarn ShuffleService failed to start when the chosen directory becomes read-only
                 Key: SPARK-21660
                 URL: https://issues.apache.org/jira/browse/SPARK-21660
             Project: Spark
          Issue Type: Bug
          Components: Shuffle, YARN
    Affects Versions: 2.1.1
            Reporter: lishuming


h3. Background

In our production environment, disks become read-only almost once a month. The current strategy of the Yarn ShuffleService for choosing an available directory (disk) to store the shuffle DB is as follows (https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L340):

1. If the NodeManager's recoveryPath is not empty and the shuffle DB exists in the recoveryPath, return the recoveryPath;
2. If the recoveryPath is empty and the shuffle DB exists under `yarn.nodemanager.local-dirs`, set the recoveryPath to the existing DB path and return it;
3. If the recoveryPath is not empty (but the shuffle DB does not exist there) and the shuffle DB exists under `yarn.nodemanager.local-dirs`, move the existing shuffle DB to the recoveryPath and return it;
4. If none of the above applies, choose the first disk of `yarn.nodemanager.local-dirs` as the recoveryPath.

None of these steps checks whether the chosen disk (directory) is writable, so in our environment we hit this exception:

{code:java}
2017-06-25 07:15:43,512 ERROR org.apache.spark.network.util.LevelDBProvider: error opening leveldb file /mnt/dfs/12/yarn/local/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications
        at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48)
        at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
        at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
        at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
        at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
2017-06-25 07:15:43,514 WARN org.apache.spark.network.util.LevelDBProvider: error deleting /mnt/dfs/12/yarn/local/registeredExecutors.ldb
2017-06-25 07:15:43,515 INFO org.apache.hadoop.service.AbstractService: Service spark_shuffle failed in state INITED; cause: java.io.IOException: Unable to create state store
        at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:77)
        at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
        at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
        at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
        at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
        at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:75)
{code}

h3. Consideration

1. In many production environments, `yarn.nodemanager.local-dirs` has more than one disk, so we could use a better selection strategy to avoid the problem above;
2. Can we add a check that the chosen DB directory is writable, to avoid the problem above? (A sketch follows this message.)

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
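As a sketch of consideration 2: the four-step order above could be kept while adding a writability guard, so step 4 never settles on a read-only disk. The names are illustrative (not the actual YarnShuffleService fields), and steps 2/3 are simplified — the real code also migrates the existing DB to the recovery path:

{code:scala}
import java.io.File

object RecoveryPathSketch {
  // Simplified restatement of the selection order described above,
  // with a writability check added at the fallback step.
  def chooseDbDir(recoveryPath: Option[File], localDirs: Seq[File], dbName: String): File = {
    def hasDb(dir: File): Boolean = new File(dir, dbName).exists()
    recoveryPath.filter(hasDb)              // 1. DB already under the recovery path
      .orElse(localDirs.find(hasDb))        // 2./3. an existing DB under local-dirs
      .orElse(localDirs.find(_.canWrite))   // 4. first *writable* dir, not localDirs(0)
      .getOrElse(throw new java.io.IOException(
        "no usable directory in yarn.nodemanager.local-dirs"))
  }
}
{code}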
[jira] [Updated] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"
[ https://issues.apache.org/jira/browse/SPARK-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lishuming updated SPARK-21487:
------------------------------
    Description: 
We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI Executors page became empty, with the exceptions below.
The Executors page is rendered with JavaScript rather than Scala in 2.1.1, but I don't know why this causes the problem.
Perhaps "two queries submitted at the same time with the same timestamp" cause this, but I'm not sure.

ResourceManager log:
{code:java}
2017-07-20 20:39:09,371 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: Authentication exception: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
{code}

Safari console:
{code:java}
Failed to load resource: the server responded with a status of 403 (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html
{code}

Related Links:
https://issues.apache.org/jira/browse/HIVE-12481
https://issues.apache.org/jira/browse/HADOOP-8830

  was:
We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI Executors page became empty, with the exceptions below.
The Executors page is rendered with JavaScript rather than Scala in 2.1.1, but I don't know why this causes the problem.
Perhaps "two queries submitted at the same time with the same timestamp" cause this, but I'm not sure.

ResourceManager log:
{code:java}
2017-07-20 20:39:09,371 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: Authentication exception: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
{code}

Safari console:
{code:java}
Failed to load resource: the server responded with a status of 403 (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)))http://hadoop280.lt.163.org:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html
{code}

Recent Links:
https://issues.apache.org/jira/browse/HIVE-12481
https://issues.apache.org/jira/browse/HADOOP-8830


> WebUI-Executors Page results in "Request is a replay (34) attack"
> ------------------------------------------------------------------
>
>                 Key: SPARK-21487
>                 URL: https://issues.apache.org/jira/browse/SPARK-21487
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.1.1
>            Reporter: lishuming
>            Priority: Minor
>
> We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI Executors page became empty, with the exceptions below.
> The Executors page is rendered with JavaScript rather than Scala in 2.1.1, but I don't know why this causes the problem.
> Perhaps "two queries submitted at the same time with the same timestamp" cause this, but I'm not sure.
>
> ResourceManager log:
> {code:java}
> 2017-07-20 20:39:09,371 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: Authentication exception: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
> {code}
> Safari console:
> {code:java}
> Failed to load resource: the server responded with a status of 403 (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html
> {code}
> Related Links:
> https://issues.apache.org/jira/browse/HIVE-12481
> https://issues.apache.org/jira/browse/HADOOP-8830

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"
lishuming created SPARK-21487:
---------------------------------

             Summary: WebUI-Executors Page results in "Request is a replay (34) attack"
                 Key: SPARK-21487
                 URL: https://issues.apache.org/jira/browse/SPARK-21487
             Project: Spark
          Issue Type: Bug
          Components: Web UI
    Affects Versions: 2.1.1
            Reporter: lishuming
            Priority: Minor


We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI Executors page became empty, with the exceptions below.
The Executors page is rendered with JavaScript rather than Scala in 2.1.1, but I don't know why this causes the problem.
Perhaps "two queries submitted at the same time with the same timestamp" cause this, but I'm not sure.

ResourceManager log:
{code:java}
2017-07-20 20:39:09,371 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: Authentication exception: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
{code}

Safari console:
{code:java}
Failed to load resource: the server responded with a status of 403 (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)))http://hadoop280.lt.163.org:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html
{code}

Recent Links:
https://issues.apache.org/jira/browse/HIVE-12481
https://issues.apache.org/jira/browse/HADOOP-8830

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org