[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access to tmpDir from $PWD to HDFS
[ https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1285#comment-1285 ]

Apache Spark commented on SPARK-25778:
--------------------------------------

User 'gss2002' has created a pull request for this issue:
https://github.com/apache/spark/pull/22867

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access
> to tmpDir from $PWD to HDFS
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25778
>                 URL: https://issues.apache.org/jira/browse/SPARK-25778
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming, YARN
>    Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 2.3.2
>            Reporter: Greg Senia
>            Priority: Major
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails for lack of access to
> an HDFS path, because it derives a "non-existent" directory name from the
> $PWD-relative tmpDir of the YARN application master.
>
> While using Spark Streaming with WriteAheadLogs, I noticed the errors below
> after the driver attempted to recover already-read data that had been written
> to HDFS in the checkpoint folder. After spending many hours tracing the cause
> of the error, it turned out to be that the parent folder /hadoop exists in
> our HDFS filesystem. Would it be possible to add a configurable option to
> choose an alternate bogus directory that will never be used?
> hadoop fs -ls /
> drwx------   - dsadm dsadm          0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx------   - dsadm dsadm          0 2017-06-20 13:20 /hadoop/apps
>
> From streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala:
>
> val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"),
>   UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
>
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, ha20t5002dn.tech.hdp.example.com, executor 1): org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
> 	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
> 	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
> 	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
> 	at scala.Option.getOrElse(Option.scala:121)
> 	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:108)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=hdpdevspark, access=EXECUTE, inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx------
> 	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> 	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> 	at
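A minimal sketch of the failure mechanism described above (paths are hypothetical, shortened from the report): YARN sets java.io.tmpdir to a directory inside the container's working directory, so the "non-existent" dummy path built by WriteAheadLogBackedBlockRDD inherits that prefix. When the write-ahead log is backed by HDFS, the same string is resolved as an HDFS path, and the NameNode checks EXECUTE permission on every ancestor component (here /hadoop, owned by dsadm with mode drwx------), which denies any other user.

```scala
import java.io.File
import java.util.UUID

// Mirrors the dummy-directory construction quoted from
// WriteAheadLogBackedBlockRDD.scala above: join the JVM tmpdir with a
// random UUID. Nothing is created locally, but the resulting string is
// later interpreted as a path on the WAL's filesystem (HDFS in the
// report), where each ancestor must grant EXECUTE to the caller.
def nonExistentDirectory(tmpDir: String): String =
  new File(tmpDir, UUID.randomUUID().toString).getAbsolutePath

// Hypothetical container-local tmpdir, as YARN would set it.
val containerTmp = "/hadoop/diskc/yarn/local/tmp"
val walDummyDir = nonExistentDirectory(containerTmp)
// walDummyDir starts with /hadoop/..., so an HDFS permission check on
// /hadoop applies even though the directory is never meant to exist.
```

This illustrates why the collision depends only on the tmpdir prefix, not on the UUID suffix: any existing HDFS ancestor with restrictive permissions triggers the AccessControlException.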
[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access
[ https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658934#comment-16658934 ]

Greg Senia commented on SPARK-25778:
------------------------------------

[~hyukjin.kwon] thanks for the heads-up on the priority; I never realized that, but I appreciate it. Any thoughts on whether a -D system property or a new Spark property would be the better fit for this? Thanks
[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access
[ https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658573#comment-16658573 ]

Hyukjin Kwon commented on SPARK-25778:
--------------------------------------

Please avoid setting the priority to Critical; it is usually reserved for committers.
[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access
[ https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656308#comment-16656308 ]

Greg Senia commented on SPARK-25778:
------------------------------------

Could we add a snippet like this? I just tested it in my environment, and this option lets me work around the issue. It could be wired up either as a -D system property or as a Spark flag:

try {
  // The WriteAheadLogUtils.createLog*** method needs a directory to create a
  // WriteAheadLog object, as the default FileBasedWriteAheadLog needs a directory
  // for writing log data. However, the directory is not needed if data only needs
  // to be read, hence a dummy path is provided to satisfy the method parameter
  // requirements. FileBasedWriteAheadLog will not create any file or directory at
  // that path. Also, this dummy directory should not already exist, otherwise the
  // WAL will try to recover past events from the directory and throw errors.
  // Note: System.getProperty returns null when the key is unset, so guard
  // against null before checking for emptiness.
  val hdfsTmpDir = System.getProperty("hdfs.writeahead.tmpdir")
  val nonExistentDirectory = if (hdfsTmpDir != null && !hdfsTmpDir.isEmpty) {
    new File(hdfsTmpDir, UUID.randomUUID().toString).getAbsolutePath
  } else {
    new File(System.getProperty("java.io.tmpdir"),
      UUID.randomUUID().toString).getAbsolutePath
  }
  writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
    SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
  dataRead = writeAheadLog.read(partition.walRecordHandle)
} catch {
  case NonFatal(e) =>
    throw new SparkException(
      s"Could not read data from write ahead log record ${partition.walRecordHandle}", e)
}
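The fallback rule in the snippet above can be sketched in isolation. This is a hedged sketch, not released Spark code: hdfs.writeahead.tmpdir is the property name proposed in this comment, not an existing Spark option, and the null guard matters because System.getProperty returns null for unset keys (calling .isEmpty on that directly would throw a NullPointerException).

```scala
import java.io.File
import java.util.UUID

// Prefer the proposed hdfs.writeahead.tmpdir system property when it is
// set and non-empty; otherwise fall back to the JVM's java.io.tmpdir.
def chooseWalBaseDir(): String = {
  val configured = System.getProperty("hdfs.writeahead.tmpdir")
  if (configured != null && configured.nonEmpty) configured
  else System.getProperty("java.io.tmpdir")
}

// Build the dummy WAL directory under whichever base was chosen, exactly
// as WriteAheadLogBackedBlockRDD does with java.io.tmpdir today.
def dummyWalPath(): String =
  new File(chooseWalBaseDir(), UUID.randomUUID().toString).getAbsolutePath
```

With the property set to a path that is guaranteed never to exist on HDFS (e.g. via -Dhdfs.writeahead.tmpdir=... in the driver and executor JVM options), the dummy directory no longer collides with restricted HDFS ancestors such as /hadoop.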