[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access to tmpDir from $PWD to HDFS
[ https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1285#comment-1285 ]

Apache Spark commented on SPARK-25778:
--------------------------------------

User 'gss2002' has created a pull request for this issue:
https://github.com/apache/spark/pull/22867

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access
> to tmpDir from $PWD to HDFS
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25778
>                 URL: https://issues.apache.org/jira/browse/SPARK-25778
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming, YARN
>    Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 2.3.2
>            Reporter: Greg Senia
>            Priority: Major
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails for lack of access to
> an HDFS path, because it derives a "non-existent" directory name from the
> $PWD-relative tmpDir of the YARN application master.
>
> While using Spark Streaming with WriteAheadLogs, I noticed the errors below
> after the driver attempted to recover already-read data that had been written
> to HDFS in the checkpoint folder. After spending many hours tracing the cause
> of the error, it turned out to be that the parent folder /hadoop exists in
> our HDFS filesystem. Would it be possible to add a configurable option to
> choose an alternate bogus directory that will never be used?
> hadoop fs -ls /
> drwx------   - dsadm dsadm          0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx------   - dsadm dsadm          0 2017-06-20 13:20 /hadoop/apps
>
> From streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala:
>
> val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"),
>   UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
>
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, ha20t5002dn.tech.hdp.example.com, executor 1): org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
> 	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
> 	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
> 	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
> 	at scala.Option.getOrElse(Option.scala:121)
> 	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:108)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=hdpdevspark, access=EXECUTE, inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx------
> 	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> 	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> 	at
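A minimal sketch of the failure mechanism described above (paths are hypothetical, shortened from the report): YARN sets java.io.tmpdir to a directory inside the container's working directory, so the "non-existent" dummy path built by WriteAheadLogBackedBlockRDD inherits that prefix. When the write-ahead log is backed by HDFS, the same string is resolved as an HDFS path, and the NameNode checks EXECUTE permission on every ancestor component (here /hadoop, owned by dsadm with mode drwx------), which denies any other user.

```scala
import java.io.File
import java.util.UUID

// Mirrors the dummy-directory construction quoted from
// WriteAheadLogBackedBlockRDD.scala above: join the JVM tmpdir with a
// random UUID. Nothing is created locally, but the resulting string is
// later interpreted as a path on the WAL's filesystem (HDFS in the
// report), where each ancestor must grant EXECUTE to the caller.
def nonExistentDirectory(tmpDir: String): String =
  new File(tmpDir, UUID.randomUUID().toString).getAbsolutePath

// Hypothetical container-local tmpdir, as YARN would set it.
val containerTmp = "/hadoop/diskc/yarn/local/tmp"
val walDummyDir = nonExistentDirectory(containerTmp)
// walDummyDir starts with /hadoop/..., so an HDFS permission check on
// /hadoop applies even though the directory is never meant to exist.
```

This illustrates why the collision depends only on the tmpdir prefix, not on the UUID suffix: any existing HDFS ancestor with restrictive permissions triggers the AccessControlException.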
[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access
[ https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658934#comment-16658934 ]

Greg Senia commented on SPARK-25778:
------------------------------------

[~hyukjin.kwon] thanks for the heads-up on the priority; I never realized that, but I appreciate it. Any thoughts on whether a -D system property or a new Spark property would be the better fit for this? Thanks
[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access
[ https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658573#comment-16658573 ]

Hyukjin Kwon commented on SPARK-25778:
--------------------------------------

Please avoid setting the priority to Critical; it is usually reserved for committers.
[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access
[ https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656308#comment-16656308 ]

Greg Senia commented on SPARK-25778:
------------------------------------

Could we add a snippet like this? I just tested it in my environment, and this option lets me work around the issue. It could be wired up either as a -D system property or as a Spark flag:

try {
  // The WriteAheadLogUtils.createLog*** method needs a directory to create a
  // WriteAheadLog object, as the default FileBasedWriteAheadLog needs a directory
  // for writing log data. However, the directory is not needed if data only needs
  // to be read, hence a dummy path is provided to satisfy the method parameter
  // requirements. FileBasedWriteAheadLog will not create any file or directory at
  // that path. Also, this dummy directory should not already exist, otherwise the
  // WAL will try to recover past events from the directory and throw errors.
  // Note: System.getProperty returns null when the key is unset, so guard
  // against null before checking for emptiness.
  val hdfsTmpDir = System.getProperty("hdfs.writeahead.tmpdir")
  val nonExistentDirectory = if (hdfsTmpDir != null && !hdfsTmpDir.isEmpty) {
    new File(hdfsTmpDir, UUID.randomUUID().toString).getAbsolutePath
  } else {
    new File(System.getProperty("java.io.tmpdir"),
      UUID.randomUUID().toString).getAbsolutePath
  }
  writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
    SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
  dataRead = writeAheadLog.read(partition.walRecordHandle)
} catch {
  case NonFatal(e) =>
    throw new SparkException(
      s"Could not read data from write ahead log record ${partition.walRecordHandle}", e)
}
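The fallback rule in the snippet above can be sketched in isolation. This is a hedged sketch, not released Spark code: hdfs.writeahead.tmpdir is the property name proposed in this comment, not an existing Spark option, and the null guard matters because System.getProperty returns null for unset keys (calling .isEmpty on that directly would throw a NullPointerException).

```scala
import java.io.File
import java.util.UUID

// Prefer the proposed hdfs.writeahead.tmpdir system property when it is
// set and non-empty; otherwise fall back to the JVM's java.io.tmpdir.
def chooseWalBaseDir(): String = {
  val configured = System.getProperty("hdfs.writeahead.tmpdir")
  if (configured != null && configured.nonEmpty) configured
  else System.getProperty("java.io.tmpdir")
}

// Build the dummy WAL directory under whichever base was chosen, exactly
// as WriteAheadLogBackedBlockRDD does with java.io.tmpdir today.
def dummyWalPath(): String =
  new File(chooseWalBaseDir(), UUID.randomUUID().toString).getAbsolutePath
```

With the property set to a path that is guaranteed never to exist on HDFS (e.g. via -Dhdfs.writeahead.tmpdir=... in the driver and executor JVM options), the dummy directory no longer collides with restricted HDFS ancestors such as /hadoop.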