It's been a little difficult trying to get a repro, since the behavior
happens when there's more activity than what happens with just two nodes.
However, I did have two observations:
1. The Hadoop FileOutputCommitter has two modes, 1 and 2; 1 performs all the
renames from a single node, where as 2 performs them from all the nodes with
executors. By switching the mode from 2 to 1, it's reduced (but not
eliminated) the occurrences of this behavior happening.
2. Around the time the behavior happens, I see these in the IGFS hadoop
clients:
17/06/10 06:33:41 WARN TaskSetManager: Lost task 6.0 in stage 4.0 (TID 15,
BN4SCH102092522, executor 6): org.apache.spark.SparkException: Task failed
while writing rows
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:133)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:105)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutProc.closeStream(HadoopIgfsOutProc.java:446)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutputStream.close(HadoopIgfsOutputStream.java:142)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at
org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2429)
at
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
at
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:91)
at
org.apache.spark.sql.hive.orc.OrcOutputWriter.close(OrcFileFormat.scala:251)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:252)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:191)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
... 8 more
Suppressed: java.io.IOException: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:133)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper.withReconnectHandling(HadoopIgfsWrapper.java:329)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper.delete(HadoopIgfsWrapper.java:163)
at
org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem.delete(IgniteHadoopFileSystem.java:838)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortTask(FileOutputCommitter.java:615)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortTask(FileOutputCommitter.java:604)
at
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortTask(HadoopMapReduceCommitProtocol.scala:153)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:198)
at
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1350)
... 9 more
Caused by: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
at
org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutProc.delete(HadoopIgfsOutProc.java:259)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper$6.apply(HadoopIgfsWrapper.java:166)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper$6.apply(HadoopIgfsWrapper.java:163)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper.withReconnectHandling(HadoopIgfsWrapper.java:312)
... 16 more
Caused by: java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:161)
... 21 more
Caused by: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
at
org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutProc.closeStream(HadoopIgfsOutProc.java:443)
... 21 more
Caused by: java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:161)
... 23 more
17/06/10 06:32:59 WARN TaskSetManager: Lost task 15.1 in stage 4.0 (TID 291,
BN4SCH102092623, executor 5): java.io.IOException: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:133)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:105)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsInputStream.read(HadoopIgfsInputStream.java:198)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at
org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:964)
at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:793)
at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1019)
at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:205)
at
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createReaderFromFile(OrcInputFormat.java:230)
at
org.apache.hadoop.hive.ql.io.orc.SparkOrcNewRecordReader.<init>(SparkOrcNewRecordReader.java:48)
at
org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:155)
at
org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:129)
at
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
at
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:150)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
at
org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsInputStream.read(HadoopIgfsInputStream.java:186)
... 31 more
Caused by: java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:161)
... 33 more
--
View this message in context:
http://apache-ignite-users.70518.x6.nabble.com/igfs-meta-behavior-when-node-restarts-tp13155p13599.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.