It's been difficult to get a repro, since the behavior only shows up under
more activity than a two-node cluster generates. However, I do have two
observations:

1. The Hadoop FileOutputCommitter has two modes, 1 and 2. Mode 1 performs all
the renames from a single node, whereas mode 2 performs them from all the
nodes with executors. Switching from mode 2 to mode 1 has reduced (but not
eliminated) the occurrences of this behavior.

2. Around the time the behavior happens, I see these exceptions in the IGFS
Hadoop clients:

17/06/10 06:33:41 WARN TaskSetManager: Lost task 6.0 in stage 4.0 (TID 15,
BN4SCH102092522, executor 6): org.apache.spark.SparkException: Task failed
while writing rows
        at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
        at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
        at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
        at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:133)
        at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:105)
        at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutProc.closeStream(HadoopIgfsOutProc.java:446)
        at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutputStream.close(HadoopIgfsOutputStream.java:142)
        at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
        at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at 
org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2429)
        at
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
        at
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:91)
        at
org.apache.spark.sql.hive.orc.OrcOutputWriter.close(OrcFileFormat.scala:251)
        at
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:252)
        at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:191)
        at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
        at
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
        at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
        ... 8 more
        Suppressed: java.io.IOException: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
                at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:133)
                at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper.withReconnectHandling(HadoopIgfsWrapper.java:329)
                at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper.delete(HadoopIgfsWrapper.java:163)
                at
org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem.delete(IgniteHadoopFileSystem.java:838)
                at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortTask(FileOutputCommitter.java:615)
                at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortTask(FileOutputCommitter.java:604)
                at
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortTask(HadoopMapReduceCommitProtocol.scala:153)
                at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:198)
                at
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1350)
                ... 9 more
        Caused by: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
                at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
                at
org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
                at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutProc.delete(HadoopIgfsOutProc.java:259)
                at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper$6.apply(HadoopIgfsWrapper.java:166)
                at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper$6.apply(HadoopIgfsWrapper.java:163)
                at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsWrapper.withReconnectHandling(HadoopIgfsWrapper.java:312)
                ... 16 more
        Caused by: java.lang.InterruptedException
                at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
                at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:161)
                ... 21 more
Caused by: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
        at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
        at
org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
        at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsOutProc.closeStream(HadoopIgfsOutProc.java:443)
        ... 21 more
Caused by: java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:161)
        ... 23 more




17/06/10 06:32:59 WARN TaskSetManager: Lost task 15.1 in stage 4.0 (TID 291,
BN4SCH102092623, executor 5): java.io.IOException: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
        at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:133)
        at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsUtils.cast(HadoopIgfsUtils.java:105)
        at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsInputStream.read(HadoopIgfsInputStream.java:198)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at
org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
        at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:964)
        at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:793)
        at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
        at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1019)
        at
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:205)
        at
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
        at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createReaderFromFile(OrcInputFormat.java:230)
        at
org.apache.hadoop.hive.ql.io.orc.SparkOrcNewRecordReader.<init>(SparkOrcNewRecordReader.java:48)
        at
org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:155)
        at
org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:129)
        at
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
        at
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
        at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:150)
        at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
        at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
        at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
        at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: class
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
        at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
        at
org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
        at
org.apache.ignite.internal.processors.hadoop.impl.igfs.HadoopIgfsInputStream.read(HadoopIgfsInputStream.java:186)
        ... 31 more
Caused by: java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:161)
        ... 33 more
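
Both traces above bottom out in java.lang.InterruptedException, wrapped first
in IgniteInterruptedCheckedException and then in java.io.IOException. As a
rough sketch (not part of Ignite or Spark), a helper like the one below could
walk the cause chain to tell interrupt-induced failures apart from other I/O
errors. The class and method names are hypothetical, and a RuntimeException
stands in for the Ignite exception so the example needs only the JDK.

```java
import java.io.IOException;

public class RootCause {
    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable cur = t;
        while (cur.getCause() != null) {
            cur = cur.getCause();
        }
        return cur;
    }

    /** True when the innermost cause is a thread interrupt. */
    static boolean causedByInterrupt(Throwable t) {
        return rootCause(t) instanceof InterruptedException;
    }

    public static void main(String[] args) {
        // Stand-in for the chain in the traces:
        // IOException -> (Ignite checked exception) -> InterruptedException.
        Throwable failure = new IOException("wrapped",
                new RuntimeException("stand-in for IgniteInterruptedCheckedException",
                        new InterruptedException()));
        System.out.println(causedByInterrupt(failure));
    }
}
```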


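For the committer mode in observation 1, the Hadoop property involved is
mapreduce.fileoutputcommitter.algorithm.version. One way to set it for a Spark
job, assuming the usual behavior where Spark copies "spark.hadoop."-prefixed
properties into the Hadoop configuration, is a --conf argument to
spark-submit. The sketch below only builds that argument string; the class and
method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CommitterConf {
    // Hadoop property that selects the FileOutputCommitter algorithm (1 or 2).
    static final String HADOOP_KEY = "mapreduce.fileoutputcommitter.algorithm.version";

    /** Returns the Spark-side setting for the requested committer mode. */
    static Map<String, String> sparkSettings(int version) {
        if (version != 1 && version != 2) {
            throw new IllegalArgumentException("committer version must be 1 or 2");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        // The "spark.hadoop." prefix asks Spark to pass the key to Hadoop.
        settings.put("spark.hadoop." + HADOOP_KEY, Integer.toString(version));
        return settings;
    }

    /** Renders the settings as spark-submit --conf arguments. */
    static String toSubmitArgs(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("--conf ").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mode 1: renames are done from a single node at job commit.
        System.out.println(toSubmitArgs(sparkSettings(1)));
    }
}
```

Passing 2 instead of 1 gives the argument for the mode where the renames run
on the executor nodes.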

--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/igfs-meta-behavior-when-node-restarts-tp13155p13599.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
