Hi all,

I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 1.6.3. I have a dataset of around 500M rows (averaging 128 bytes per record). Its current compressed size is around 13 GB, but my problem started when it was much smaller, maybe 5 GB. The dataset is generated by querying an existing ORC dataset in HDFS and selecting a subset of the existing data (i.e. removing duplicates).
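For context, the job is essentially the following. This is only a minimal sketch of what I'm doing: the HDFS paths, the column name, and the app name are placeholders rather than my real ones.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Minimal sketch of the job (placeholder paths, app name, and column).
    val sc = new SparkContext(new SparkConf().setAppName("dedup-orc"))
    val sqlContext = new HiveContext(sc)

    // Read the existing ORC dataset from HDFS.
    val existing = sqlContext.read.orc("hdfs:///data/input_orc")

    // Keep a subset of the data by removing duplicates.
    val deduped = existing.dropDuplicates(Seq("id"))

    // Write the result back to HDFS as ORC -- this is the step that fails.
    deduped.write.orc("hdfs:///data/output_orc")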
When I write the result back to HDFS as ORC, I get the following exceptions in the driver:

    org.apache.spark.SparkException: Task failed while writing rows
    Caused by: java.lang.RuntimeException: Failed to commit task
    Suppressed: java.lang.IllegalArgumentException: Column has wrong number of index entries found: 0 expected: 32
    Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...

This happens multiple times. The executors log the following a few times before hitting the same exceptions as above:

    2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
    java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)

My HDFS datanode says:

    2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE, cliID: DFSClient_attempt_201612090102_0000_m_000025_0_956624542_193, offset: 0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 93026972
    2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
    2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:57790 dst: /127.0.0.1:50010
    java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:57790]

It looks like the datanode is receiving the block on multiple ports (threads?) and one of the sending connections terminates early.

I was originally running 6 executors with 6 cores and 24 GB RAM each (Total: 36 cores, 144 GB) and experienced many of these issues; occasionally my job would fail altogether. Lowering the number of cores appears to reduce the frequency of these errors, but I'm now down to 4 executors with 2 cores each (Total: 8 cores), which is significantly less, and I still see approximately 1-3 task failures.

Details:
- Spark 1.6.3
- Standalone mode
- RDD compression enabled
- HDFS replication disabled
- Everything running on the same host
- Otherwise vanilla configs for Hadoop and Spark

(A rough sketch of this configuration is at the end of this message, after my signature.)

Does anybody have any ideas or hints? I can't imagine the problem is solely related to the number of executor cores.

Thanks,
Joe Naegele
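P.S. In case it helps, here is a rough sketch of the current (reduced) configuration expressed as a SparkConf; the master URL and app name are placeholders, and I've left out anything I haven't touched.

    import org.apache.spark.SparkConf

    // Rough sketch of the current (reduced) setup: 4 executors x 2 cores = 8 cores total.
    val conf = new SparkConf()
      .setAppName("dedup-orc")
      .setMaster("spark://localhost:7077")  // standalone master (placeholder URL)
      .set("spark.executor.cores", "2")     // 2 cores per executor
      .set("spark.cores.max", "8")          // caps the job at 8 cores => 4 executors
      .set("spark.rdd.compress", "true")    // RDD compression enabled
    // HDFS side: replication is disabled (dfs.replication = 1 in hdfs-site.xml),
    // and Spark and HDFS all run on the same host.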