It seems like an issue with Hadoop rather than Spark itself. What do you get
when you run hdfs dfsadmin -report?
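
If it's easier to check from code, the same datanode health information is
available through the HDFS client API; a rough sketch (it assumes fs.defaultFS
in your Configuration points at the cluster's namenode):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.hdfs.DistributedFileSystem

    object DatanodeCheck {
      def main(args: Array[String]): Unit = {
        // FileSystem.get returns a DistributedFileSystem when fs.defaultFS is an hdfs:// URI
        val fs = FileSystem.get(new Configuration()).asInstanceOf[DistributedFileSystem]
        fs.getDataNodeStats.foreach { dn =>
          // Per-datanode stats: capacity, usage, remaining space, last heartbeat time
          println(s"${dn.getHostName}: capacity=${dn.getCapacity} used=${dn.getDfsUsed} " +
            s"remaining=${dn.getRemaining} lastUpdate=${dn.getLastUpdate}")
        }
        fs.close()
      }
    }

The main thing to confirm is that your single datanode is still reported as
live and has disk space remaining at the time the writes start failing.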

Anecdotally (and without specifics, since it has been a while), I've generally
used Parquet instead of ORC because I ran into a bunch of random problems
reading and writing ORC with Spark... but given that ORC performs a lot better
with Hive, switching can be a pain.
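
If you want to rule the file format in or out, the write side is a one-line
change; a minimal sketch (the DataFrame name and output paths are placeholders,
and in 1.6 both formats go through the same DataFrameWriter API):

    // What I assume you do today: write the deduplicated DataFrame as ORC
    deduped.write.mode("overwrite").orc("/data/output_orc")

    // Same job, same data, written as Parquet to see whether the failure follows the format
    deduped.write.mode("overwrite").parquet("/data/output_parquet")

If the Parquet write succeeds with the same executor layout, that points at the
ORC writer; if it fails the same way, it's more likely HDFS/datanode-side.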

On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele <jnaeg...@grierforensics.com> wrote:

> Hi all,
>
> I'm having trouble with a relatively simple Spark SQL job. I'm using Spark
> 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record).
> Its current compressed size is around 13 GB, but my problem started when
> it was much smaller, maybe 5 GB. The dataset is generated by querying an
> existing ORC dataset in HDFS and selecting a subset of the existing data
> (i.e. removing duplicates); the job is sketched below. When I write this
> dataset back to HDFS as ORC I get the following exceptions in the driver:
>
> org.apache.spark.SparkException: Task failed while writing rows
> Caused by: java.lang.RuntimeException: Failed to commit task
> Suppressed: java.lang.IllegalArgumentException: Column has wrong number
> of index entries found: 0 expected: 32
> Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad.
> Aborting...
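>
> For reference, the job is essentially of this shape (simplified, with placeholder paths):
>
>     // Read the existing ORC data, drop the duplicates, and write the result back as ORC.
>     // Paths are placeholders; in 1.6 the ORC data source goes through a HiveContext.
>     val existing = sqlContext.read.orc("/data/existing_orc")
>     val deduped  = existing.dropDuplicates()
>     deduped.write.mode("overwrite").orc("/data/deduped_orc")   // this is the write that fails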
>
> These exceptions occur multiple times. The executors log the following a few
> times before hitting the same exceptions as above:
>
> 2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
> java.io.EOFException: Premature EOF: no length prefix available
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
>
> My HDFS datanode says:
>
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE, cliID: DFSClient_attempt_201612090102_0000_m_000025_0_956624542_193, offset: 0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 93026972
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:57790 dst: /127.0.0.1:50010
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:57790]
>
> It looks like the datanode is receiving the block on multiple ports
> (threads?) and one of the sending connections terminates early.
>
> I was originally running 6 executors with 6 cores and 24 GB RAM each
> (36 cores and 144 GB in total) and experienced many of these issues;
> occasionally the job would fail altogether. Lowering the number of cores
> appears to reduce the frequency of the errors, but even with 4 executors
> of 2 cores each (8 cores total), a much smaller configuration, I still see
> approximately 1-3 task failures.
>
> Details:
> - Spark 1.6.3 - Standalone
> - RDD compression enabled
> - HDFS replication disabled
> - Everything running on the same host
> - Otherwise vanilla configs for Hadoop and Spark
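>
> In SparkConf terms, the non-default settings amount to roughly the following (the master URL is a placeholder, and the executor numbers are for the original 6 x 6-core layout):
>
>     import org.apache.spark.SparkConf
>
>     // Standalone mode: 6 executors of 6 cores / 24 GB each on one host, RDD compression on.
>     val conf = new SparkConf()
>       .setMaster("spark://localhost:7077")   // placeholder standalone master URL
>       .set("spark.executor.cores", "6")
>       .set("spark.executor.memory", "24g")
>       .set("spark.cores.max", "36")
>       .set("spark.rdd.compress", "true")
>     // HDFS side: dfs.replication = 1 in hdfs-site.xml (replication disabled).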
>
> Does anybody have any ideas or hints? I can't imagine the problem is
> solely related to the number of executor cores.
>
> Thanks,
> Joe Naegele
>
