Hi all,

I'm having trouble with a relatively simple Spark SQL job on Spark 1.6.3. I
have a dataset of around 500M rows (average 128 bytes per record). Its current
compressed size is around 13 GB, but my problem started when it was much
smaller, maybe 5 GB. The dataset is generated by querying an existing ORC
dataset in HDFS and selecting a subset of the existing data (i.e. removing
duplicates), then writing the result back to HDFS as ORC.
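The job is roughly equivalent to the sketch below (paths, the app name, and the
exact dedup logic are placeholders, not my actual code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DedupOrcJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedup-orc"))
    // The ORC data source on Spark 1.6 needs Hive support.
    val sqlContext = new HiveContext(sc)

    // Read the existing ORC dataset from HDFS (placeholder path).
    val df = sqlContext.read.orc("hdfs:///data/source_orc")

    // Select the subset I want, i.e. remove duplicate rows.
    val deduped = df.dropDuplicates()

    // Write the result back to HDFS as ORC -- this is the step that fails.
    deduped.write.orc("hdfs:///data/deduped_orc")
  }
}

When I write this dataset to HDFS using ORC I get the following exceptions in
the driver: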

org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.lang.RuntimeException: Failed to commit task
Suppressed: java.lang.IllegalArgumentException: Column has wrong number of
index entries found: 0 expected: 32

Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad.
Aborting...
This happens multiple times. The executors log the following a few times
before hitting the same exceptions as above:

2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer
class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor
exception for block
BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)

My HDFS datanode says:

2016-12-09 02:39:24,783 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE,
cliID: DFSClient_attempt_201612090102_0000_m_000025_0_956624542_193, offset:
0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid:
BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637,
duration: 93026972
2016-12-09 02:39:24,783 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637,
type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-12-09 02:39:49,262 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation
src: /127.0.0.1:57790 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010
remote=/127.0.0.1:57790]

It looks like the datanode is receiving the block on multiple ports (threads?) 
and one of the sending connections terminates early.

I was originally running 6 executors with 6 cores and 24 GB RAM each (total:
36 cores, 144 GB) and experienced many of these issues; occasionally my job
would fail altogether. Lowering the number of cores appears to reduce the
frequency of these errors, but I'm now down to 4 executors with 2 cores each
(total: 8 cores), which is significantly less, and I still see approximately
1-3 task failures.
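
For reference, the current sizing corresponds roughly to standalone-mode
settings like these (the master URL and the way I express the sizing are
assumptions for illustration; only the core/executor counts are from my
actual runs):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("dedup-orc")
  .setMaster("spark://localhost:7077")  // standalone master on the same host
  // Current reduced sizing: 8 cores total / 2 cores per executor = 4 executors.
  .set("spark.cores.max", "8")          // total cores for the application
  .set("spark.executor.cores", "2")     // cores per executor
  // Per-executor memory; it was 24 GB each in the original 6x6 setup.
  .set("spark.executor.memory", "24g")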

Details:
- Spark 1.6.3 - Standalone
- RDD compression enabled
- HDFS replication disabled
- Everything running on the same host
- Otherwise vanilla configs for Hadoop and Spark
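
In config terms, the non-default bits above amount to roughly the following
(I'm taking "replication disabled" to mean a replication factor of 1; the
property names are the standard Spark/Hadoop ones, not copied verbatim from
my config files):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dedup-orc")
  .set("spark.rdd.compress", "true")   // RDD compression enabled
val sc = new SparkContext(conf)

// Single-node HDFS with replication effectively disabled (factor 1).
// This normally lives in hdfs-site.xml as dfs.replication = 1; setting it on
// the job's Hadoop configuration should affect files this job writes.
sc.hadoopConfiguration.set("dfs.replication", "1")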

Does anybody have any ideas or hints? I can't imagine the problem is solely 
related to the number of executor cores.

Thanks,
Joe Naegele
