I will try a fresh setup very soon.

Actually, I tried to compile Spark myself against Hadoop 2.5.2, but I ran
into the issue I mentioned in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Master-doesn-t-start-no-logs-td23651.html
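
As a sanity check on the version-mismatch idea, I may run something like the
following from spark-shell (just a sketch; VersionInfo is the stock Hadoop
utility class, and I am only guessing that the bundled client jars are the
culprit):

    // Print the Hadoop version that Spark's classpath actually provides,
    // to compare against the 2.5.2 running on the cluster.
    import org.apache.hadoop.util.VersionInfo
    println(s"Hadoop client version on Spark's classpath: ${VersionInfo.getVersion}")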

I was also wondering whether serialization/deserialization configuration
could be the cause of my executor losses.
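
If that turns out to be relevant, here is roughly what I would try on the
serialization side (a minimal sketch only; switching to Kryo and the class
registration below are assumptions on my part, not my current configuration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: use Kryo instead of Java serialization and register the
    // heavily shuffled classes (example class only).
    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Array[String]]))

    val sc = new SparkContext(conf)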

--
Henri Maxime Demoulin

2015-07-14 3:41 GMT-04:00 Akhil Das <ak...@sigmoidanalytics.com>:

> This looks like a version mismatch between your Spark binaries and your
> Hadoop installation. I have not tried accessing Hadoop 2.5.x with Spark
> 1.4.0 pre-built against Hadoop 2.4. If possible, upgrade your Hadoop to 2.6
> and download the Spark binaries built for that version, or download the
> Spark source and compile it against Hadoop 2.5.
>
> Thanks
> Best Regards
>
> On Tue, Jul 14, 2015 at 2:18 AM, maxdml <maxdemou...@gmail.com> wrote:
>
>> Hi,
>>
>> I have several issues related to HDFS that may have different roots. I'm
>> posting as much information as I can, in the hope of getting your opinion
>> on at least some of them. Basically the cases are:
>>
>> - HDFS classes not found
>> - Connections with some datanodes seem to be slow or close unexpectedly.
>> - Executors become lost (and cannot be relaunched due to an out-of-memory
>> error)
>>
>> *
>> What I'm looking for:
>> - HDFS misconfiguration/tuning advice
>> - Global setup flaws (impact of VMs and NUMA mismatch, for example)
>> - For the last category of issue, I'd like to know why, when an executor
>> dies, the JVM's memory is not freed, which prevents a new executor from
>> being launched.*
>>
>> My setup is the following:
>> 1 hypervisor with 32 cores and 50 GB of RAM, with 5 VMs running on it.
>> Each VM has 5 cores and 7 GB. Each node has 1 worker set up with 4 cores
>> and 6 GB available (the remaining resources are intended for HDFS and the
>> OS).
>>
>> I run a WordCount workload with a 4 GB dataset on a Spark 1.4.0 / HDFS
>> 2.5.2 setup. I got the binaries from the official websites (no local
>> compiling).
>>
>> (Issues 1 and 2 are logged on the worker, in the work/app-id/exec-id/stderr file.)
>>
>> *1) Hadoop class related issues*
>>
>> /15:34:32: DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop classes
>> are unavailable. Using the older Hadoop location info code.
>> java.lang.ClassNotFoundException:
>> org.apache.hadoop.mapred.InputSplitWithLocationInfo/
>>
>> /
>> 15:40:46: DEBUG SparkHadoopUtil: Couldn't find method for retrieving
>> thread-level FileSystem input data
>> java.lang.NoSuchMethodException:
>> org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()/
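>>
>> (Hypothetical quick check, run from spark-shell, just to confirm whether
>> the newer Hadoop classes are visible on the classpath Spark is using:)
>>
>>     // Falls into the catch branch when the Hadoop jars Spark sees predate
>>     // the class, which would match the ClassNotFoundException above.
>>     try {
>>       Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
>>       println("InputSplitWithLocationInfo is on the classpath")
>>     } catch {
>>       case _: ClassNotFoundException =>
>>         println("InputSplitWithLocationInfo is missing from the classpath")
>>     }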
>>
>>
>> *2) HDFS performance related issues*
>>
>> The following errors arise:
>>
>> / 15:43:16: ERROR TransportRequestHandler: Error sending result
>> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013, chunkIndex=2},
>> buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/2f/shuffle_0_14_0.data,
>> offset=15464702, length=998530}} to /192.168.122.168:59299; closing connection
>> java.io.IOException: Broken pipe/
>>
>> /15:43:16 ERROR TransportRequestHandler: Error sending result
>> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013, chunkIndex=0},
>> buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/31/shuffle_0_12_0.data,
>> offset=15238441, length=980944}} to /192.168.122.168:59299; closing connection
>> java.io.IOException: Broken pipe/
>>
>>
>> /15:44:28 : WARN TransportChannelHandler: Exception in connection from
>> /192.168.122.15:50995
>> java.io.IOException: Connection reset by peer/ (note that this one is from
>> another executor)
>>
>> Some time later:
>> /15:44:52 DEBUG DFSClient: DFSClient seqno: -2 status: SUCCESS status: ERROR
>> downstreamAckTimeNanos: 0
>> 15:44:52 WARN DFSClient: DFSOutputStream ResponseProcessor exception for
>> block BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758
>> java.io.IOException: Bad response ERROR for block
>> BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758 from datanode
>> x.x.x.x:50010
>>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)/
>>
>> The following two errors appear several times:
>>
>> /15:51:05 ERROR Executor: Exception in task 19.0 in stage 1.0 (TID 51)
>> java.nio.channels.ClosedChannelException
>>         at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1528)
>>         at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
>>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>>         at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>         at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:81)
>>         at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:102)
>>         at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:95)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1110)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
>>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
>>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>>         at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)/
>>
>> /15:51:19 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1:
>> [actor]
>> received message AssociationError
>> [akka.tcp://sparkExecutor@192.168.122.142:38277] ->
>> [akka.tcp://sparkDriver@x.x.x.x:34732]: Error [Invalid address:
>> akka.tcp://sparkDriver@x.x.x.x:34732] [
>> akka.remote.InvalidAssociation: Invalid address:
>> akka.tcp://sparkDriver@x.x.x.x:34732
>> Caused by: akka.remote.transport.Transport$InvalidAssociationException:
>> Connection refused: /x.x.x.x:34732
>> ] from Actor[akka://sparkExecutor/deadLetters]/
>>
>>
>> In the datanode's logs:
>>
>> /ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>> localhost.localdomain:50010:DataXceiver error processing WRITE_BLOCK
>> operation  src: /192.168.122.15:56468 dst: /192.168.122.229:50010
>> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected
>> local=/192.168.122.229:50010 remote=/192.168.122.15:56468]/
>>
>> I can also find the following warnings:
>> /
>> 2015-07-13 15:46:57,927 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write
>> data to disk cost:718ms (threshold=300ms)
>> 2015-07-13 15:46:59,933 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write
>> packet to mirror took 1298ms (threshold=300ms)/
>>
>> *3) Executor losses*
>>
>> Early in the job, the master's logs display the following messages:
>>
>> /15/07/13 13:46:50 INFO Master: Removing executor app-20150713133347-0000/5
>> because it is EXITED
>> 15/07/13 13:46:50 INFO Master: Launching executor app-20150713133347-0000/9
>> on worker worker-20150713153302-192.168.122.229-59013
>> 15/07/13 13:46:50 DEBUG Master: [actor] handled message (2.247517 ms)
>> ExecutorStateChanged(app-20150713133347-0000,5,EXITED,Some(Command exited
>> with code 1),Some(1)) from
>> Actor[akka.tcp://sparkWorker@192.168.122.229:59013/user/Worker#-83763597]/
>>
>> This does not stop until the job completes or fails (depending on the
>> number of executors that actually fail).
>>
>> Here are the Java logs from each attempted executor launch (in
>> work/app-id/exec-id on the worker):
>> http://pastebin.com/B4FbXvHR
>>
>>
>
