I will try a fresh setup very soon. Actually, I tried to compile Spark myself against Hadoop 2.5.2, but I ran into the issue I mentioned in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Master-doesn-t-start-no-logs-td23651.html
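In case it matters, this is roughly the build I attempted (following the Building Spark page for 1.4.0; the exact flags here are from memory, so treat it as a sketch rather than the literal command):

  # Spark 1.4.0 against Hadoop 2.5.x (the hadoop-2.4 profile covers 2.4.x/2.5.x)
  build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.2 -DskipTests clean package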
I was wondering if maybe the serialization/deserialization configuration could be the reason for my executor losses.

--
Henri Maxime Demoulin

2015-07-14 3:41 GMT-04:00 Akhil Das <ak...@sigmoidanalytics.com>:

> This is more likely a version mismatch between your Spark binaries and
> Hadoop. I have not tried accessing Hadoop 2.5.x with Spark 1.4.0 pre-built
> against Hadoop 2.4. If possible, you could upgrade your Hadoop to 2.6 and
> download the Spark binaries for that version, or you can download the
> Spark source and compile it against Hadoop 2.5.
>
> Thanks
> Best Regards
>
> On Tue, Jul 14, 2015 at 2:18 AM, maxdml <maxdemou...@gmail.com> wrote:
>
>> Hi,
>>
>> I have several issues related to HDFS that may have different roots. I'm
>> posting as much information as I can, in the hope that I can get your
>> opinion on at least some of them. Basically, the cases are:
>>
>> - HDFS classes not found
>> - Connections with some datanodes seem to be slow or to close unexpectedly
>> - Executors become lost (and cannot be relaunched due to an out-of-memory
>> error)
>>
>> *What I'm looking for:
>> - HDFS misconfiguration / tuning advice
>> - Global setup flaws (impact of VMs and NUMA mismatch, for example)
>> - For the last category of issue, I'd like to know why, when an executor
>> dies, the JVM's memory is not freed, thus preventing a new executor from
>> being launched.*
>>
>> My setup is the following:
>> 1 hypervisor with 32 cores and 50 GB of RAM, with 5 VMs running on it.
>> Each VM has 5 cores and 7 GB.
>> Each node has 1 worker set up with 4 cores and 6 GB available (the
>> remaining resources are intended to be used by HDFS and the OS).
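>>
>> For reference, and only as a rough sketch mirroring the numbers above
>> (rather than a verbatim copy of the files on the nodes), the worker limits
>> are set in conf/spark-env.sh on each VM along these lines:
>>
>>   # conf/spark-env.sh on each worker VM (5 cores / 7 GB total)
>>   export SPARK_WORKER_CORES=4    # leave 1 core for HDFS and the OS
>>   export SPARK_WORKER_MEMORY=6g  # leave ~1 GB for the DataNode and the OS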
>>
>> I run a WordCount workload with a 4 GB dataset on a Spark 1.4.0 / HDFS
>> 2.5.2 setup. I got the binaries from the official websites (no local
>> compiling).
>>
>> (Issues 1) and 2) are logged on the worker, in the
>> work/app-id/exec-id/stderr file.)
>>
>> *1) Hadoop class related issues*
>>
>> /15:34:32: DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop classes
>> are unavailable. Using the older Hadoop location info code.
>> java.lang.ClassNotFoundException:
>> org.apache.hadoop.mapred.InputSplitWithLocationInfo/
>>
>> /15:40:46: DEBUG SparkHadoopUtil: Couldn't find method for retrieving
>> thread-level FileSystem input data
>> java.lang.NoSuchMethodException:
>> org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()/
>>
>> *2) HDFS performance related issues*
>>
>> The following errors arise:
>>
>> /15:43:16: ERROR TransportRequestHandler: Error sending result
>> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013,
>> chunkIndex=2},
>> buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/2f/shuffle_0_14_0.data,
>> offset=15464702, length=998530}} to /192.168.122.168:59299; closing
>> connection
>> java.io.IOException: Broken pipe/
>>
>> /15:43:16 ERROR TransportRequestHandler: Error sending result
>> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013,
>> chunkIndex=0},
>> buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/31/shuffle_0_12_0.data,
>> offset=15238441, length=980944}} to /192.168.122.168:59299; closing
>> connection
>> java.io.IOException: Broken pipe/
>>
>> /15:44:28: WARN TransportChannelHandler: Exception in connection from
>> /192.168.122.15:50995
>> java.io.IOException: Connection reset by peer/ (note that it's on another
>> executor)
>>
>> Some time later:
>>
>> /15:44:52 DEBUG DFSClient: DFSClient seqno: -2 status: SUCCESS status:
>> ERROR downstreamAckTimeNanos: 0
>> 15:44:52 WARN DFSClient: DFSOutputStream ResponseProcessor exception for
>> block BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758
>> java.io.IOException: Bad response ERROR for block
>> BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758 from datanode
>> x.x.x.x:50010
>>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)/
>>
>> The following two errors appear several times:
>>
>> /15:51:05 ERROR Executor: Exception in task 19.0 in stage 1.0 (TID 51)
>> java.nio.channels.ClosedChannelException
>>         at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1528)
>>         at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
>>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>>         at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>         at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:81)
>>         at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:102)
>>         at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:95)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1110)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
>>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
>>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>>         at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)/
>>
>> /15:51:19 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1:
>> [actor] received message AssociationError
>> [akka.tcp://sparkExecutor@192.168.122.142:38277] ->
>> [akka.tcp://sparkDriver@x.x.x.x:34732]: Error [Invalid address:
>> akka.tcp://sparkDriver@x.x.x.x:34732] [
>> akka.remote.InvalidAssociation: Invalid address:
>> akka.tcp://sparkDriver@x.x.x.x:34732
>> Caused by: akka.remote.transport.Transport$InvalidAssociationException:
>> Connection refused: /x.x.x.x:34732
>> ] from Actor[akka://sparkExecutor/deadLetters]/
>>
>> In the datanode's logs:
>>
>> /ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>> localhost.localdomain:50010:DataXceiver error processing WRITE_BLOCK
>> operation src: /192.168.122.15:56468 dst: /192.168.122.229:50010
>> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected
>> local=/192.168.122.229:50010 remote=/192.168.122.15:56468]/
>>
>> I can also find the following warnings:
>>
>> /2015-07-13 15:46:57,927 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write
>> data to disk cost:718ms (threshold=300ms)
>> 2015-07-13 15:46:59,933 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write
>> packet to mirror took 1298ms (threshold=300ms)/
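>>
>> Side note: I believe the 60000 ms in the DataXceiver trace above is just
>> the default HDFS socket read timeout. As a sketch, the hdfs-site.xml
>> properties that I understand govern these timeouts, which I have left at
>> their defaults, are:
>>
>>   <!-- hdfs-site.xml; values shown are the defaults (60 s read, 480 s write) -->
>>   <property>
>>     <name>dfs.client.socket-timeout</name>
>>     <value>60000</value>
>>   </property>
>>   <property>
>>     <name>dfs.datanode.socket.write.timeout</name>
>>     <value>480000</value>
>>   </property>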
>>
>> *3) Executor losses*
>>
>> Early in the job, the master's logs display the following messages:
>>
>> /15/07/13 13:46:50 INFO Master: Removing executor app-20150713133347-0000/5
>> because it is EXITED
>> 15/07/13 13:46:50 INFO Master: Launching executor app-20150713133347-0000/9
>> on worker worker-20150713153302-192.168.122.229-59013
>> 15/07/13 13:46:50 DEBUG Master: [actor] handled message (2.247517 ms)
>> ExecutorStateChanged(app-20150713133347-0000,5,EXITED,Some(Command exited
>> with code 1),Some(1)) from
>> Actor[akka.tcp://sparkWorker@192.168.122.229:59013/user/Worker#-83763597]/
>>
>> This does not cease until the job completes or ends up failing (depending
>> on the number of executors actually failing).
>>
>> Here are the Java logs available for each attempted executor launch (in
>> work/app-id/exec-id on the worker):
>> http://pastebin.com/B4FbXvHR
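P.S. To be concrete about the serialization setting I mentioned at the top, and about keeping a relaunched executor within the 6 GB worker limit: the kind of change I have in mind would look roughly like this in spark-defaults.conf (illustrative values only, not what is currently deployed):

  # spark-defaults.conf -- illustrative sketch
  spark.executor.memory  4g   # stay well below SPARK_WORKER_MEMORY so a relaunched executor still fits
  spark.serializer       org.apache.spark.serializer.KryoSerializer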