Hi,

I have been able to reproduce this problem in our dev environment, and I am now 
fairly sure that it is indeed a bug.
I have therefore created a JIRA ticket 
(SPARK-3967<https://issues.apache.org/jira/browse/SPARK-3967>) for this issue. 
It is triggered when yarn.nodemanager.local-dirs (not hadoop.tmp.dir, as I 
said below) is set to a comma-separated list of directories located on 
different disks/partitions.

Regards,
Christophe.

On 14/10/2014 09:37, Christophe Préaud wrote:
Hi,

Sorry to insist, but I really believe the problem described below is a bug in 
Spark.
Can anybody confirm whether it is a bug, or a configuration problem on my side?

Thanks,
Christophe.

On 10/10/2014 18:24, Christophe Préaud wrote:
Hi,

After updating from spark-1.0.0 to spark-1.1.0, my Spark applications fail 
most of the time (but not always) in yarn-cluster mode (but not in yarn-client 
mode).

Here is my configuration:

 *   spark-1.1.0
 *   hadoop-2.2.0

And the hadoop.tmp.dir definition in the hadoop core-site.xml file (each 
directory is on its own partition, on different disks):
<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/d1/yarn/local,file:/d2/yarn/local,file:/d3/yarn/local,file:/d4/yarn/local,file:/d5/yarn/local,file:/d6/yarn/local,file:/d7/yarn/local</value>
</property>

After investigating, it turns out that the problem occurs when the executor 
fetches a jar file: the jar is downloaded to a temporary file, always in 
/d1/yarn/local (see the hadoop.tmp.dir definition above), and then moved to one 
of the temporary directories defined in hadoop.tmp.dir:

 *   if it is the same directory as the temporary file's (i.e. /d1/yarn/local), 
the application continues normally
 *   if it is another one (e.g. /d2/yarn/local, /d3/yarn/local, ...), it fails 
with the following error:

14/10/10 14:33:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 0)
java.io.FileNotFoundException: ./logReader-1.0.10.jar (Permission denied)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
        at com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
        at com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
        at com.google.common.io.Files.copy(Files.java:436)
        at com.google.common.io.Files.move(Files.java:651)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:440)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:325)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:323)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:323)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
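For context, here is a minimal sketch of the rename-then-copy fallback visible 
in the Guava frames of the trace above (the method name and demo setup are 
hypothetical, not Guava's actual source): a plain rename fails across 
partitions, and the fallback then has to open the destination for writing, 
which is where a permission error would surface.

```java
import java.io.*;
import java.nio.file.Files;

public class MoveFallbackDemo {
    // Sketch of a move that first tries an atomic rename, then falls back
    // to copy + delete -- the fallback is the step that can fail with
    // FileNotFoundException (Permission denied) if the destination
    // directory is not writable.
    static void moveWithFallback(File from, File to) throws IOException {
        if (!from.renameTo(to)) { // rename fails across partitions/filesystems
            try (InputStream in = new FileInputStream(from);
                 OutputStream out = new FileOutputStream(to)) { // needs write access
                in.transferTo(out);
            }
            if (!from.delete()) {
                throw new IOException("failed to delete " + from);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo: source and destination on the same partition, so the
        // rename path is taken and the move succeeds.
        File dir = Files.createTempDirectory("move-demo").toFile();
        File src = new File(dir, "src.jar");
        Files.write(src.toPath(), "payload".getBytes());
        File dst = new File(dir, "dst.jar");
        moveWithFallback(src, dst);
        System.out.println(dst.exists() && !src.exists()); // prints true
    }
}
```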

I have no idea why the move fails when the source and target files are not on 
the same partition. For the moment, I have worked around the problem with the 
attached patch (i.e. I ensure that the temporary file and the moved file are 
always on the same partition).
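The idea behind the workaround can be sketched like this (a hypothetical 
illustration, not the actual attached patch): create the temporary download 
file inside the target directory itself, so the final move is always a 
same-partition rename and never needs a copy fallback.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class SamePartitionMove {
    // Hypothetical sketch of the workaround: the temp file is created in the
    // target directory, so it is guaranteed to be on the same partition as
    // the destination and the final rename cannot cross filesystems.
    static File moveViaLocalTemp(byte[] contents, File targetDir, String fileName)
            throws IOException {
        File tmp = File.createTempFile("fetchFileTemp", null, targetDir);
        Files.write(tmp.toPath(), contents); // stands in for the actual download
        File dest = new File(targetDir, fileName);
        if (!tmp.renameTo(dest)) {           // same-partition rename
            throw new IOException("failed to move " + tmp + " to " + dest);
        }
        return dest;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("fetch-demo").toFile();
        File jar = moveViaLocalTemp("jar-bytes".getBytes(), dir, "logReader-demo.jar");
        System.out.println(jar.exists()); // prints true
    }
}
```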

Any thought about this problem?

Thanks!
Christophe.

________________________________
Kelkoo SAS
Société par Actions Simplifiée
Share capital: €4,168,964.30
Registered office: 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively for 
their addressees. If you are not the intended recipient of this message, please 
delete it and notify the sender.

