There is probably a subtlety in the difference between running tasks with
process-local data and with node-local data that I'm missing.

I'm running the following basic test:

1) Copy a large text file from the local file system into HDFS using
hadoop fs -copyFromLocal

2) Run Spark's wordcount example against the text file in HDFS

Sometimes when I run the test, tasks are executed with the data presumably
process-local, as in the log below from the start of the job (TIDs 0
through 3):

14/06/12 08:02:31 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor 0: apex90.llnl.gov (PROCESS_LOCAL)
14/06/12 08:02:31 INFO TaskSetManager: Serialized task 1.0:0 as 2458 bytes in 2 ms
14/06/12 08:02:31 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on executor 0: apex90.llnl.gov (PROCESS_LOCAL)
14/06/12 08:02:31 INFO TaskSetManager: Serialized task 1.0:1 as 2458 bytes in 0 ms
14/06/12 08:02:31 INFO TaskSetManager: Starting task 1.0:2 as TID 2 on executor 0: apex90.llnl.gov (PROCESS_LOCAL)
14/06/12 08:02:31 INFO TaskSetManager: Serialized task 1.0:2 as 2458 bytes in 0 ms
14/06/12 08:02:31 INFO TaskSetManager: Starting task 1.0:3 as TID 3 on executor 0: apex90.llnl.gov (PROCESS_LOCAL)
14/06/12 08:02:31 INFO TaskSetManager: Serialized task 1.0:3 as 2458 bytes in 0 ms

Sometimes almost all the tasks in the job run process-local; sometimes
the locality level drops to NODE_LOCAL or ANY somewhere in the middle.

Other times (more commonly when I run this test with higher node
counts), the tasks always run with the data presumably node-local, as
in the log below from the start of the job (TIDs 0 through 3):

14/06/11 22:58:38 INFO TaskSetManager: Starting task 1.0:21 as TID 0 on executor 5: apex80.llnl.gov (NODE_LOCAL)
14/06/11 22:58:38 INFO TaskSetManager: Serialized task 1.0:21 as 2458 bytes in 2 ms
14/06/11 22:58:38 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on executor 27: apex78.llnl.gov (NODE_LOCAL)
14/06/11 22:58:38 INFO TaskSetManager: Serialized task 1.0:1 as 2458 bytes in 1 ms
14/06/11 22:58:38 INFO TaskSetManager: Starting task 1.0:3 as TID 2 on executor 14: apex82.llnl.gov (NODE_LOCAL)
14/06/11 22:58:38 INFO TaskSetManager: Serialized task 1.0:3 as 2458 bytes in 0 ms
14/06/11 22:58:38 INFO TaskSetManager: Starting task 1.0:11 as TID 3 on executor 15: apex105.llnl.gov (NODE_LOCAL)
14/06/11 22:58:38 INFO TaskSetManager: Serialized task 1.0:11 as 2458 bytes in 0 ms

As expected, node-local tasks run slower than process-local tasks, and
consequently those jobs run slower.
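
One way to quantify how the locality levels are distributed over a whole
run is to tally them from the TaskSetManager "Starting task" lines in the
driver log. Below is a minimal sketch in Python, assuming the log format
matches the excerpts above. The names TASK_START, tally_locality and sample
are illustrative only and are not part of Spark.

```python
import re
from collections import Counter

# Matches driver log lines of the form shown in the excerpts, e.g.
#   ... Starting task 1.0:0 as TID 0 on executor 0: apex90.llnl.gov (PROCESS_LOCAL)
# \s+ is used between tokens so that lines wrapped by a mail client still match.
# Assumption: the log format is exactly the one quoted in this message.
TASK_START = re.compile(
    r"Starting\s+task\s+\S+\s+as\s+TID\s+\d+\s+on\s+executor\s+\d+:"
    r"\s+\S+\s+\((?P<level>\w+)\)"
)


def tally_locality(log_text):
    """Count how many tasks were started at each locality level."""
    return Counter(m.group("level") for m in TASK_START.finditer(log_text))


# Two "Starting task" lines taken from the excerpts; the second is left
# wrapped on purpose to show that wrapped lines are still counted.
sample = """\
14/06/12 08:02:31 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor 0: apex90.llnl.gov (PROCESS_LOCAL)
14/06/12 08:02:31 INFO TaskSetManager: Serialized task 1.0:0 as 2458 bytes in 2 ms
14/06/11 22:58:38 INFO TaskSetManager: Starting task 1.0:21 as TID 0 on
executor 5: apex80.llnl.gov (NODE_LOCAL)
"""

if __name__ == "__main__":
    for level, count in tally_locality(sample).items():
        print(level, count)
```

Running tally_locality over the full driver log of each job would give the
number of tasks per locality level, which makes it easier to compare runs
at different node counts.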

So my question is:

How is this data process-local?  I *just* copied it into HDFS.  No
Spark worker or executor should have loaded it.

Between runs I delete the data from HDFS, delete the Spark local dir
where data is cached, and restart the Spark daemons.

I've seen the behavior with Spark 0.9.1 and 1.0.0, although at
different node counts.  My environment is a bit unusual in that I
run HDFS over a parallel networked file system, but I think what I'm
seeing should be independent of that.

I'm sure there's something subtle I'm missing or not understanding.
Thanks in advance.

Al

-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

