Hello all,

I have run into two problems in my work, and came here for help.

My working scenario is as follows:
I have a cluster of four servers. One is used as the NameNode, and the other 
three are used as DataNodes.
The Hadoop version I am using is 3.3.3.
The HDFS replication factor is 3, so every file has one replica on each 
DataNode.


1.     Centralized Cache Management:



I used the command below to cache a subset of files into DRAM:
hdfs cacheadmin -addDirective -path <path> -pool <pool> -force
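
To double-check what was actually cached and where, I used standard commands like the following (the path and pool name are placeholders, not my real values):

```shell
# List cache directives and how many bytes/files are actually cached
hdfs cacheadmin -listDirectives -stats -path <path> -pool <pool>

# List cache pools and their usage
hdfs cacheadmin -listPools -stats

# Per-DataNode "Cache Used" / "Cache Remaining" figures
hdfs dfsadmin -report
```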



Since the default cache replication is 1, I expect each file to be cached in 
only one node's DRAM, and to be read on that node thanks to YARN's scheduler.

For example,

File 1 is cached in node 1's DRAM, so YARN should schedule the read task on 
node 1 when I read the file later.

File 2 is cached in node 2's DRAM, so the read task should be scheduled on node 2.

File 3 is cached in node 3's DRAM, so the read task should be scheduled on node 3.



But from what I observed, the read tasks seem to be scheduled 'randomly' 
across the DataNodes, without regard to which node caches the file.

For example,

File 1 is cached in node 1's DRAM, but it is read on node 2 when I trigger 
a read task.



Q: Why is the read task not scheduled on the node where the file is cached?
Is there a configuration that makes the task be scheduled on the DataNode 
that actually caches the file?
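
In case it is relevant, these are the settings I have been looking at; the values below are only examples of what I understand the settings to mean, not values I claim to be correct:

```xml
<!-- hdfs-site.xml: memory each DataNode may pin for cached blocks.
     Caching only happens when this is > 0 and within `ulimit -l`. -->
<property>
        <name>dfs.datanode.max.locked.memory</name>
        <value>4294967296</value> <!-- example: 4 GB -->
</property>

<!-- capacity-scheduler.xml: how many scheduling opportunities the
     Capacity Scheduler will skip while waiting for a node-local
     container before falling back to rack-local. -->
<property>
        <name>yarn.scheduler.capacity.node-locality-delay</name>
        <value>40</value>
</property>
```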



2.     JVM Reuse Function:



I tried to configure the Hadoop MapReduce parameter below so that multiple 
tasks in one job reuse the same JVM.



mapred-site.xml
<property>
        <name>mapreduce.job.jvm.numtasks</name>
        <value>20</value>
</property>



But in the Hadoop MapReduce job logs I still see the tasks using different 
containers (I assume that different JVMs correspond to different containers).



Q: Is there any configuration that blocks this JVM reuse feature? How can 
I tell whether JVM reuse is actually enabled for my job?
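
A related setting I came across is uber mode, which as I understand it runs all tasks of a sufficiently small job inside the ApplicationMaster's JVM. Is that the intended way to share one JVM across tasks on YARN? The thresholds below are the defaults as I understand them:

```xml
mapred-site.xml
<property>
        <name>mapreduce.job.ubertask.enable</name>
        <value>true</value>
</property>
<property>
        <name>mapreduce.job.ubertask.maxmaps</name>
        <value>9</value>
</property>
<property>
        <name>mapreduce.job.ubertask.maxreduces</name>
        <value>1</value>
</property>
```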

Thanks a lot

Pan Yong
