Re: 3.1.0 MR work won't distribute after dual-homing NameNode

Jeff Hubbs Fri, 15 Jun 2018 08:44:27 -0700

Yet at the same time, I'm looking at the log for the ResourceManager daemon (which runs on msba02b along with the NodeManager and DataNode daemons, just to take some load off of msba02a) and it knows all three nodes are there:


   INFO
   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
   Added node msba02a:38409 cluster capacity: <memory:16384, vCores:8>
   INFO
   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
   Added node msba02b:38809 cluster capacity: <memory:32768, vCores:16>
   INFO
   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
   Added node msba02c:38025 cluster capacity: <memory:49152, vCores:24>

And the NodeManager daemon on that same machine (which won't do any mapreduce work) is indeed connecting to the ResourceManager:


   INFO
   org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
   Registered with ResourceManager as msba02b:38809 with total resource
   of <memory:16384, vCores:8>

In fact it looks as though the whole cluster is just getting along with itself swimmingly:


   $ for d in msba02a msba02b msba02c; do echo $d; ssh root@"$d" "grep FATAL /var/log/hadoop/*.log"; done
   msba02a
   msba02b
   msba02c

Do I need to try different example code? Any other ideas for me to look into?

On 6/14/18 6:26 PM, Jeff Hubbs wrote:

I looked through that material and used it to start over with a new binary distribution of 3.1.0 on all three machines, and the situation is unchanged. Again, HDFS works perfectly in write and read, but the wordcount mapreduce job will still only run on one thread at a time on the machine on which I execute the job (doing so doesn't exercise the other two machines' DataNode daemons because all three machines run the DataNode daemon and replication is set to three, so the work doesn't need to read from the other datanodes).
I tried using the start-yarn.sh script instead of starting the ResourceManager daemon and the NodeManager daemons separately; it didn't make any difference, but the script ran ResourceManager on msba02a instead of msba02b as intended, which eventually made the NodeManagers throw warnings and shut down because, whereas start-yarn.sh apparently didn't pay any attention to the value of yarn.resourcemanager.hostname (msba02b), the NodeManagers did.
Since I have about six minutes while the wordcount job finishes, I have plenty of time to knock around the web interfaces, and while the ResourceManager app reports three healthy nodes with 48GiB total RAM and 24 VCores like I expect, everything else is zeroed out while the job is running; Apps Running, Containers Running, Apps Submitted, Memory Used, Memory Reserved, VCores Reserved are all zero and the table underneath Scheduler Metrics just says "No data available in table."
I think it's noteworthy that not only does work not distribute across machines, it doesn't even distribute across threads and cores in the one machine the job runs on. As I said, before I went to static IPs and a captive LAN for the three nodes, all 24 threads would light up running this job.
On 6/13/18 1:59 PM, Jeff Hubbs wrote:
Gour -
Thank you; I'll certainly look into that. On Monday I performed an experiment where I reduced the cluster down to two nodes, putting all the daemons that were unique to msba02a onto msba02b and reconstructing HDFS as appropriate. This way, no active machine was dual-homed; they ran as they had before I changed the network except for having static IPs and name resolution via host table. When I did this and ran the wordcount mapreduce job, I observed the same behavior: everything ran on just one core of msba02b until the output file (with all the found words and their number of instances - it's a 770-MiB file) was written back out to HDFS.
I'm about to start part of the way over with a fresh binary distribution of 3.1.0 and see what happens. I thought I would also look into the systems' name resolution priority and make /etc/hosts come first.
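One way to check that name resolution priority is to read the "hosts:" line of /etc/nsswitch.conf, which on glibc systems lists the lookup sources in the order they are tried. The following is a minimal sketch rather than a definitive check: the function names hosts_sources and files_first are made up for illustration, and it assumes a glibc-style nsswitch.conf.

```shell
# Print the lookup sources from the "hosts:" line of an nsswitch-style
# file, one per line, in the order the resolver would try them.
# Defaults to /etc/nsswitch.conf; pass another path to inspect a copy.
hosts_sources() {
    awk '{ sub(/#.*/, "") }                       # drop trailing comments
         $1 == "hosts:" { for (i = 2; i <= NF; i++) print $i; exit }' \
        "${1:-/etc/nsswitch.conf}"
}

# Succeed only when "files" (that is, /etc/hosts) is the first source.
files_first() {
    [ "$(hosts_sources "$1" | head -n 1)" = "files" ]
}
```

On each node, something like `files_first && echo "hosts table wins"` would report the ordering, and `getent hosts msba02b` would show which address the resolver actually hands back for a given name.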
On 6/13/18 11:02 AM, Gour Saha wrote:
Looks like the YARN/MR multihoming doc patch never got committed and hence is not available in the site documentation. You can look into the doc patch in https://issues.apache.org/jira/browse/YARN-2384 (maybe use an online markdown tool to view it better) and see if you followed the configuration mentioned there. Another comprehensive multihoming document which might help you is here <https://hortonworks.com/blog/multihoming-on-hadoop-yarn-clusters/>.
-Gour

*From: *Jeff Hubbs <jhubbsl...@att.net>
*Date: *Tuesday, June 5, 2018 at 2:57 PM
*To: *"user@hadoop.apache.org" <user@hadoop.apache.org>
*Subject: *3.1.0 MR work won't distribute after dual-homing NameNode

Hi -
I have a three-node Hadoop 3.1.0 cluster on which the daemons are distributed like so:
Daemons on msba02a...
20112 NameNode
20240 DataNode
24101 JobHistoryServer
20918 WebAppProxyServer
20743 NodeManager
20476 SecondaryNameNode

Daemons on msba02b...
22547 DataNode
22734 ResourceManager
23007 NodeManager

Daemons on msba02c...
10005 NodeManager
9818 DataNode
All three nodes run Gentoo Linux and have either one or two volumes devoted to HDFS; HDFS reports a size of 5.7TiB.
Previously, HDFS and MapReduce (testing with the archetypical "wordcount" job on a 5.8GiB XML file) worked fine in an environment where all three machines were on the same office LAN and got their IP addresses from DHCP; dynamic DNS created network host names based on the machines' host names as reported by the machines' DHCP clients. FQDNs were used for all intra- and inter-machine references in the Hadoop configuration files.
Since then, I've changed things so that msba02a now has a second NIC that connects to an independent LAN along with the other two machines using their built-in NICs like before; msba02b and msba02c reach the Internet by going through NAT on msba02a. /etc/hosts on all three machines has been populated with the static IPs I gave them like so:
    127.0.0.1 localhost
    1.0.0.1 msba02a
    1.0.0.10 msba02b
    1.0.0.20 msba02c
So now if I shell into msba02a and run the wordcount job with the test XML file sitting in HDFS with replication set to 3, the job *does* run and gives me the expected output file...but the workload doesn't distribute to all cores on all nodes like before; it all executes on msba02a. In fact, it doesn't even run on all cores on msba02a; it seems to light up just one core at any given moment. The job used to run on the cluster in 1m48s; now it takes 5m56s (a ratio I can't understand; these are all four-core, eight-thread machines so I'd expect a ratio of close to 24:1, not 3:1). The only time the other two nodes light up at all is near the end of the job when the output file (770MiB) is written out to HDFS.
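For reference, the slowdown implied by those two timings can be worked out directly. This is only a sketch of the arithmetic; to_seconds and slowdown are illustrative helper names, and the durations are the two quoted above.

```shell
# Convert a duration written like 1m48s into whole seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ print $1 * 60 + $2 }'
}

# Print how many times longer the second duration is than the first.
slowdown() {
    awk -v before="$(to_seconds "$1")" -v after="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", after / before }'
}

slowdown 1m48s 5m56s   # 356 s / 108 s, prints 3.30
```

So the job now takes about 3.3 times as long, which is far short of the roughly 24:1 that a strict one-thread-versus-24-threads comparison would predict.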
I've gone through https://hadoop.apache.org/docs/current3/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html and set the values shown there to 1.0.0.1 in hdfs-site.xml on msba02a in hopes of getting the daemons to bind to the cluster-facing NIC instead of the outward-facing NIC, but it seems to me like HDFS is working exactly like it's supposed to. Note that the ResourceManager daemon runs on msba02b and therefore doesn't need to be bound to a particular NIC; it still uses that machine's only NIC like before except now its IP address is static and is resolved via its local /etc/hosts.
The only errors showing up in the daemon logs of any nodes seem to be e.g. "org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted" in hadoop-yarn-resourcemanager-msba02b.log and hadoop-mapred-historyserver-msba02a.log.
As for the hadoop run output, previously when everything was working things would get to the point where it would print out a series of lines like
    map 0% reduce 0%
and that line would repeat with the "map" percentage climbing first and then the "reduce" percentage would climb until both numbers reached 100% and the job would wrap up soon afterward. Now, it intersperses those lines with other output and it skips around, like this:
    *2018-06-05 17:45:34,338 INFO mapreduce.Job:  map 100% reduce 0%*
    2018-06-05 17:45:36,295 INFO mapred.MapTask: Finished spill 0
    2018-06-05 17:45:36,295 INFO mapred.MapTask: (RESET) equator
    61480136 kv 15370028(61480112) kvi 13480948(53923792)
    2018-06-05 17:45:36,882 INFO mapred.MapTask: Spilling map output
    2018-06-05 17:45:36,882 INFO mapred.MapTask: bufstart =
    61480136; bufend = 10372007; bufvoid = 104857566
    2018-06-05 17:45:36,882 INFO mapred.MapTask: kvstart =
    15370028(61480112); kvend = 7835876(31343504); length =
    7534153/6553600
    2018-06-05 17:45:36,882 INFO mapred.MapTask: (EQUATOR) 17997991
    kvi 4499492(17997968)
    2018-06-05 17:45:38,774 INFO mapred.MapTask: Finished spill 1
    2018-06-05 17:45:38,774 INFO mapred.MapTask: (RESET) equator
    17997991 kv 4499492(17997968) kvi 2642780(10571120)
    2018-06-05 17:45:38,910 INFO mapred.LocalJobRunner:
    2018-06-05 17:45:38,910 INFO mapred.MapTask: Starting flush of
    map output
    2018-06-05 17:45:38,910 INFO mapred.MapTask: Spilling map output
    2018-06-05 17:45:38,911 INFO mapred.MapTask: bufstart =
    17997991; bufend = 40956853; bufvoid = 104857600
    2018-06-05 17:45:38,911 INFO mapred.MapTask: kvstart =
    4499492(17997968); kvend = 1327036(5308144); length =
    3172457/6553600
    *2018-06-05 17:45:39,340 INFO mapreduce.Job: map 4% reduce 0%*
    2018-06-05 17:45:39,684 INFO mapred.MapTask: Finished spill 2
    2018-06-05 17:45:39,788 INFO mapred.Merger: Merging 3 sorted
    segments
    2018-06-05 17:45:39,788 INFO mapred.Merger: Down to the last
    merge-pass, with 3 segments left of total size: 34645401 bytes
    2018-06-05 17:45:40,251 INFO mapred.Task:
    Task:attempt_local1155504279_0001_m_000002_0 is done. And is in
    the process of committing
    2018-06-05 17:45:40,253 INFO mapred.LocalJobRunner: map > sort
    2018-06-05 17:45:40,253 INFO mapred.Task: Task
    'attempt_local1155504279_0001_m_000002_0' done.
    2018-06-05 17:45:40,253 INFO mapred.Task: Final Counters for
    attempt_local1155504279_0001_m_000002_0: Counters: 23
        File System Counters
            FILE: Number of bytes read=106419805
            FILE: Number of bytes written=202253153
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=410006948
            HDFS: Number of bytes written=0
            HDFS: Number of read operations=9
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=1
        Map-Reduce Framework
            Map input records=2653033
            Map output records=4553651
            Map output bytes=130562451
            Map output materialized bytes=31060160
            Input split bytes=95
            Combine input records=5425504
            Combine output records=1618222
            Spilled Records=1618222
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=114
            Total committed heap usage (bytes)=1301807104
        File Input Format Counters
            Bytes Read=134348800
    2018-06-05 17:45:40,253 INFO mapred.LocalJobRunner: Finishing
    task: attempt_local1155504279_0001_m_000002_0
    2018-06-05 17:45:40,253 INFO mapred.LocalJobRunner: Starting
    task: attempt_local1155504279_0001_m_000003_0
    2018-06-05 17:45:40,254 INFO output.FileOutputCommitter: File
    Output Committer Algorithm version is 2
    2018-06-05 17:45:40,254 INFO output.FileOutputCommitter:
    FileOutputCommitter skip cleanup _temporary folders under output
    directory:false, ignore cleanup failures: false
    2018-06-05 17:45:40,254 INFO mapred.Task:  Using
    ResourceCalculatorProcessTree : [ ]
    2018-06-05 17:45:40,255 INFO mapred.MapTask: Processing split:
    hdfs://msba02a:9000/allcat.xml:268435456+134217728
    2018-06-05 17:45:40,265 INFO mapred.MapTask: (EQUATOR) 0 kvi
    26214396(104857584)
    2018-06-05 17:45:40,266 INFO mapred.MapTask:
    mapreduce.task.io.sort.mb: 100
    2018-06-05 17:45:40,266 INFO mapred.MapTask: soft limit at 83886080
    2018-06-05 17:45:40,266 INFO mapred.MapTask: bufstart = 0;
    bufvoid = 104857600
    2018-06-05 17:45:40,266 INFO mapred.MapTask: kvstart = 26214396;
    length = 6553600
    2018-06-05 17:45:40,266 INFO mapred.MapTask: Map output
    collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
    *2018-06-05 17:45:40,341 INFO mapreduce.Job: map 100% reduce 0%*
    2018-06-05 17:45:41,079 INFO mapred.MapTask: Spilling map output
    2018-06-05 17:45:41,079 INFO mapred.MapTask: bufstart = 0;
    bufend = 53799451; bufvoid = 104857600
    2018-06-05 17:45:41,079 INFO mapred.MapTask: kvstart =
    26214396(104857584); kvend = 18692744(74770976); length =
    7521653/6553600
    2018-06-05 17:45:41,079 INFO mapred.MapTask: (EQUATOR) 61425451
    kvi 15356356(61425424)
    2018-06-05 17:45:43,110 INFO mapred.MapTask: Finished spill 0
    2018-06-05 17:45:43,110 INFO mapred.MapTask: (RESET) equator
    61425451 kv 15356356(61425424) kvi 13514352(54057408)
    2018-06-05 17:45:43,687 INFO mapred.MapTask: Spilling map output
    2018-06-05 17:45:43,687 INFO mapred.MapTask: bufstart =
    61425451; bufend = 10294846; bufvoid = 104857586
    2018-06-05 17:45:43,687 INFO mapred.MapTask: kvstart =
    15356356(61425424); kvend = 7816592(31266368); length =
    7539765/6553600
    2018-06-05 17:45:43,687 INFO mapred.MapTask: (EQUATOR) 17920846
    kvi 4480204(17920816)
    2018-06-05 17:45:46,275 INFO mapred.MapTask: Finished spill 1
    2018-06-05 17:45:46,275 INFO mapred.MapTask: (RESET) equator
    17920846 kv 4480204(17920816) kvi 2573716(10294864)
    2018-06-05 17:45:46,423 INFO mapred.LocalJobRunner:
    2018-06-05 17:45:46,423 INFO mapred.MapTask: Starting flush of
    map output
    2018-06-05 17:45:46,423 INFO mapred.MapTask: Spilling map output
    2018-06-05 17:45:46,423 INFO mapred.MapTask: bufstart =
    17920846; bufend = 41420321; bufvoid = 104857600
    2018-06-05 17:45:46,423 INFO mapred.MapTask: kvstart =
    4480204(17920816); kvend = 1126824(4507296); length =
    3353381/6553600
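The output above carries mapred.LocalJobRunner lines and attempt IDs of the form attempt_local..., which is what MapReduce prints when the job runs inside the client JVM instead of being submitted to YARN. That would be consistent with mapreduce.framework.name being left at its default of "local" rather than "yarn" in the configuration the client actually reads, though the thread does not confirm it. A minimal sketch for checking saved job output follows; job_mode is a made-up helper name, and the "cluster" pattern assumes the usual YARN-style IDs such as job_<timestamp>_<sequence>.

```shell
# Classify saved `hadoop jar ...` console output by where the tasks ran.
# Prints "local" if LocalJobRunner markers are present, "cluster" if only
# YARN-style job/attempt IDs are present, and "unknown" otherwise.
job_mode() {
    if grep -Eq 'mapred\.LocalJobRunner|(job|attempt)_local[0-9]+_' "$1"; then
        echo local
    elif grep -Eq '(job|attempt)_[0-9]+_[0-9]+' "$1"; then
        echo cluster
    else
        echo unknown
    fi
}
```

Capturing a run with something like `hadoop jar ... 2>&1 | tee wordcount.out` and then calling `job_mode wordcount.out` would give a quick answer. A "local" result would also fit the ResourceManager web UI showing no submitted or running applications while the job is in progress.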
Any hints as to why work isn't distributing? It seems to me like this kind of network configuration for Hadoop clusters would be more the norm than one where all nodes are on a network with everything else in an environment (in our situation one driver for having cluster traffic isolated is that the data files used may contain NDA-bound data that shouldn't travel the office LAN unencrypted).
Thanks!
