[
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253589#comment-17253589
]
Szilard Nemeth commented on YARN-10427:
---------------------------------------
Hi [~werd.up],
Thanks for reporting this issue and congratulations on your first reported
Hadoop YARN jira.
{quote}In the process of attempting to verify and validate the SLS output, I've
encountered a number of issues including runtime exceptions and bad output.
{quote}
I read through your observations and spent some time playing around with SLS.
If you encountered other issues, please file separate jiras for them if you
have some time.
As the process of running SLS involved some repetitive tasks like uploading
configs to the remote machine, launching SLS, and saving the resulting logs, I
created some scripts in my public GitHub repo here:
[https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427]
Let me briefly summarize what these scripts do:
1. [config
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/config]:
This is the exact same configuration file set that you attached to this jira,
with one exception: the log4j.properties file, which turns on DEBUG logging
for SLS.
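For reference, turning on DEBUG logging for SLS is just a logger-level
override; a hypothetical snippet (the exact logger names in the attached file
may differ):
{code}
# Hypothetical log4j.properties override; see the config dir for the real file
log4j.logger.org.apache.hadoop.yarn.sls=DEBUG
{code}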
2. [upstream-patches
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches]:
This directory contains the logging patch that helped me see the issues more
clearly.
My code changes are also pushed to my Hadoop fork:
[https://github.com/szilard-nemeth/hadoop/tree/YARN-10427-investigation]
3. [scripts
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts]:
This directory contains all my scripts to build Hadoop, launch SLS, and save
the produced logs to the local machine.
As I have been working on a remote cluster, there's a script called
[setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]
that contains some configuration values for the remote cluster plus some local
directories. If you want to use the scripts, all you need to do is replace
the configs in this file according to your environment.
3.1
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]:
This is the script that builds Hadoop according to the environment variables
and launches the SLS suite on the remote cluster.
3.2
[start-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/start-sls.sh]:
This is the most important script, as it is the one executed on the remote
machine.
I think the script itself is straightforward enough, but let me briefly list
what it does:
- This script assumes that the Hadoop dist package is copied to the remote
machine (this was done by
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh])
- Cleans up all Hadoop-related directories and extracts the Hadoop dist tar.gz
- Copies the config to Hadoop's config dirs so SLS will use these particular
configs
- Launches SLS by starting slsrun.sh with the appropriate CLI switches
- Greps for some useful data in the resulting SLS log file.
3.3
[launch-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/launch-sls.sh]:
This script is executed by
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]
as its last step. Once start-sls.sh is finished, the
[save-latest-sls-logs.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/save-latest-sls-logs.sh]
script is started. As the name implies, it saves the latest SLS log dir and
SCPs it to the local machine. The target directory on the local machine is
determined by the config
([setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]).
*The latest logs and grepped logs for the SLS run are saved to my repo
[here.|https://github.com/szilard-nemeth/linux-env/tree/96ed3d8af9f4677866652bb57153713b29f24a98/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513]*
h2. What causes the duplicate Job IDs
1. The jobruntime.csv file is written by the SchedulerMetrics class; you
can see the init part
[here|https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SchedulerMetrics.java#L180-L186].
2. The jobruntime records (lines of CSV file) are written with method
[SchedulerMetrics#addAMRuntime|https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SchedulerMetrics.java#L661-L674].
We only need to check the call hierarchy of this method to reveal the reason
for the duplicate application IDs.
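Before looking at the call hierarchies, note what addAMRuntime does: based on
the linked code, it unconditionally appends one CSV row per call, roughly like
this simplified sketch:
{code:java}
// Simplified sketch of SchedulerMetrics#addAMRuntime based on the linked code;
// each call appends one row, so two calls for the same app produce two rows.
void addAMRuntime(ApplicationId appId, long traceStartTimeMS,
    long traceEndTimeMS, long simulateStartTimeMS, long simulateEndTimeMS) {
  try {
    String line = appId + "," + traceStartTimeMS + "," + traceEndTimeMS
        + "," + simulateStartTimeMS + "," + simulateEndTimeMS;
    jobRuntimeLogBW.write(line + EOL);  // jobruntime.csv writer
    jobRuntimeLogBW.flush();
  } catch (IOException e) {
    e.printStackTrace();
  }
}
{code}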
*2.1 Call hierarchy #1 (From bottom to top):*
{code:java}
org.apache.hadoop.yarn.sls.scheduler.SchedulerMetrics#addAMRuntime
org.apache.hadoop.yarn.sls.appmaster.AMSimulator#lastStep
org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator#lastStep
org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator#processResponseQueue{code}
*2.2 Call hierarchy #2 (From bottom to top):*
{code:java}
org.apache.hadoop.yarn.sls.scheduler.SchedulerMetrics#addAMRuntime
org.apache.hadoop.yarn.sls.appmaster.AMSimulator#lastStep
org.apache.hadoop.yarn.sls.scheduler.TaskRunner.Task#run
{code}
3. These duplicate calls of MRAMSimulator#lastStep can easily be confirmed
from the logs as well:
[apps-shuttingdown.log|https://github.com/szilard-nemeth/linux-env/blob/0d41e4dbda5e3a22105c4fe27f540ae8004857fe/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/grepped/apps-shuttingdown.log]
In this logfile, it's clearly visible that 9 apps
(application_1608638719822_0001 - application_1608638719822_0009) are "shutting
down" twice.
This is because MRAMSimulator#lastStep is called twice.
As MRAMSimulator#lastStep calls
org.apache.hadoop.yarn.sls.appmaster.AMSimulator#lastStep (the super method),
I added some logging that prints the stacktrace of lastStep method calls:
[AMSimulator#lastStep|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L223-L225].
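The logging itself just prints a throwaway exception to capture the call site;
a sketch of the idea (the real change is on the linked branch):
{code:java}
// Sketch: pass a fresh Exception so the logger prints the current stacktrace.
// With SLF4J, a trailing Throwable argument is printed with its stacktrace.
LOG.info("Application {} is shutting down. lastStep Stacktrace",
    appId, new Exception());
{code}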
Let's take application_1608638719822_0001 as an example with this file:
[laststep-calls-for-app0001.log|https://github.com/szilard-nemeth/linux-env/blob/96ed3d8af9f4677866652bb57153713b29f24a98/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/laststep-calls-for-app0001.log]
4. Checking the 2 stacktraces:
*4.1 Stacktrace #1: Call to lastStep from MRAMSimulator#processResponseQueue,
when all mappers/reducers are finished:*
{code:java}
at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.lastStep(AMSimulator.java:224)
at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.lastStep(MRAMSimulator.java:401)
at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.processResponseQueue(MRAMSimulator.java:195)
at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:101)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
[TaskRunner$Task.run|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java#L101]
calls AMSimulator#middleStep.
Then, in
[MRAMSimulator.processResponseQueue|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java#L194-L196],
there's a code piece that checks for completed mappers and reducers.
If the number of finished mappers is greater than or equal to the total number
of mappers, and the same holds for reducers, lastStep will be called.
{code:java}
if (mapFinished >= mapTotal && reduceFinished >= reduceTotal) {
  lastStep();
}
{code}
*4.2 Stacktrace #2: Call to lastStep from
[TaskRunner$Task.run|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java#L89-L113]*
{code:java}
at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.lastStep(AMSimulator.java:224)
at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.lastStep(MRAMSimulator.java:401)
at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:106)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
According to my code inspection, all NMs and AMs are scheduled with this
TaskRunner from SLSRunner.
The call hierarchy of launching an AM is this (from bottom to top):
{code:java}
TaskRunner.schedule(Task) (org.apache.hadoop.yarn.sls.scheduler)
SLSRunner.runNewAM(String, String, String, String, long, long, List<ContainerSimulator>, ...) (org.apache.hadoop.yarn.sls)
SLSRunner.runNewAM(String, String, String, String, long, long, List<ContainerSimulator>, ...) (org.apache.hadoop.yarn.sls)
SLSRunner.createAMForJob(Map) (org.apache.hadoop.yarn.sls)
SLSRunner.startAMFromSLSTrace(String) (org.apache.hadoop.yarn.sls)
SLSRunner.startAM() (org.apache.hadoop.yarn.sls)
SLSRunner.start() (org.apache.hadoop.yarn.sls)
SLSRunner.run(String[]) (org.apache.hadoop.yarn.sls){code}
As the AM implementation is the
[AMSimulator|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java]
class, which extends TaskRunner.Task (which in turn implements the Runnable
interface), all the interesting things happen in
[org.apache.hadoop.yarn.sls.scheduler.TaskRunner.Task#run|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java#L89-L113].
Initially, the field _nextRun_ is equal to _startTime_, so the firstStep
method is invoked.
For subsequent calls of run, while _nextRun_ < _endTime_, middleStep is
executed.
The _nextRun_ field is always incremented by the value of _repeatInterval_
(which is 1000ms with the default config).
This means that all AMSimulator tasks are scheduled every second.
Once _nextRun_ exceeds _endTime_, lastStep is called.
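To make the stepping logic concrete, here is a condensed sketch of the
run-method logic as I read the linked source (error handling omitted; not the
verbatim code):
{code:java}
// Condensed sketch of TaskRunner.Task#run; queue is the TaskRunner's delay
// queue that re-schedules the task after repeatInterval milliseconds.
public void run() {
  if (nextRun == startTime) {
    firstStep();                // very first invocation of the task
    nextRun += repeatInterval;
    queue.add(this);            // re-schedule
  } else if (nextRun < endTime) {
    middleStep();               // periodic step, every repeatInterval ms
    nextRun += repeatInterval;
    queue.add(this);            // re-schedule
  } else {
    lastStep();                 // nextRun reached endTime: final step
  }
}
{code}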
h2. Conclusion for duplicate Job IDs
These 2 calls to lastStep are the main reason for the duplicate application
IDs in the jobruntime.csv file.
It's not clear to me why this lastStep method is invoked both through
[AMSimulator#middleStep|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L209]
(and ultimately through
[AMSimulator#processResponseQueue|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L212])
and from the main loop of TaskRunner$Task.
*I suppose this method should be invoked only once per AM!*
What is even more interesting is that 9 out of 10 apps had this method called
twice, according to this log file:
[apps-shuttingdown.log|https://github.com/szilard-nemeth/linux-env/blob/0d41e4dbda5e3a22105c4fe27f540ae8004857fe/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/grepped/apps-shuttingdown.log].
But for the last application it is only called once:
{code:java}
2020-12-22 04:09:47,892 INFO appmaster.AMSimulator: Application application_1608638719822_0010 is shutting down. lastStep Stacktrace
{code}
All I can see is that the only call to lastStep for app 0010 is this:
(This is from [log
file|https://raw.githubusercontent.com/szilard-nemeth/linux-env/master/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/output.log])
{code:java}
2020-12-22 04:09:47,892 INFO appmaster.AMSimulator: Application application_1608638719822_0010 is shutting down. lastStep Stacktrace
java.lang.Exception
at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.lastStep(AMSimulator.java:224)
at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.lastStep(MRAMSimulator.java:401)
at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.processResponseQueue(MRAMSimulator.java:195)
at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:101)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
_*This is the call from MRAMSimulator.processResponseQueue that verifies the
number of completed mappers/reducers.*_
_*The other call, which checks the timestamps in TaskRunner$Task.run, never
happens, meaning that the last application never reaches its intended running
time.*_
_*This could be counted as "another bug", but unfortunately I wasn't able
to find out why this anomaly happens.*_
h2. Other observations
If I grep for any container ID that belongs to any of the 9 applications that
had duplicate Job IDs in the jobruntime.csv file, each of the apps has a log
record like this in the output.log:
{code:java}
2020-12-22 04:07:11,980 INFO scheduler.AbstractYarnScheduler: Container container_1608638719822_0001_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
{code}
[See an example
here|https://github.com/szilard-nemeth/linux-env/blob/96ed3d8af9f4677866652bb57153713b29f24a98/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/grepped/container_1608638719822_0001_01_000001.log#L32].
I think this is also happening because of the duplicate call to the lastStep
method.
h2. Possible fix for duplicate Job IDs
The task is to prevent lastStep from being called twice.
Without fully understanding the reason for the two calls above, or the
potential side-effects of removing either of them, let's check what lastStep
does.
The implementation of lastStep for MRAMSimulator delegates to the superclass:
[AMSimulator#lastStep|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L222-L273].
*There are several things happening in this method:*
- App is unregistered / untracked.
- If the amContainer is not null, the NM of the AM will be notified and the AM
container will be marked as completed
[here|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L231-L238]
- The AM is unregistered from the RM
[here|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L246-L263].
- The finish time of the AM is set; this is the only write access to this
field:
[here|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L265].
- The job's runtime information will be persisted to the jobruntime.csv file
[here|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L266-L272].
*I think all of these actions must be prevented from running more than once!*
As there is only one field update in the lastStep method, a quick and dirty
solution that avoids introducing a new boolean flag to track whether lastStep
was already called is to check whether the
_org.apache.hadoop.yarn.sls.appmaster.AMSimulator#simulateFinishTimeMS_ field
has been modified (i.e. is greater than zero, zero being the default value of
long fields). As the only writer of this field is the single write in the
lastStep method, it's safe to check: if it is greater than zero, lastStep was
called before.
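A minimal sketch of the guard, assuming the field and method names described
above (the attached patch is the authoritative version):
{code:java}
// Sketch of the proposed guard at the top of AMSimulator#lastStep.
// simulateFinishTimeMS is only ever written near the end of lastStep, so a
// value greater than zero means lastStep has already run for this AM.
public synchronized void lastStep() throws Exception {
  if (simulateFinishTimeMS > 0) {
    // lastStep already executed for this AM: skip the duplicate call
    return;
  }
  // ... original body: unregister the app, release the AM container,
  // unregister the AM from the RM, set simulateFinishTimeMS,
  // write the jobruntime.csv record ...
}
{code}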
h2. Test run with the fix
The fix patch is added
[here|https://github.com/szilard-nemeth/linux-env/blob/9bd94311a900b79764d2ee26db16aed312a7fff7/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches/0002-YARN-10427-Prevent-second-call-of-AMSimulator-lastSt.patch]
It is also uploaded as an attachment to this jira as a candidate for commit,
as I think it's a proper fix.
The logs of the "fixed run" can be found here:
[https://github.com/szilard-nemeth/linux-env/tree/9bd94311a900b79764d2ee26db16aed312a7fff7/workplace-specific/cloudera/investigations/YARN-10427/fixed-logs]
1. The shutting down messages for applications look way better: there are
only 10 messages for 10 apps, which is correct:
[apps-shuttingdown.log|https://github.com/szilard-nemeth/linux-env/blob/master/workplace-specific/cloudera/investigations/YARN-10427/fixed-logs/grepped/apps-shuttingdown.log]
2. The
[jobruntime.csv|https://github.com/szilard-nemeth/linux-env/blob/9bd94311a900b79764d2ee26db16aed312a7fff7/workplace-specific/cloudera/investigations/YARN-10427/fixed-logs/jobruntime.csv]
file also looks good. There's one entry per application now.
3. In the
[output.log|https://github.com/szilard-nemeth/linux-env/blob/9bd94311a900b79764d2ee26db16aed312a7fff7/workplace-specific/cloudera/investigations/YARN-10427/fixed-logs/output.log]
file, there are still weird messages when the AM container is finished, for
all the apps:
{code:java}
[root@snemeth-fips2-1 slsrun-out-20201222_063242]# grep "but corresponding RMContainer doesn't exist" output.log
2020-12-22 06:34:40,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0002_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:34:41,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0001_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:35:05,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0003_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:35:10,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0005_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:35:30,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0006_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:36:04,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0009_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:36:04,373 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0008_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:36:20,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0004_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:36:26,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0007_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
{code}
So, contrary to my expectations, this is not caused by the double call of
lastStep.
> Duplicate Job IDs in SLS output
> -------------------------------
>
> Key: YARN-10427
> URL: https://issues.apache.org/jira/browse/YARN-10427
> Project: Hadoop YARN
> Issue Type: Bug
> Components: scheduler-load-simulator
> Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
> Environment: I ran the attached inputs on my MacBook Pro, using
> Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also
> tested against 3.2.1 and 3.3.0 release branches.
>
> Reporter: Drew Merrill
> Assignee: Szilard Nemeth
> Priority: Major
> Attachments: fair-scheduler.xml, inputsls.json, jobruntime.csv,
> jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've
> been having with the YARN Scheduler Load Simulator (SLS). I've been
> experimenting with SLS for several months now at work as we're trying to
> build a simulation model to characterize our enterprise Hadoop infrastructure
> for purposes of future capacity planning. In the process of attempting to
> verify and validate the SLS output, I've encountered a number of issues
> including runtime exceptions and bad output. The focus of this issue is the
> bad output. In all my simulation runs, the jobruntime.csv output seems to
> have one or more of the following problems: no output, duplicate job ids,
> and/or missing job ids.
>
> Because of where I work, I'm unable to provide the exact inputs I typically
> use, but I'm able to reproduce the problem of the duplicate Job IDs using
> some simplified inputs and configuration files, which I've attached, along
> with the output I obtained.
>
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json
> --output-dir=sls-run-1 --print-simulation
> --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>
> Can anyone help me understand what would cause the duplicate Job IDs in the
> output? Is this a bug in Hadoop or a problem with my inputs? Thanks in
> advance.
>
> PS: This is my first issue I've ever opened so please be kind if I've missed
> something or am not understanding something obvious about the way Hadoop
> works. I'll gladly follow-up with more info as requested.