One more comment in case it's useful for others. If you use raw IB GUIDs as the 
SwitchName in topology.conf (what comes out of ibnetdiscover), you can exhaust 
the maximum line length allowed by default on a large fat-tree system. In order 
to define a (fictitious) top-level switch which aggregates all of the L2 
switches referenced in this discussion, a bump in the max line length was 
needed in our case, which was accomplished with this small patch:

-k

--- src/common/parse_config.c   2012-12-06 14:29:26.000000000 -0600
+++ src/common/parse_config.c   2013-01-02 04:05:46.000000000 -0600
@@ -69,7 +69,7 @@
 strong_alias(s_p_hashtbl_destroy,      slurm_s_p_hashtbl_destroy);
 strong_alias(s_p_parse_file,           slurm_s_p_parse_file);
 
-#define BUFFER_SIZE 4096
+#define BUFFER_SIZE 8192
 
 #define CONF_HASH_LEN 26
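
For context, the topology.conf in question looks roughly like the sketch below. 
The switch and node names here are short, hypothetical placeholders; with raw 
IB GUIDs as switch names, the single Switches= line for the fictitious 
top-level switch can exceed the default 4096-byte buffer:

```
# Leaf (L1) switches, each connected to a block of hosts
SwitchName=leaf001 Nodes=c[0001-0016]
SwitchName=leaf002 Nodes=c[0017-0032]
# ... ~300 more leaf lines ...

# Level-2 switches connecting the leaves
SwitchName=spine001 Switches=leaf[001-300]
# ... 288 spine lines in total ...

# Fictitious top-level switch aggregating every L2 switch; with raw
# GUID names this one line can overflow BUFFER_SIZE (4096 bytes).
SwitchName=top Switches=spine[001-288]
```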




On Jan 1, 2013, at 8:39 PM, Karl Schulz <[email protected]> wrote:

> 
> One additional item to mention for context is that the tests with --switches 
> are made with the reordering patch discussed at:
> 
> https://groups.google.com/forum/?fromgroups=#!topic/slurm-devel/WuGPIg_U3sg
> 
> Without this, it doesn't appear that the topology file is honored with 
> --switches.  One other observation is that using --switches while also 
> requesting a specific host with "-w host" can result in scheduling which 
> appears to violate the topology config (assuming of course I haven't screwed 
> something up in the config).
> 
> -k
> 
> 
> On Jan 1, 2013, at 3:52 PM, Karl Schulz <[email protected]> wrote:
> 
>> 
>> Hello again,
>> 
>> Apologies for the slow barrage of rookie questions. I was curious whether 
>> others in the community see any degradation in slurm command interactivity 
>> when using the topology/tree plugin at large scale together with the 
>> "--switches" option.
>> 
>> If I enable topology/tree in version 2.4.5 and use the --switches flag for 
>> a single job, I can verify that it honors the switch topology provided in 
>> topology.conf as we expect. However, on the same idle system, a small job 
>> submitted without the "--switches" option, which should fit on 1 switch, is 
>> not scheduled to 1 switch. I understand from the docs that the scheduling 
>> may be sub-optimal, but was surprised to see that happen when there were no 
>> actively running jobs. Consequently, the remaining discussion is focused on 
>> testing with the --switches flag.
>> 
>> Note that based on the guidance provided in the docs for the topology.conf 
>> configuration, I have only defined 2 levels of the fat-tree topology (the 
>> first level connected to endpoint hosts, and the 2nd level which connects 
>> to all level-1 switches). This minimizes how many switches are provided to 
>> the plugin, but the file is still sizable because of the large number of 
>> hosts (6400 in this case, with > 300 L1 switches and 288 L2 switches).
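
Picking the smallest workable --switches value for each job, as described 
later in this message, can be sketched as below. This is a hypothetical 
illustration, not Slurm's actual selection code:

```python
# Hypothetical sketch (not Slurm's actual algorithm): given the number
# of idle nodes under each leaf (L1) switch, find the smallest
# --switches value that can accommodate a job needing `need` nodes,
# by packing the leaves with the most idle nodes first.
def min_switches(idle_per_leaf, need):
    total = 0
    for count, idle in enumerate(sorted(idle_per_leaf, reverse=True), 1):
        total += idle
        if total >= need:
            return count
    return None  # the job cannot fit even across all leaves

# Example: 4 leaf switches with varying idle counts; a 20-node job
# needs at least 2 leaves (16 + 12 >= 20).
print(min_switches([8, 16, 4, 12], 20))
```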
>> 
>> The issue seems to arise once I start submitting multiple jobs with 
>> "--switches" requests. Once there are more than a few such jobs, the 
>> response times of commands like squeue and sinfo intermittently degrade 
>> dramatically (more than a minute at times; more frequently 5-20 seconds).
>> 
>> This observation is derived from a simple test which submits 40 small 
>> sbatch jobs to an otherwise idle system. As the jobs are very small, the 
>> scheduler should be able to have all jobs running simultaneously.
>> 
>> Test Mode 1 (no topology requirements):
>> 
>> In the first test mode, I submit the jobs without any extra "--switches" 
>> options, and in this case, slurm schedules all the jobs almost instantly. In 
>> this mode, there is no noticeable interactive command degradation.
>> 
>> Test Mode 2 (each job includes an additional --switches option):
>> 
>> In the second test mode, each job adds a --switches=[num_switches] option, 
>> with "num_switches" chosen to be the smallest value for which the job can 
>> be accommodated topologically. The 40 jobs are submitted in sequence from a 
>> simple shell script, and as the jobs begin to be accepted, slurm command 
>> interactivity becomes erratic. In this mode, I have seen squeue -u <userid> 
>> take over a minute to complete. In addition to the sluggish interactivity 
>> (which seems to disappear eventually, after a subset of the jobs are 
>> running), it takes much longer for the topology jobs to schedule. I 
>> certainly understand that the space-filling curve algorithm will slow this 
>> process down, but it seems to take more than a factor of 2 longer on an 
>> idle system. Would you expect this? The strange part is that some of the 
>> jobs continue to report as pending for resources, although there are 
>> thousands of nodes which could satisfy the min switch request.
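
The submission loop described above can be sketched roughly as follows. The 
job script name, node count, and switch count are hypothetical, and echo is 
used as a dry run in place of the real sbatch invocation:

```shell
#!/bin/sh
# Dry-run sketch of the Test Mode 2 submission loop: submit num_jobs
# small jobs, each with a --switches value (here fixed; in practice
# the smallest value that accommodates the job topologically).
submit_jobs() {
    num_jobs=$1
    num_switches=$2
    i=1
    while [ "$i" -le "$num_jobs" ]; do
        # Drop the `echo` to actually submit with sbatch.
        echo sbatch --switches="$num_switches" -N 8 job.sbatch
        i=$((i + 1))
    done
}

submit_jobs 40 1
```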
>> 
>> To quantify the difference a bit, the time required from submission of the 
>> first job to completion of the 40th job is as follows:
>> 
>> (1) Test Mode 1 (no topo requirements): ~5 minutes 
>> (2) Test Mode 2 (each job with --switches option): ~12 minutes
>> 
>> 
>> Any thoughts on what might be amiss based on these tests?
>> 
>> Thanks again for any advice,
>> 
>> Karl
>> 
>> 
>> 
>> 
