Hey Matthieu,

I just fixed this in 2.3 with the patch below, which corrects the bitmap logic
going forward for jobs where the cluster's nodes aren't listed in numerical
order.
https://github.com/SchedMD/slurm/commit/183888a0dbf1036a31e1764147ee63a5675a4a3d
(download patch - 
https://github.com/SchedMD/slurm/commit/183888a0dbf1036a31e1764147ee63a5675a4a3d.patch)

This fixes the problem going forward, and the patch can probably be applied
cleanly to 2.2 systems as well, but it doesn't fix jobs recorded under past
system configurations.

If your system's node order doesn't change very often, though, you can
manually update the database to fix past jobs.

As an example of how to do this: I have a system, snowflake, that was defined
as snowflake[000-048], but suppose the real order was
snowflake[001-006,000,007-048].

Before the fix, the database always represented the order as
snowflake[000-048].  With this fix the order is kept unsorted, so the bitmaps
are now correct.
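
To make the mismatch concrete, here is a minimal standalone sketch (plain C,
not SLURM code, with made-up node names) of what was happening before the fix:
a bit index recorded against the configured node order points at a different
node when it is decoded against the sorted order that was stored in the
database.

/*
 * Hypothetical illustration of the node-index mismatch; the node names and
 * the five-node cluster are made up for the example.
 */
#include <stdio.h>
#include <string.h>

#define NNODES 5

/* The numerically sorted list the database used to store ... */
static const char *sorted_order[NNODES] = {
    "snow000", "snow001", "snow002", "snow003", "snow004"
};
/* ... versus the order the nodes were actually defined in. */
static const char *config_order[NNODES] = {
    "snow001", "snow002", "snow000", "snow003", "snow004"
};

/* Return the bit index of a node name within a given ordering, or -1. */
static int node_index(const char **order, const char *name)
{
    for (int i = 0; i < NNODES; i++)
        if (strcmp(order[i], name) == 0)
            return i;
    return -1;
}

int main(void)
{
    const char *job_node = "snow000";

    /* The job's node index was recorded using the configured order... */
    int recorded_bit = node_index(config_order, job_node);

    /* ...but queries decoded that index against the sorted order. */
    printf("job ran on %s, recorded as bit %d\n", job_node, recorded_bit);
    printf("bit %d decoded against the sorted list is %s\n",
           recorded_bit, sorted_order[recorded_bit]);
    return 0;
}

Run, this reports that the bit recorded for snow000 decodes to snow002 against
the sorted list, which is the kind of shift that made node-based queries
return the wrong subset of nodes.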

If you want to fix old entries, you can go into the database and replace the
'cluster_nodes' value in the ($cluster_name)_event_table with the correct
ordering, which will fix past jobs as well.

In my case, if the system had always been snowflake[001-006,000,007-048]
instead of snowflake[000-048], I would do this...

mysql> select * from snowflake_event_table where node_name='';
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+
| time_start | time_end   | node_name | cluster_nodes                  | cpu_count | reason                  | reason_uid | state |
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+
| 1311724103 | 1317314431 |           | snowflake[000-048]             |       430 | Cluster processor count | 4294967294 |     0 |
| 1317314431 |          0 |           | snowflake[001-006,000,007-048] |       430 | Cluster processor count | 4294967294 |     0 |
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+

mysql> update snowflake_event_table set 
cluster_nodes='snowflake[001-006,000,007-048]' where 
cluster_nodes='snowflake[000-048]' and 
node_name='';

mysql> select * from snowflake_event_table where node_name='';
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+
| time_start | time_end   | node_name | cluster_nodes                  | cpu_count | reason                  | reason_uid | state |
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+
| 1311724103 | 1317314431 |           | snowflake[001-006,000,007-048] |       430 | Cluster processor count | 4294967294 |     0 |
| 1317314431 |          0 |           | snowflake[001-006,000,007-048] |       430 | Cluster processor count | 4294967294 |     0 |
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+

I understand this won't be easy if your system changes shape often, but
hopefully for most cases this won't be too troublesome.  Your patch should
work in all cases as well, but, as you are probably aware, it adds a great
deal of overhead.

Thanks for reporting the problem; this patch will be in 2.3.1.
Danny


On Friday September 16 2011 9:37:46 AM Matthieu Hautreux wrote:
> Hi,
> 
> we are experiencing a problem with the behavior of the sacct command when
> only jobs running on a subset of nodes are requested. In some situations, we
> get results corresponding to a different subset of nodes.
> 
> The running configuration is:
> 
> [root@lascaux0 ~] # grep StorageType /etc/slurm/slurm.conf
> AccountingStorageType=accounting_storage/slurmdbd
> [root@lascaux0 ~] # grep StorageType /etc/slurm/slurmdbd.conf
> StorageType=accounting_storage/mysql
> [root@lascaux0 ~] #
> 
> 
> Example of the problem :
> 
> [root@lascaux0 ~] # sacct -Xno nodelist%100 -N lascaux9014
> 
> lascaux5011
> 
> lascaux[5001,5003,5011-5012,5014,5039,5050,5093,5117-5118,5123,5125,5134,5162,5165]
> [root@lascaux0 ~] #
> 
> [root@lascaux0 ~] # sacct -Xno nodelist%100 -N lascaux5011
> 
> lascaux[9011-9012]
> 
> lascaux[9011-9013]
> 
> lascaux[9011-9013]
> 
> lascaux[9011-9013]
> 
> lascaux[9004-9005,9011-9012]
> 
> lascaux[9004-9005,9011-9012]
> 
> lascaux[9004-9005,9011-9012]
> [root@lascaux0 ~] #
> 
> This problem seems to be due to the fact that the engine that retrieves the
> jobs matching the nodelist criterion compares two bitmaps that are not built
> against the same reference.
> 
> For the MySQL plugin, the comparison is done in as_mysql_jobacct_process.c:949:
> 
>                 job_bitmap = bit_alloc(hostlist_count((*curr_cluster)->hl));
>                 bit_unfmt(job_bitmap, node_inx);
>                 if (!bit_overlap((*curr_cluster)->asked_bitmap, job_bitmap)) {
>                         FREE_NULL_BITMAP(job_bitmap);
>                         return 0;
>                 }
> 
> However, asked_bitmap is built with the cluster node list as the reference,
> while node_inx is built with the node_record_table_ptr reference (see for
> example accounting_storage_slurmdbd.c:178). The order of the nodes in
> node_record_table_ptr depends on the order of their description in
> slurm.conf, but the order in the cluster node list always corresponds to the
> numerically sorted list of the cluster's nodes. As a result, the two
> references do not match when the order of the nodes described in slurm.conf
> does not follow numerical order, as in our case:
> 
> [root@lascaux0 slurm-2.2.7] # grep NodeName= /etc/slurm/slurm.conf | egrep "9000|5000"
> NodeName=lascaux[9000-9199] NodeAddr=lascaux[9000-9199] Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=128900 Weight=30 State=UNKNOWN
> NodeName=lascaux[5000-5197] NodeAddr=lascaux[5000-5197] Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=23200 Weight=40 State=UNKNOWN
> [root@lascaux0 slurm-2.2.7] #
> 
> 
> A solution would be to ensure that the two references are the same, so
> node_inx should be built using a sorted nodelist reference, not the internal
> bitmap of slurmctld. However, as currently stored entries have a wrong
> node_inx value, the correct one would have to be recomputed to ensure that
> everything works again.
> 
> As a workaround on our clusters, to give us access to our accounting history
> with working node-subset functionality, I have made a patch (enclosed) that
> no longer uses node_inx in the comparison but uses the job's node list
> directly instead. It is certainly a bit less optimized than using the index,
> but it avoids the coherency problem described above without having to
> regenerate every node_inx entry already stored in the DB. FYI, the
> PostgreSQL part of the patch is not tested/validated, but it should have the
> same problem without the workaround.
> 
> I will let you judge how best to manage this problem: switching to a
> certainly less optimized algorithm like the one in the enclosed patch, or
> modifying the node_inx generation to be coherent and adding a hook to ensure
> recomputation of all the already stored node_inx entries of jobs and steps.
> I do not have figures to compare the two algorithms and their respective
> performance; if you do not have a clear preference either, I could try to
> compare them to help with the choice.
> 
> 
> Regards,
> Matthieu
