Hey Matthieu,

I just fixed 2.3 with this patch, which corrects the bitmap logic going forward for jobs on clusters whose nodes aren't defined in numerical order:

https://github.com/SchedMD/slurm/commit/183888a0dbf1036a31e1764147ee63a5675a4a3d
(download patch - https://github.com/SchedMD/slurm/commit/183888a0dbf1036a31e1764147ee63a5675a4a3d.patch)
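To make the underlying issue concrete, here is a minimal standalone sketch (illustration only, not Slurm code, and the node names are made up): a node index recorded against one ordering decodes to a different node when it is read back against another ordering.

/* Illustration only (not Slurm code): the same node_inx bit points at
 * different nodes depending on which node ordering it is decoded against. */
#include <stdio.h>

int main(void)
{
        /* Order the index was written with (slurm.conf / slurmctld order). */
        const char *conf_order[]   = { "snowflake001", "snowflake002",
                                       "snowflake000", "snowflake003" };
        /* Order the accounting query used to rebuild its reference
         * (numerically sorted cluster_nodes, before this fix). */
        const char *sorted_order[] = { "snowflake000", "snowflake001",
                                       "snowflake002", "snowflake003" };

        int bit = 2;    /* one bit taken from a stored node_inx */

        printf("bit %d was written for %s but is read back as %s\n",
               bit, conf_order[bit], sorted_order[bit]);
        return 0;
}

Compiled and run, it reports that bit 2 was written for snowflake000 but is read back as snowflake002, which is the kind of mismatch the patch avoids by keeping the stored node order unsorted.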
This fixes the problem going forward, and it can probably be applied cleanly to 2.2 systems as well, but it doesn't fix the problem for past system configurations. If your system doesn't change node order very often, you can manually update the database to fix past jobs, though. As an example of how to do this: I have a system, snowflake, that was defined as snowflake[000-048], but suppose the order was really snowflake[001-006,000,007-048]. In the database the order was always represented as snowflake[000-048]. With this fix the order is kept unsorted and the bitmaps are now correct. If you want to update old entries, you can go into the database and replace the 'cluster_nodes' value in the ($cluster_name)_event_table with the correct ordering, which will fix past jobs as well. In my case, if the system had always been snowflake[001-006,000,007-048] instead of snowflake[000-048], I would do this...

mysql> select * from snowflake_event_table where node_name='';
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+
| time_start | time_end   | node_name | cluster_nodes                  | cpu_count | reason                  | reason_uid | state |
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+
| 1311724103 | 1317314431 |           | snowflake[000-048]             |       430 | Cluster processor count | 4294967294 |     0 |
| 1317314431 |          0 |           | snowflake[001-006,000,007-048] |       430 | Cluster processor count | 4294967294 |     0 |
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+

mysql> update snowflake_event_table set cluster_nodes='snowflake[001-006,000,007-048]' where cluster_nodes='snowflake[000-048]' and node_name='';

mysql> select * from snowflake_event_table where node_name='';
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+
| time_start | time_end   | node_name | cluster_nodes                  | cpu_count | reason                  | reason_uid | state |
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+
| 1311724103 | 1317314431 |           | snowflake[001-006,000,007-048] |       430 | Cluster processor count | 4294967294 |     0 |
| 1317314431 |          0 |           | snowflake[001-006,000,007-048] |       430 | Cluster processor count | 4294967294 |     0 |
+------------+------------+-----------+--------------------------------+-----------+-------------------------+------------+-------+

I understand this won't be easy if your system changes shape often, but hopefully for most cases it won't be too troublesome. Your patch should work in all cases as well, but as you are probably aware it adds a great deal of overhead. Thanks for reporting the problem; this patch will be in 2.3.1.

Danny

On Friday September 16 2011 9:37:46 AM Matthieu Hautreux wrote:
> Hi,
>
> we are experiencing a problem with the behavior of the sacct command when
> only jobs running on a subset of nodes are requested. In some situations, we
> are getting results corresponding to a different subset of nodes.
>
> The running configuration is
>
> [root@lascaux0 ~] # grep StorageType /etc/slurm/slurm.conf
> AccountingStorageType=accounting_storage/slurmdbd
> [root@lascaux0 ~] # grep StorageType /etc/slurm/slurmdbd.conf
> StorageType=accounting_storage/mysql
> [root@lascaux0 ~] #
>
> Example of the problem:
>
> [root@lascaux0 ~] # sacct -Xno nodelist%100 -N lascaux9014
> lascaux5011
> lascaux[5001,5003,5011-5012,5014,5039,5050,5093,5117-5118,5123,5125,5134,5162,5165]
> [root@lascaux0 ~] #
>
> [root@lascaux0 ~] # sacct -Xno nodelist%100 -N lascaux5011
> lascaux[9011-9012]
> lascaux[9011-9013]
> lascaux[9011-9013]
> lascaux[9011-9013]
> lascaux[9004-9005,9011-9012]
> lascaux[9004-9005,9011-9012]
> lascaux[9004-9005,9011-9012]
> [root@lascaux0 ~] #
>
> This problem seems to be due to the fact that the engine that retrieves the
> jobs matching the nodelist criteria compares 2 bitmaps that are not built
> against the same reference.
>
> For the MySQL plugin, the comparison is done in as_mysql_jobacct_process.c:949:
>
> job_bitmap = bit_alloc(hostlist_count((*curr_cluster)->hl));
> bit_unfmt(job_bitmap, node_inx);
> if (!bit_overlap((*curr_cluster)->asked_bitmap, job_bitmap)) {
>         FREE_NULL_BITMAP(job_bitmap);
>         return 0;
> }
>
> However, asked_bitmap is built with the cluster node list as the reference
> and node_inx is built with the node_record_table_ptr reference (for example
> in accounting_storage_slurmdbd.c:178). The order of the nodes in
> node_record_table_ptr depends on the order of their description in
> slurm.conf, whereas the order in the cluster node list always corresponds to
> the numerically sorted list of nodes of the cluster. As a result, the 2
> references do not match when the order of the nodes described in slurm.conf
> does not follow numerical order, as in our case:
>
> [root@lascaux0 slurm-2.2.7] # grep NodeName= /etc/slurm/slurm.conf | egrep "9000|5000"
> NodeName=lascaux[9000-9199] NodeAddr=lascaux[9000-9199] Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=128900 Weight=30 State=UNKNOWN
> NodeName=lascaux[5000-5197] NodeAddr=lascaux[5000-5197] Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=23200 Weight=40 State=UNKNOWN
> [root@lascaux0 slurm-2.2.7] #
>
> A solution would be to ensure that the 2 references are the same, i.e.
> node_inx should be built using a sorted nodelist reference, not the internal
> bitmap of slurmctld. However, as currently stored entries have a wrong
> node_inx value, the correct one would have to be recomputed to make
> everything consistent again.
>
> As a workaround on our clusters, to let us access our accounting history
> with a working node-subset functionality, I have made a patch (enclosed)
> that no longer uses node_inx in the comparison but uses the job's node list
> directly instead. It is certainly a bit less optimized than using the index,
> but it avoids the coherency problem described above without having to
> regenerate every node_inx entry already stored in the DB. FYI, the
> postgresql part of the patch is not tested/validated but should have the
> same problem without the workaround.
>
> I will let you judge how best to manage this problem: switching to a
> certainly less optimized algorithm like the one in the enclosed patch, or
> modifying the node_inx generation to be coherent and adding a hook to ensure
> recomputation of all the already stored node_inx entries of jobs and steps.
> I do not have figures to compare the 2 algorithms and their respective
> performance; if you do not have a clear preference either, I could try to
> compare them to help with the choice.
>
> Regards,
> Matthieu
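For anyone following the thread, the nodelist-based comparison Matthieu describes amounts to checking whether the job's node names and the requested node names share at least one entry, instead of overlapping two index bitmaps. A simplified standalone sketch of that idea (plain C, made-up node names; this is only an illustration, not the enclosed patch):

/* Sketch of the ordering-independent comparison: look for any node name
 * common to the job's node list and the requested node list.
 * Illustration of the idea only, not the enclosed patch. */
#include <stdio.h>
#include <string.h>

static int nodelists_overlap(const char **job_nodes, int job_cnt,
                             const char **asked_nodes, int asked_cnt)
{
        int i, j;

        for (i = 0; i < job_cnt; i++)
                for (j = 0; j < asked_cnt; j++)
                        if (!strcmp(job_nodes[i], asked_nodes[j]))
                                return 1;       /* at least one node in common */
        return 0;
}

int main(void)
{
        const char *job[]   = { "lascaux5011", "lascaux5012", "lascaux5014" };
        const char *asked[] = { "lascaux5011" };

        printf("overlap: %d\n", nodelists_overlap(job, 3, asked, 1));
        return 0;
}

Because the check is on names rather than on bit positions, it is independent of the order in which either list was built, at the cost of comparing node names for every job considered, which is the extra overhead mentioned above.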