Hi all, I'm excited about a tool I developed for the Open Force Field Initiative and thought I'd share a bit about it here.
It's called "off_coverage", currently in a pull request at https://github.com/openforcefield/cheminformatics-toolkit-equivalence and also available from my Sourcehut repo at https://hg.sr.ht/~dalke/off_coverage .

The overall goal was to improve test case selection. Quite a few of the tests use an SDF test set with 371 records. These were created to test atom type assignment, then re-purposed as general-purpose test cases. At first this was fine, but now the unit tests take a long time to run. It's also likely that some of the records add no additional testing strength. For example, several of them have valences that RDKit does not accept, and it doesn't appear useful to have multiple test cases for that.

My solution casts this as a feature space problem. There are any number of possible features: can RDKit parse the structure? Can the OpenFF toolkit convert the structure into its internal molecule representation? Can it convert the structure back to an RDKit molecule and generate a SMILES? Does the resulting SMILES match the original? These might be expressed as a mapping from id to a list of features:

  id123 parse_ok to_openff_ok from_openff_ok
  id456 parse_ok to_openff_ok from_openff_err
  id789 parse_err
  id999 parse_ok to_openff_ok from_openff_ok

Assuming all useful features can be detected, test set minimization reduces to the set cover problem - https://en.wikipedia.org/wiki/Set_cover_problem . This can be solved with off-the-shelf tools. I used Z3, https://github.com/Z3Prover/z3 , which solves it by finding a solution to a Boolean-and of a set of Boolean-ors:

  (and (or id123 id456 id999) ; parse_ok
       id789                  ; parse_err
       (or id123 id456 id999) ; to_openff_ok
       (or id123 id999)       ; from_openff_ok
       id456)                 ; from_openff_err

while minimizing sum(id123 + id456 + id789 + id999). In this case id789 and id456 must be selected, as well as one of id123 or id999.
This general approach could be used for all sorts of things, like finding a reduced set of fingerprints whose union still sets all of the bits set by a larger set of fingerprints, selecting a subset of structures which contains all of the atom types in the larger structure, etc. (It could also be modified to require at least 2 representatives per feature, etc.)

The novel part (to me) is that code coverage - that is, the lines of code that Python executes for a given test - can also be used to generate features. This information is available via Python's sys.settrace() hook. For example, you might track the sequence of [(module name, line number), ...] for every function call. If two tests have the same sequence, then they execute exactly the same code. This is a very strict definition of equivalence. Or, you might track the set of lines executed for each module, and compare those sets. This is more like comparing the overall code coverage for a module. Or, you might track things at the function call level, rather than the module level, in case a test calls a function multiple times, each time possibly executing a different code path. Or you might look at the pairs of (previous line number, current line number), which captures some of the branching behavior.

I implemented these variations. Assuming I did it correctly, the minimal subset can be as small as 27 records, out of 371!

• mod-cover selected: 72/371
• mod-sequence selected: 342/371
• mod-pairs selected: 72/371
• func-cover selected: 31/371
• func-sequence selected: 336/371
• func-pairs selected: 27/371

In addition, the tool lets you specify an alternative weighting scheme. The default is that each entry has a weight of 1, but you might prefer to minimize the number of atoms, or minimize the number of features in each entry, or to prefer entries from one data set over another. I designed the tool to be usable even outside of an OpenFF installation.
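To make the coverage-as-features idea concrete, here is a minimal sketch (hypothetical helper names, not the actual off_coverage tracer) of collecting per-module line coverage with sys.settrace() - the "mod-cover" style of feature:

```python
# Sketch: collect {module name: set of executed line numbers} for one call,
# using the sys.settrace() hook described above.
import sys
from collections import defaultdict

def collect_coverage(func, *args, **kwargs):
    """Run func and return a dict mapping module name -> executed lines."""
    lines = defaultdict(set)

    def tracer(frame, event, arg):
        if event == "line":
            module = frame.f_globals.get("__name__", "?")
            lines[module].add(frame.f_lineno)
        return tracer  # keep tracing inside nested calls

    old = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        sys.settrace(old)
    return dict(lines)

def classify(x):
    if x > 0:
        return "pos"
    return "nonpos"

cov_a = collect_coverage(classify, 1)
cov_b = collect_coverage(classify, -1)
# The two calls take different branches, so their line sets differ,
# and they would map to different "mod-cover" features.
print(cov_a == cov_b)  # False
```

Tracking (previous line, current line) pairs or per-function-call sequences instead of per-module sets gives the other feature flavors listed above.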
If you're only interested in the selection part, the "minimize" subcommand takes a line-oriented text file with feature information (id, feature labels, and feature weights) as input, and produces a list of selected ids as output. It depends only on the "z3-solver" Python package from PyPI.

I hope people find the set cover tool and/or the idea of coverage-directed test set minimization interesting. And perhaps someone is interested enough in the latter to check whether I implemented my settrace handlers correctly? ;)

Best regards,

Andrew
da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss