Hi all, I'm excited about a tool I developed for the Open Force Field Initiative and thought I'd share a bit about it here.
It's called "off_coverage", currently in a pull request at https://github.com/openforcefield/cheminformatics-toolkit-equivalence and also available from my Sourcehut repo at https://hg.sr.ht/~dalke/off_coverage .

The overall goal was to improve test case selection. Quite a few of the tests use an SDF test set with 371 records. These were created to test atom type assignment, then re-purposed as general-purpose test cases. At first this was fine, but now the unit tests take a long time to run. It's also likely that some of the records add no additional testing strength. For example, several of them have valences that RDKit does not accept, and it doesn't appear useful to have multiple test cases for that.

My solution casts this as a feature space problem. There are any number of possible features: can RDKit parse the structure? Can the OpenFF toolkit convert the structure into its internal molecule representation? Can it convert the structure back to an RDKit molecule and generate a SMILES? Does the resulting SMILES match the original? These might be expressed as a mapping from id to a list of features:

  id123 parse_ok to_openff_ok from_openff_ok
  id456 parse_ok to_openff_ok from_openff_err
  id789 parse_err
  id999 parse_ok to_openff_ok from_openff_ok

Assuming all useful features can be detected, test set minimization reduces to the set cover problem - https://en.wikipedia.org/wiki/Set_cover_problem . This can be solved with off-the-shelf tools. I used Z3, https://github.com/Z3Prover/z3 , which solves it by finding a solution to a Boolean-and of a set of Boolean-ors:

  (and (or id123 id456 id999) ; parse_ok
       id789                  ; parse_err
       (or id123 id456 id999) ; to_openff_ok
       (or id123 id999)       ; from_openff_ok
       id456)                 ; from_openff_err

while minimizing sum(id123 + id456 + id789 + id999). In this case id789 and id456 must be selected, as well as one of id123 or id999.
This general approach could be used for all sorts of things, like finding a reduced set of fingerprints whose union still sets all of the bits set by a larger set of fingerprints, selecting a subset of structures which contains all of the atom types in the larger structure, etc. (It could also be modified to require at least 2 representatives per feature, etc.)

The novel part (to me) is that code coverage - that is, the lines of code that Python executes for a given test - can also be used to generate features. This information is available via Python's sys.settrace() hook. For example, you might track the sequence of [(module name, line number), ...] for every function call. If two tests have the same sequence, then they execute exactly the same code. This is a very strict definition of equivalence. Or, you might track the set of lines executed for each module, and compare those sets. This is more like comparing the overall code coverage for a module. Or, you might track things at the function call level, rather than the module level, in case a test calls a function multiple times, each time possibly executing a different code path. Or you might look at the pairs of (previous line number, current line number), which captures some of the branching behavior.

I implemented these variations. Assuming I did it correctly, the minimal subset can be as small as 27 records, out of 371!

• mod-cover selected: 72/371
• mod-sequence selected: 342/371
• mod-pairs selected: 72/371
• func-cover selected: 31/371
• func-sequence selected: 336/371
• func-pairs selected: 27/371

In addition, the tool lets you specify an alternative weighting scheme. The default is that each entry has a weight of 1, but you might prefer to minimize the number of atoms, or minimize the number of features in each entry, or to prefer entries from one data set over another. I designed the tool to be usable even outside of an OpenFF installation.
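To make the coverage-as-features idea concrete, here is a minimal sketch (hypothetical helper names, not the actual off_coverage tracer) of collecting per-module line coverage with sys.settrace() - the "mod-cover" style of feature:

```python
# Sketch: collect {module name: set of executed line numbers} for one call,
# using the sys.settrace() hook described above.
import sys
from collections import defaultdict

def collect_coverage(func, *args, **kwargs):
    """Run func and return a dict mapping module name -> executed lines."""
    lines = defaultdict(set)

    def tracer(frame, event, arg):
        if event == "line":
            module = frame.f_globals.get("__name__", "?")
            lines[module].add(frame.f_lineno)
        return tracer  # keep tracing inside nested calls

    old = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        sys.settrace(old)
    return dict(lines)

def classify(x):
    if x > 0:
        return "pos"
    return "nonpos"

cov_a = collect_coverage(classify, 1)
cov_b = collect_coverage(classify, -1)
# The two calls take different branches, so their line sets differ,
# and they would map to different "mod-cover" features.
print(cov_a == cov_b)  # False
```

Tracking (previous line, current line) pairs or per-function-call sequences instead of per-module sets gives the other feature flavors listed above.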
If you're only interested in the selection part, the "minimize" subcommand takes a line-oriented text file with feature information (id, feature labels, and feature weights) as input, and produces a list of selected ids as output. It depends only on the "z3-solver" Python package from PyPI.

I hope people find the set cover tool and/or the idea of coverage-directed test set minimization interesting. And perhaps someone is interested enough in the latter to check whether I implemented my settrace handlers correctly? ;)

Best regards,

Andrew
da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss