On Apr 21, 2018, at 01:55, Andrew Dalke wrote:
> Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass
> the screening test. If not, the structure contains no carbons, so repeat with
> other elements until you find an atom which passes. Successively either add
> an atom+bond or connect two existing atoms with a bond, fingerprint the
> result, and do the screening test. If it does not pass then that modification
> was not permitted. Use a breadth-first search which prioritizes branching and
> rings to avoid chains longer than the maximum enumeration size.
Here's an implementation of that sketch, applied to the RDKit hash fingerprint:
http://dalkescientific.com/rev_eng_fp.py
It works well for small structures:
% python rev_env_fp.py
No SMILES given. Using caffeine.
Current best guess is C=C with 2 bits of 759
Current best guess is Cc=O with 6 bits of 759
Found! Cn1c(=O)c2c(ncn2C)n(C)c1=O
Here's aspirin:
% python rev_env_fp.py 'O=C(C)Oc1c1C(=O)O'
Found! CC(=O)Oc1c1C(=O)O
Capsicum is close, only missing a methyl in the tail.
% python rev_env_fp.py 'O=C(NCc1cc(OC)c(O)cc1)C=CC(C)C'
Current best guess is CNC(=O)C=CC(C)C with 100 bits of 384
Current best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
Best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
For omeprazole it only finds half of the structure:
% python rev_env_fp.py 'COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C'
Current best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
Best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
For estradiol it gets stuck finding another cyclohexane instead of the
cyclopentane:
% python rev_env_fp.py 'C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1CC[C@@H]2O)O'
Current best guess is CC12CCC(O)C21C with 163 bits of 583
Current best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
Best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
Note: it's currently set up to only consider the elements
["C", "c", "O", "o", "N", "n", "S", "s", "F", "Cl", "Br"]
Edit the 'elements' list of you want to include more possibilities. This is
more likely to run into a dead-end.
The current code assumes that when I grow by one atom, if fp(mol + new atom) is
a subset of the target fingerprint, then mol + new_atom is a subgraph of the
target structure.
This can be resolved by setting up a search tree, but then it needs to be more
careful about backtracking and pruning, and that's too much work for an evening
of programming.
Cheers,
Andrew
da...@dalkescientific.com
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss