Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-21 Thread Nils Weskamp
Am 22.04.2018 um 03:04 schrieb Andrew Dalke:
> Here's an implementation of that sketch, applied to the RDKit hash 
> fingerprint:

Nice work. If brute-force approaches like this (or methods based on
genetic algorithms etc.) are the only way to reverse a fingerprint, one
could probably come up with a fingerprint that allows for pretty secure
structure sharing by using many iterations of a strong cryptographic
hash function that is really slow to calculate.

Nils

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-21 Thread Andrew Dalke
On Apr 21, 2018, at 01:55, Andrew Dalke  wrote:
> Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass 
> the screening test. If not, the structure contains no carbons, so repeat with 
> other elements until you find an atom which passes. Successively either add 
> an atom+bond or connect two existing atoms with a bond, fingerprint the 
> result, and do the screening test. If it does not pass then that modification 
> was not permitted. Use a breadth-first search which prioritizes branching and 
> rings to avoid chains longer than the maximum enumeration size.

Here's an implementation of that sketch, applied to the RDKit hash fingerprint:

  http://dalkescientific.com/rev_eng_fp.py

It works well for small structures:

% python rev_env_fp.py
No SMILES given. Using caffeine.
Current best guess is C=C with 2 bits of 759
Current best guess is Cc=O with 6 bits of 759
Found! Cn1c(=O)c2c(ncn2C)n(C)c1=O

Here's aspirin:

% python rev_env_fp.py 'O=C(C)Oc1c1C(=O)O'
Found! CC(=O)Oc1c1C(=O)O

Capsicum is close, only missing a methyl in the tail.

% python rev_env_fp.py 'O=C(NCc1cc(OC)c(O)cc1)C=CC(C)C'
Current best guess is CNC(=O)C=CC(C)C with 100 bits of 384
Current best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
Best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384


For omeprazole it only finds half of the structure:

% python rev_env_fp.py 'COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C'
Current best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
Best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863

For estradiol it gets stuck finding another cyclohexane instead of the 
cyclopentane:

% python rev_env_fp.py 'C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1CC[C@@H]2O)O'
Current best guess is CC12CCC(O)C21C with 163 bits of 583
Current best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
Best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583


Note: it's currently set up to only consider the elements
  ["C", "c", "O", "o", "N", "n", "S", "s", "F", "Cl", "Br"]

Edit the 'elements' list of you want to include more possibilities. This is 
more likely to run into a dead-end.


The current code assumes that when I grow by one atom, if fp(mol + new atom) is 
a subset of the target fingerprint, then mol + new_atom is a subgraph of the 
target structure.

This can be resolved by setting up a search tree, but then it needs to be more 
careful about backtracking and pruning, and that's too much work for an evening 
of programming.

Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss