Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures? --> We have just published a preprint to this!

2020-05-15 Thread Bennion, Brian via Rdkit-discuss
Thank you for the link. I will look at it!


___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures? --> We have just published a preprint to this!

2020-05-14 Thread Tuan Le
Hi Brian,

I was working on a study to deduce molecular structures given ECFP fingerprints 
and came across your open question on the rdkit mailing-list 
(https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg07851.html).

I really enjoyed reading the discussion on the mailing list, which also 
presented references to related work (thanks to Nils!).

We have just published a preprint on a study to reverse-engineer molecular 
structures given ECFP descriptors: 
https://chemrxiv.org/articles/Neuraldecipher_-_Reverse-Engineering_ECFP_Fingerprints_to_Their_Molecular_Structures/12286727.

Our learning approach maps ECFPs to latent molecular descriptors which are then 
decoded back to SMILES representation.

I don’t know how to directly respond to the open thread on sourceforge, so I 
hope sending this email suffices.

(Mail is also sent to rdkit-mailinglist, Nils Weskamp and Andrew Dalke).

Best regards,

Tuan

Tuan Le
Ph.D Student Research Scientist
Machine Learning Research

Bayer AG
Research & Development, Pharmaceuticals
Machine Learning Research
Müllerstr. 178, Building S110/702
13353 Berlin, Germany



Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-23 Thread Nathan Brown
Hi All,

I remember reading Dave’s patent when I was writing my CoG program around 2002. 
I had to write my own fingerprinting algorithm, based on Daylight fingerprints, 
called Fingal, which I could use to decode the fingerprints back into the 
structure(s) represented by it. I also had to convince Johnny Gasteiger that 
molecular fingerprints were useful.

https://pubs.acs.org/doi/abs/10.1021/ci034290p

I found it only worked on count fingerprints, since the binary one lost too 
much information. Never tried it on ECFP-like fingerprints. Still have the code 
somewhere…

Cheers, Nath
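
The point about binary fingerprints losing too much information can be illustrated with a small sketch. This is not Fingal or CoG: the toy fingerprint below (single atoms and bonded pairs of a linear chain, one character per atom) and the function names are assumptions made only for illustration.

```python
from collections import Counter


def count_fingerprint(chain: str) -> Counter:
    """Count single atoms and adjacent atom pairs in a linear chain.

    Toy assumption: one character per atom, e.g. "CCO".
    """
    counts = Counter(chain)                    # single atoms
    for a, b in zip(chain, chain[1:]):         # bonded pairs
        counts["".join(sorted((a, b)))] += 1   # orientation-independent key
    return counts


def binary_fingerprint(chain: str) -> frozenset:
    """Keep only which features occur, not how often they occur."""
    return frozenset(count_fingerprint(chain))


ethanol, propanol = "CCO", "CCCO"
print(count_fingerprint(ethanol))
print(count_fingerprint(propanol))
# The two chains share the same feature set, so the binary forms are equal.
print(binary_fingerprint(ethanol) == binary_fingerprint(propanol))
```

In this toy setting the count form still separates the two chains, while the binary form cannot.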





Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-23 Thread David Cosgrove
Hi all,
I’ve just had the attached from Roger Sayle, which might be of interest.
Dave

Hi Andrew and Dave,

John (Mayfield) has just pointed me at the very interesting discussion
raging on sourceforge. Alas, I've no idea how to post/tweet/snapchat a
reply, but thought I'd at least contribute this small nugget of trivia:
US5434796A
https://patents.google.com/patent/US5434796A
where Dave W. managed to patent any application of genetic algorithms to
optimizing an objective function of a molecular structure [which is
optimistically broad, and in its day may even have covered Andrew's
rev_eng_fp.py, which is very impressive by the way.]  This is probably
the closest thing to a publication around Dave's work with GA's that was
mentioned on the thread (I think he only authored four or five papers in
his lifetime).

Best wishes to the both of you.  Keep up the great discussion.

Cheers,
Roger
--
Roger Sayle, PhD.
CEO and founder
NextMove Software Limited
Registered in England No. 07588305
Registered Office: Innovation Centre (Unit 23), Cambridge Science Park,
Cambridge, CB4 0EY




Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-23 Thread Brice Hoffmann
Hi,
Another option is to use generative models that use fingerprints as input
(e.g. https://arxiv.org/abs/1701.01329,
https://pubs.acs.org/doi/10.1021/acs.molpharmaceut.7b00346). If you score
the generated molecules by their Tanimoto distance to a given fingerprint,
you can often retrieve the original compound.
At Iktos we develop such methods and they work pretty well!
Best regards,
Brice
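
As a rough illustration of the scoring idea above, here is a minimal sketch of Tanimoto scoring over fingerprints stored as sets of on-bit positions. It is not the Iktos method or any generative model; the candidate names and bit positions are made up, and the value returned for two empty fingerprints is simply a convention chosen here.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def best_candidate(target_fp: frozenset, candidates: dict):
    """Return the (name, fingerprint) pair with the smallest Tanimoto distance."""
    return min(
        candidates.items(),
        key=lambda item: 1.0 - tanimoto(target_fp, item[1]),
    )


# Hypothetical target fingerprint and hypothetical generated candidates.
target = frozenset({1, 5, 9, 12})
generated = {
    "candidate_a": frozenset({1, 5, 7}),
    "candidate_b": frozenset({1, 5, 9, 12}),
    "candidate_c": frozenset({2, 9, 12, 30}),
}

name, fp = best_candidate(target, generated)
print(name, tanimoto(target, fp))
```

A distance of zero only says that the fingerprints are identical, which does not by itself prove that the structures are.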




Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-23 Thread Andrew Dalke
On Apr 23, 2018, at 14:54, Brian Cole  wrote:
> Unfortunately it doesn't work on circular/ECFP-like fingerprints.

To be fair, you didn't mention that was a requirement. ;)

> It has the requirement that the fingerprint be a substructure fingerprint as 
> you described.

Could you elaborate on your goal?

I used RDKitFingerprint because it was the easiest. It was something I could do 
in a day to demonstrate that it is possible to reverse engineer some 
fingerprints.

I think it's possible to do something similar for circular fingerprints. It 
would mean generating all possible subgraphs of a given radius, which is doable 
for r=2 or r=3, and probably r=4. RDKit has a way to look at the circular 
environment around a specific atom rather than the entire fingerprint, so 
that can be used to generate a seed point. Once that's found, it can be 
expanded to one of its neighbor atoms.

Another problem is that the Morgan fingerprint algorithm really wants sanitized 
structures, which I didn't need to worry about for the hash fingerprints.

Instead of a day of work, it's going to take a couple of weeks of work, which 
requires time and money.

My advice though is that it's surely possible to determine some structure 
information from the circular fingerprint. If your use case says there should 
be no information leak (other than what's possible by full 
brute-force-enumeration) then don't exchange fingerprints.

But leaking information is not really what I thought of by "reverse engineer".

For example, if I want to check if any of the Morgan fingerprints with r=2 
contain a phenol, I can ask RDKit to generate the fingerprint for r=2 using 
just the c(O)c as the fromAtoms. This gives:


% echo '*c1ccc(O)cc1 phenol' | rdkit2fps --from-atoms 3,4,5,6 --morgan
#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1 fromAtoms=3,4,5,6
#software=RDKit/2016.09.3 chemfp/3.2.1
#date=2018-04-23T14:03:24
00028200140044000200
phenol


I can then screen using that fingerprint to see which fingerprints match.

Of the first 100,000 structures in ChEMBL, 2216 contain phenol, all of them are 
detected by this screen, and there are no false positives.

Poof - structural information leakage.
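
The screening step described above boils down to a bitwise containment test. The following is a minimal sketch in plain Python, not the chemfp implementation; the helper names, molecule names and bit positions are hypothetical.

```python
def to_fp(on_bits) -> int:
    """Pack a collection of on-bit positions into an integer bit vector."""
    fp = 0
    for bit in on_bits:
        fp |= 1 << bit
    return fp


def contains_fp(query_fp: int, target_fp: int) -> bool:
    """True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


# Hypothetical query (the bits of the substructure of interest) and library.
query = to_fp([3, 17, 42])
library = {
    "mol_a": to_fp([3, 17, 42, 99]),
    "mol_b": to_fp([3, 42, 100]),
    "mol_c": to_fp([1, 3, 17, 42]),
}

hits = sorted(name for name, fp in library.items() if contains_fp(query, fp))
print(hits)
```

In general a screen like this can produce false positives when unrelated features set the same bits, so the absence of false positives reported above is an observation about that data set rather than a guarantee.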

The code is at the bottom of this email. It depends on the commercial version 
of chemfp.


> It seems the evolutionary/genetic algorithm approach is the current 
> state-of-the-art for decoding circular/ECFP-like fingerprints.

Dave Cosgrove mentioned Dave Weininger's GA work, which means it was with 
Daylight hash fingerprints. I don't think we know that GAs have ever been used 
to reverse engineer circular fingerprints.


> Historical question for you since you're the closest we have to a 
> chem-informatician historian. :-) Why did these circular/ECFP fingerprints 
> come into existence?

I believe you are asking for https://pubs.acs.org/doi/abs/10.1021/ci100050t .

  Extended-connectivity fingerprints (ECFPs) are a novel class of topological
  fingerprints for molecular characterization. Historically, topological
  fingerprints were developed for substructure and similarity searching.
  ECFPs were developed specifically for structure−activity modeling.

> my reading of the current literature is that tree/dendritic are statistically 
> just as good at virtual screening as circular/ECFP: 

Yeah, I don't go there. I leave concepts like "just as good" or "better" to 
people who have experimental data they can use for the comparison.


Andrew
da...@dalkescientific.com

== Code to find which Morgan fingerprints contain a phenol substructure ==

import chemfp
from chemfp import bitops, search

arena = chemfp.load_fingerprints("chembl_23_morgan.fps", reorder=False)
print("Fingerprint type:", arena.metadata.type)

# Want to find structures containing phenol

# Adjust the fingerprint type to limit it to the given atoms
fptype = chemfp.get_fingerprint_type(arena.metadata.type + " fromAtoms=3,4,5,6")
query_fp = fptype.parse_molecule_fingerprint("*c1ccc(O)cc1", "smi")

print("Query fingerprint:")
print(bitops.hex_encode(query_fp))
print()

# Find the matching fingerprints
result = search.contains_fp(query_fp, arena)

circular_ids = set(result.get_ids())

# Search the first 100,000 structures
from rdkit import Chem
from chemfp import rdkit_toolkit as T

pat = Chem.MolFromSmarts("*c1ccc(O)cc1")
all_ids = set()
exact_ids = set()
with T.read_molecules(

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-23 Thread Brian Cole
Thanks Andrew, very interesting and useful script!

Unfortunately it doesn't work on circular/ECFP-like fingerprints. It has
the requirement that the fingerprint be a substructure fingerprint as you
described. It seems the evolutionary/genetic algorithm approach is the
current state-of-the-art for decoding circular/ECFP-like fingerprints.

Historical question for you since you're the closest we have to a
chem-informatician historian. :-) Why did these circular/ECFP fingerprints
come into existence? They lose the substructure screening property,
property #2 in the 3 properties you listed: identity, subgraph, similarity.
So they generally seem less powerful. (Good description of why that is the
case here:
https://nextmovesoftware.com/blog/2015/02/16/for-every-fingerprint-optimisation-there-is-an-equal-and-opposite-fingerprint-deterioration/
)

I suppose the argument could be made that circular/ECFP are more powerful
for the similarity property, i.e., virtual-screening. But my reading of the
current literature is that tree/dendritic are statistically just as good at
virtual screening as circular/ECFP:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3686626/
https://pubs.acs.org/doi/abs/10.1021/ci100062n

The corollary to this question is curiosity about why all the machine learning
research seems to be done with circular/ECFP fingerprints. You would think
the substructure screening property would make the tree/dendritic ones more
information rich?

Big thanks to everyone here, this has definitely been a very fruitful
discussion for me, hope it is for everyone.

Cheers,
Brian



On Sat, Apr 21, 2018 at 9:04 PM, Andrew Dalke 
wrote:

> On Apr 21, 2018, at 01:55, Andrew Dalke  wrote:
> > Hand-waving sketch: start with a carbon. Generate fingerprint. It should
> pass the screening test. If not, the structure contains no carbons, so
> repeat with other elements until you find an atom which passes.
> Successively either add an atom+bond or connect two existing atoms with a
> bond, fingerprint the result, and do the screening test. If it does not
> pass then that modification was not permitted. Use a breadth-first search
> which prioritizes branching and rings to avoid chains longer than the
> maximum enumeration size.
>
> Here's an implementation of that sketch, applied to the RDKit hash
> fingerprint:
>
>   http://dalkescientific.com/rev_eng_fp.py
>
> It works well for small structures:
>
> % python rev_eng_fp.py
> No SMILES given. Using caffeine.
> Current best guess is C=C with 2 bits of 759
> Current best guess is Cc=O with 6 bits of 759
> Found! Cn1c(=O)c2c(ncn2C)n(C)c1=O
>
> Here's aspirin:
>
> % python rev_eng_fp.py 'O=C(C)Oc1ccccc1C(=O)O'
> Found! CC(=O)Oc1ccccc1C(=O)O
>
> Capsaicin is close, only missing a methyl in the tail.
>
> % python rev_eng_fp.py 'O=C(NCc1cc(OC)c(O)cc1)C=CC(C)C'
> Current best guess is CNC(=O)C=CC(C)C with 100 bits of 384
> Current best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
> Best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
>
>
> For omeprazole it only finds half of the structure:
>
> % python rev_eng_fp.py 'COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C'
> Current best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
> Best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
>
> For estradiol it gets stuck finding another cyclohexane instead of the
> cyclopentane:
>
> % python rev_eng_fp.py 'C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1CC[C@@H]2O)O'
> Current best guess is CC12CCC(O)C21C with 163 bits of 583
> Current best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
> Best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
>
>
> Note: it's currently set up to only consider the elements
>   ["C", "c", "O", "o", "N", "n", "S", "s", "F", "Cl", "Br"]
>
> Edit the 'elements' list if you want to include more possibilities. This
> is more likely to run into a dead-end.
>
>
> The current code assumes that when I grow by one atom, if fp(mol + new_atom)
> is a subset of the target fingerprint, then mol + new_atom is a subgraph of
> the target structure. That assumption does not always hold.
>
> This can be resolved by setting up a search tree, but then it needs to be
> more careful about backtracking and pruning, and that's too much work for
> an evening of programming.
>
> Cheers,
>
>
> Andrew
> da...@dalkescientific.com
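
The grow-and-screen sketch quoted above can be illustrated with a toy model. This is not rev_eng_fp.py and does not use RDKit: molecules are assumed to be linear chains with one character per atom, the "fingerprint" is the unhashed set of all paths up to MAX_PATH atoms, and all names below are hypothetical.

```python
from collections import deque

ELEMENTS = ["C", "N", "O"]   # toy alphabet, one character per atom
MAX_PATH = 3                 # longest path (in atoms) the toy fingerprint records


def canon(chain: str) -> str:
    """Orientation-independent form of a linear chain."""
    return min(chain, chain[::-1])


def fingerprint(chain: str) -> frozenset:
    """Toy substructure fingerprint: every linear path of up to MAX_PATH atoms."""
    return frozenset(
        canon(chain[i:i + n])
        for n in range(1, MAX_PATH + 1)
        for i in range(len(chain) - n + 1)
    )


def reverse_engineer(target_fp: frozenset, max_atoms: int = 20) -> str:
    """Breadth-first growth; keep only candidates that pass the subset screen."""
    queue = deque(e for e in ELEMENTS if fingerprint(e) <= target_fp)
    seen = set(queue)
    best = ""
    while queue:
        candidate = queue.popleft()
        fp = fingerprint(candidate)
        if fp == target_fp:
            return candidate                      # all target features reproduced
        if len(fp) > len(fingerprint(best)):
            best = candidate                      # best partial match so far
        if len(candidate) >= max_atoms:
            continue
        for element in ELEMENTS:
            # Grow by one atom at either end of the chain.
            for grown in (candidate + element, element + candidate):
                grown = canon(grown)
                if grown not in seen and fingerprint(grown) <= target_fp:
                    seen.add(grown)
                    queue.append(grown)
    return best


print(reverse_engineer(fingerprint("CCO")))
print(reverse_engineer(fingerprint("CCCCO")))
```

In this toy model the second call stops at "CCCO", which sets exactly the same features as "CCCCO". A binary fingerprint match therefore does not always identify a unique structure.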
>
>
>
> 

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-23 Thread Maciek Wójcikowski
>
> >  which could of course also be changed to something expensive to
> calculate.
> Yes, that could be possible. Abstractly, let the first 20 bytes of each
> fingerprint be a salt, and use something like bcrypt so each fingerprint
> test requires that the query structure be re-fingerprinted for the
> per-fingerprint hash function.

I think salting is a must. If any money is at stake, I'd expect a comparable
amount of computing power to be used to crack it. The closest analogy, and a
workaround for slow hashing, is "rainbow tables" for strings: instead of
computing the hash, you just look it up. Without salting, such lookup tables
would not be that big, I suppose. If you had such a lookup table, then you'd
only need an algorithm (or GA) that builds a molecule from a set of
environments rather than building it randomly.


Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl



Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Andrew Dalke
On Apr 22, 2018, at 20:22, Nils Weskamp  wrote:
> Actually, I *was* also thinking about your use cases 2 and 3 since you
> also need some form of hash function to map substructures to bit
> numbers. This is normally a rather simple function / pseudo random
> generator, 

Strictly speaking, this is not a requirement.

The term "fingerprint" has taken on quite an encompassing meaning since 1990.

The molecular formula is a count fingerprint with 118 keys, based on the atomic 
number. There's no need for a hash function there. "CCO" might be:
  [0, 0, 0, 0, 0, 2, 0, 1, ...]

Or it can be written in more compact form like {"C": 2, "O": 1}.

As an alternative, I could use a mapping from canonical substructures to 
counts, so "CCO" becomes:

  {"C": 2, "O": 1, "CC": 1, "CO": 1, "CCO": 1}

This doesn't require a hash. (While I represent that as a Python dictionary, 
which uses a hash table underneath, it could be implemented using a red-black 
tree or B-tree, or with a simple linear search.)
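
The two hash-free count fingerprints described above can be sketched as follows. This is a simplified illustration, not RDKit or chemfp code: it assumes a linear chain given as a list of atom symbols, ignores hydrogens as the example above does, and uses hypothetical function names.

```python
from collections import Counter


def formula_counts(atoms):
    """Count fingerprint keyed by element symbol (heavy atoms only)."""
    return Counter(atoms)


def path_counts(atoms, max_len=3):
    """Count fingerprint keyed by canonical linear substructure.

    Assumes `atoms` lists the atoms of an unbranched chain in order.
    """
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(atoms) - n + 1):
            path = tuple(atoms[i:i + n])
            canonical = min(path, path[::-1])   # same key in both directions
            counts["".join(canonical)] += 1
    return counts


ethanol = ["C", "C", "O"]
print(dict(formula_counts(ethanol)))
print(dict(path_counts(ethanol)))
```

No hash function appears anywhere except inside Python's own dictionary, which is an implementation detail of the container and not of the fingerprint.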

It's only if I want to convert this into a fixed-length representation that I 
have to figure out some sort of encoding scheme.

Even then, I don't need a PRNG or hash seed. Suppose I use a bit vector. I 
could have a table which maps each canonical substructure to its bit pattern. 
If I have an unknown fragment, I could use RANDOM.ORG to get the bits.

Downsides include potentially unbounded table growth and the need for a 
centralized table.
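
A table-based encoding like the one described above might look like the sketch below. It is only an illustration: the class name, the 64-bit width and the three bits per pattern are assumptions, and Python's `secrets` module stands in for RANDOM.ORG as the source of random bits.

```python
import secrets


class PatternTable:
    """Central table mapping each canonical substructure to a fixed bit pattern."""

    def __init__(self, num_bits: int = 64, bits_per_pattern: int = 3):
        self.num_bits = num_bits
        self.bits_per_pattern = bits_per_pattern
        self._table = {}          # grows without bound as new fragments appear

    def pattern(self, substructure: str) -> int:
        """Look up the pattern, drawing fresh random bits for unknown fragments."""
        if substructure not in self._table:
            positions = set()
            while len(positions) < self.bits_per_pattern:
                positions.add(secrets.randbelow(self.num_bits))
            self._table[substructure] = sum(1 << p for p in positions)
        return self._table[substructure]

    def fingerprint(self, substructures) -> int:
        """Superimpose (OR together) the patterns of all substructures."""
        fp = 0
        for s in substructures:
            fp |= self.pattern(s)
        return fp


table = PatternTable()
fp = table.fingerprint(["C", "O", "CC", "CO", "CCO"])
print(f"{fp:016x}")
```

Because the patterns are random rather than computed, two parties can only compare fingerprints if they share the same table, which is the centralization downside noted above.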

This is the approach that Zatocoding used, and I see Chemical Zatocoding as the 
only precursor to Daylight hash fingerprints.

>  which could of course also be changed to something expensive to calculate.


Yes, that could be possible. Abstractly, let the first 20 bytes of each 
fingerprint be a salt, and use something like bcrypt so each fingerprint test 
requires that the query structure be re-fingerprinted for the per-fingerprint 
hash function.

It would, however, take an absurdly long time to do a similarity search.
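
A sketch of what that could look like, under stated assumptions: PBKDF2 from Python's standard library stands in for bcrypt (which is not in the standard library), fragments are plain strings, and each fragment sets a single bit. The function names are made up for this illustration.

```python
import hashlib
import os

def salted_bits(fragments, salt, num_bits=2048, iterations=100_000):
    """Map each fragment to one bit using a deliberately slow, salted hash."""
    on_bits = set()
    for fragment in fragments:
        digest = hashlib.pbkdf2_hmac(
            "sha256", fragment.encode("utf-8"), salt, iterations
        )
        on_bits.add(int.from_bytes(digest[:8], "big") % num_bits)
    return on_bits

def make_fingerprint(fragments, iterations=100_000):
    """Return (salt, on_bits); the 20-byte salt travels with the fingerprint."""
    salt = os.urandom(20)
    return salt, salted_bits(fragments, salt, iterations=iterations)

def query_passes_screen(query_fragments, salt, target_bits, iterations=100_000):
    # The query has to be re-fingerprinted under each target's own salt.
    return salted_bits(query_fragments, salt, iterations=iterations) <= target_bits
```

The cost shows up in `query_passes_screen`: one slow hash per query fragment, repeated for every fingerprint in the database.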

And in any case, before going further along that path, we would need to figure 
out the risk model. Brian started by saying that he wanted to obfuscate 
molecules for security, but didn't say what he wants to use them for, or whether 
he wants to secure them against nation-states, or simply against me. ;)



Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Nils Weskamp
Hi Andrew,

On 22.04.2018 at 19:35, Andrew Dalke wrote:
> I think of what I did here as a bit more elegant than that. ;)

I should have looked at the code more carefully before commenting.
;) Nevertheless, you will probably still need many steps for complex
structures - although not as many as I anticipated.

Actually, I *was* also thinking about your use cases 2 and 3 since you
also need some form of hash function to map substructures to bit
numbers. This is normally a rather simple function / pseudo random
generator, which could of course also be changed to something expensive
to calculate.

On second thought, one would of course have to make sure that this
function cannot be pre-calculated for some lookup table, which might be
challenging.

Best regards,
Nils



Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Andrew Dalke
On Apr 22, 2018, at 08:42, Nils Weskamp  wrote:
> Nice work. If brute-force approaches like this (or methods based on
> genetic algorithms etc.) are the only way to reverse a fingerprint, one
> could probably come up with a fingerprint that allows for pretty secure
> structure sharing by using many iterations of a strong cryptographic
> hash function that is really slow to calculate.

I usually think of "brute force" as something more like "enumerate all possible 
structures, generate the fingerprint for each one, and compare." I think of 
what I did here as a bit more elegant than that. ;)

I think of fingerprints as being applied to three uses:

1) identity test
if mol1 == mol2 then fp(mol1) == fp(mol2)
=> if fp(mol1) != fp(mol2) then mol1 != mol2

A good identity fingerprint has the properties that the false positive rate of
 fp(mol1) == fp(mol2) but mol1 != mol2
is low, and that working with fingerprints gives advantages over a graph 
isomorphism check.

2) Substructure screening
  if mol1 is a subgraph of mol2 then fp(mol1) is a subset of fp(mol2).
(There are different definitions of 'subset' for different fingerprint 
types.)

  An effective substructure screen fingerprint has the properties that:
  fp(mol1) is a subset of fp(mol2) => a higher likelihood that mol1 is a 
subgraph of mol2
   -and-
  fingerprint subset test is faster than subgraph isomorphism testing

3) Similarity comparison
  Use similarity of fp(mol1) and fp(mol2) as a proxy/estimate of the similarity 
of mol1 and mol2.

Usually we also assume that computing the similarity of two fingerprints is 
fast.
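
The three uses can be written down in a few lines if a binary fingerprint is modelled as a Python set of on-bit positions. That representation, the function names, and the example bit numbers are assumptions for illustration only.

```python
def identical(fp1, fp2):
    """Use 1: differing fingerprints prove the molecules differ."""
    return fp1 == fp2

def may_be_substructure(query_fp, target_fp):
    """Use 2: a failed subset test rules out a subgraph match."""
    return query_fp <= target_fp

def tanimoto(fp1, fp2):
    """Use 3: Tanimoto similarity of two sets of on-bits.

    Returns 0.0 when both fingerprints are empty (a convention chosen here).
    """
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0

# Made-up bit positions, not the output of any real fingerprinter.
small = {1, 5, 9, 12}
large = {1, 5, 9, 12, 20}
print(identical(small, large))            # False
print(may_be_substructure(small, large))  # True
print(tanimoto(small, large))             # 0.8
```

Only the first test gives a definite answer about the molecules, and only in the negative direction; the other two are screens or estimates.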


A cheminformatics fingerprint usually supports #1 and one or both of #2 and #3. 

If we only need #1, which is the use case Nils brought up, then we could use a 
SHA256 of the canonical SMILES string, or use the InChI Key. These are 
fixed-length binary fingerprints which can only be used for identity testing, 
and would give a low false positive rate.
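
For example, a sketch using Python's hashlib. It assumes the SMILES string has already been canonicalized by a toolkit; the function name is made up here.

```python
import hashlib

def identity_fingerprint(canonical_smiles):
    """256-bit identity-only fingerprint of an already-canonical SMILES."""
    return hashlib.sha256(canonical_smiles.encode("utf-8")).digest()

fp1 = identity_fingerprint("CCO")
fp2 = identity_fingerprint("CCO")
fp3 = identity_fingerprint("CCN")
print(len(fp1) * 8)  # 256
print(fp1 == fp2)    # True
print(fp1 == fp3)    # False
```

The digests for "CCO" and "CCN" share nothing useful, which is exactly why this kind of fingerprint cannot support a subset or similarity test.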

The structure leakage comes from needing support for #2 and/or #3.

I don't see any reasonable way to make a fingerprint that can do #2 and/or #3 
without being open to some sort of enumeration scheme that is more clever than 
brute force.

Possibly some scheme related to homomorphic encryption might work? As I 
understand it, this would be unreasonably slow for what most people expect from 
fingerprints.

Cheers,

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-21 Thread Nils Weskamp
On 22.04.2018 at 03:04, Andrew Dalke wrote:
> Here's an implementation of that sketch, applied to the RDKit hash 
> fingerprint:

Nice work. If brute-force approaches like this (or methods based on
genetic algorithms etc.) are the only way to reverse a fingerprint, one
could probably come up with a fingerprint that allows for pretty secure
structure sharing by using many iterations of a strong cryptographic
hash function that is really slow to calculate.

Nils



Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-21 Thread Andrew Dalke
On Apr 21, 2018, at 01:55, Andrew Dalke  wrote:
> Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass 
> the screening test. If not, the structure contains no carbons, so repeat with 
> other elements until you find an atom which passes. Successively either add 
> an atom+bond or connect two existing atoms with a bond, fingerprint the 
> result, and do the screening test. If it does not pass then that modification 
> was not permitted. Use a breadth-first search which prioritizes branching and 
> rings to avoid chains longer than the maximum enumeration size.

Here's an implementation of that sketch, applied to the RDKit hash fingerprint:

  http://dalkescientific.com/rev_eng_fp.py

It works well for small structures:

% python rev_env_fp.py
No SMILES given. Using caffeine.
Current best guess is C=C with 2 bits of 759
Current best guess is Cc=O with 6 bits of 759
Found! Cn1c(=O)c2c(ncn2C)n(C)c1=O

Here's aspirin:

% python rev_env_fp.py 'O=C(C)Oc1ccccc1C(=O)O'
Found! CC(=O)Oc1ccccc1C(=O)O

Capsaicin is close, only missing a methyl in the tail.

% python rev_env_fp.py 'O=C(NCc1cc(OC)c(O)cc1)C=CC(C)C'
Current best guess is CNC(=O)C=CC(C)C with 100 bits of 384
Current best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
Best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384


For omeprazole it only finds half of the structure:

% python rev_env_fp.py 'COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C'
Current best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
Best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863

For estradiol it gets stuck finding another cyclohexane instead of the 
cyclopentane:

% python rev_env_fp.py 'C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1CC[C@@H]2O)O'
Current best guess is CC12CCC(O)C21C with 163 bits of 583
Current best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
Best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583


Note: it's currently set up to only consider the elements
  ["C", "c", "O", "o", "N", "n", "S", "s", "F", "Cl", "Br"]

Edit the 'elements' list if you want to include more possibilities. This is 
more likely to run into a dead-end.


The current code assumes that when I grow by one atom, if fp(mol + new atom) is 
a subset of the target fingerprint, then mol + new_atom is a subgraph of the 
target structure.

That assumption can fail, because the subset test is only a screen and can 
give false positives. This can be resolved by setting up a search tree, but 
then it needs to be more careful about backtracking and pruning, and that's too 
much work for an evening of programming.

Cheers,


Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread Andrew Dalke
On Apr 20, 2018, at 19:03, jeff godden  wrote:
> 
> Long ago molecular fingerprints were referred to in the literature as 
> molecular hash functions. (y'know, those crazy mathematical algorithms which 
> permitted rapid lookup of some string in a lookup table)

Do you have a reference or any more detail for that?

As far as I can tell, effectively everyone used fragment-based screens from the 
punched-card era of the 1940s up until the late 1980s. It wasn't until the 
early 1990s that the word "fingerprint" appeared in the cheminformatics 
literature, in a paper by John Barnard referencing the Daylight fingerprints.

I haven't come across any use of molecular hash functions for fingerprint-like 
descriptors before the Daylight work, and I looked pretty hard for one.

Instead, I get the feeling that there was some attempt in the 1990s to 
distinguish between these two approaches, and then over time the term 
"fingerprint" took on a much broader meaning than "Daylight's binary 
fingerprint based on hash-encoding of enumerated linear subgraphs." That is, I 
think that "molecular hash functions" post-date "fingerprint".

FWIW, a pubs.acs.org search for "molecular hash" found only three papers; one 
from 2005 and the other two from 2015.

The closest exception to an earlier Daylight-like fingerprint is the 
superimposed coding created by Mooers in the 1940s, mentioned in Ray and Kirsch 
(1957), put to use in Feldman and Hodes (1975), and further used at a couple of 
places since then.

These *are* connected; Mooers in his "Codes and Coding" entry from 
"Encyclopedia of Library and Information Science" remarks: "A rudimentary form 
of use of random patterns was discovered by computer people about 10 years 
later for fast look-up in tables. This simpler form, which effectively uses 
only a single descriptor, is known in the computer industry as 'hash coding.'" 
Weininger and Mooers both drew from the same information theory concepts to 
develop fingerprints and superimposed coding, respectively.

But they aren't the same.

I found hashing used for other molecular-related topics, like WLN and IR 
spectra lookups, but these didn't seem to be *molecular* hash functions.


>  As such, we expected for their to be the associated hash collisions  
> (https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ).

As Peter Shenkin pointed out, this isn't a given.

In the MMP code I helped develop (mmpdb - https://github.com/rdkit/mmpdb ), one 
of the novel features is the ability to match not just the pairs but the local 
attachment environment, based on the circular environment of r=1 up to r=5 from 
the attachment point.

I created a fingerprint for that based on the fingerprints for the individual 
circular environments, concatenated together, and then SHA256'ed to get a 
unique characteristic.

Unlike the Daylight approach, this fingerprint can only be used to check for 
identity. The requirement that a fingerprint be used for similarity and 
substructure screens makes them much larger than needed for a simple identity 
check.

And as Dave Cosgrove rightly pointed out, this extra information makes it 
possible to reverse engineer a Daylight-style hash fingerprint to find a 
molecular graph which is at least isospectral to the original structure.

Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass 
the screening test. If not, the structure contains no carbons, so repeat with 
other elements until you find an atom which passes. Successively either add an 
atom+bond or connect two existing atoms with a bond, fingerprint the result, 
and do the screening test. If it does not pass then that modification was not 
permitted. Use a breadth-first search which prioritizes branching and rings to 
avoid chains longer than the maximum enumeration size.

You'll also need to allow aromatic atoms in a non-ring so you can do the growth 
correctly. 
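
The grow-and-screen idea can be shown with a deliberately tiny toy. This is not the RDKit-based script and not real chemistry: "molecules" are unbranched chains written as strings of one-letter atoms, the toy fingerprint is the set of hashed contiguous sub-chains, and growth only adds an atom at either end. All names here are assumptions for illustration.

```python
import hashlib

ELEMENTS = ["C", "N", "O"]

def path_bit(path):
    """Hash a linear path to a 64-bit feature id; a path and its reverse share one id."""
    canonical = min(path, path[::-1])
    return int.from_bytes(hashlib.sha256(canonical.encode()).digest()[:8], "big")

def toy_fp(chain, max_path=7):
    """Toy fingerprint: ids of all contiguous sub-chains up to max_path atoms."""
    return frozenset(
        path_bit(chain[i:i + n])
        for n in range(1, max_path + 1)
        for i in range(len(chain) - n + 1)
    )

def grow(target_fp, max_atoms=20):
    """Greedily extend a chain while its fingerprint stays a subset of target_fp."""
    # Find a starting atom that passes the screen.
    chain = next((e for e in ELEMENTS if toy_fp(e) <= target_fp), None)
    if chain is None:
        return None
    while len(chain) < max_atoms and toy_fp(chain) != target_fp:
        candidates = [chain + e for e in ELEMENTS] + [e + chain for e in ELEMENTS]
        passing = [c for c in candidates if toy_fp(c) <= target_fp]
        if not passing:
            break  # dead end: no single-atom extension passes the screen
        chain = passing[0]
    return chain

# Only the fingerprint of the "unknown" chain is handed to grow().
print(grow(toy_fp("CCOCN")))  # CCOCN
```

Like the sketch it illustrates, this takes the first extension that passes and never backtracks, so it can stall on a partial structure; branching, rings, and aromatic atoms would need a real toolkit.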

ECFP-style circular fingerprints are not designed for substructure screens so 
cannot be reverse-engineered this easily. It would be interesting to try the GA 
method that Dave Cosgrove suggested.

I know of no papers concerning this topic, and I doubt that Dave Weininger ever 
published anything about it. He wasn't much into publishing in the scientific 
literature. 


Going back to the mmpdb environment fingerprint, it was also designed so that 
organizations can feel a bit more comfortable sharing MMP data with other 
organizations, since (like the InChI-Key) it's not possible to guess what an 
mmpdb environment fingerprint describes unless you 1) have it already, or 2) 
are willing to brute-force reasonable chemistry space to find it.

Interestingly, this use of "fingerprint" is more closely aligned with Rabin's 
1981 work on fingerprints - cryptographic hashes used for identity checking; 
see http://www.xmailserver.org/rabin.pdf - than with Daylight fingerprints.

When I asked a few years ago, Dave Weininger did not recall how they came up 
with the te

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread Michal Krompiec
I’ve just done an analysis of frequency of hash collisions for Morgan
fingerprints, on a combinatorial library of 1M potential organic
semiconductors. To reduce collisions below 0.5% (meaning: >99.5%
fingerprints are unique), the radius has to be at least 5 (corresponding to
ECFP10) and number of bits needs to be at least 128. The same conclusion
was obtained on a random 250k subset. Folding to 256 bits or more (tried up
to 2048), or increasing the radius (up to 8) offered modest improvements.
Below 100 bits, frequency of collisions increased dramatically. So it was
possible to choose a small(ish) fingerprint which is quite unique, even
though it is a combinatorial library of rather similar compounds,
containing many sets of isomers etc.
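
A sketch of how such a uniqueness check could be scripted, in case anyone wants to repeat it on their own library. It assumes each fingerprint is available as a set of on-bit positions and that folding is a simple modulo; the function names and the three example fingerprints are made up.

```python
from collections import Counter

def fold(on_bits, num_bits):
    """Fold a set of on-bit positions into a shorter fingerprint of num_bits bits."""
    return frozenset(bit % num_bits for bit in on_bits)

def unique_fraction(fingerprints):
    """Fraction of fingerprints that no other entry in the list shares."""
    counts = Counter(fingerprints)
    return sum(1 for fp in fingerprints if counts[fp] == 1) / len(fingerprints)

# Three made-up fingerprints that are distinct at 2048 bits.
library = [{1, 130}, {2, 130}, {129, 2}]
for size in (2048, 128):
    folded = [fold(fp, size) for fp in library]
    print(size, unique_fraction(folded))
```

At 2048 bits all three stay distinct; at 128 bits the first and third fold onto the same pattern, so only one of the three remains unique.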

Regards,
Michal

On Fri, 20 Apr 2018 at 19:02, David Cosgrove 
wrote:

> Hi Jeff,
> What you say is theoretically correct, in that it is probably not possible
> to go from the fingerprint directly to a structure. However, it is possible
> to generate structures and rapidly compare them to the target fingerprint.
> The fingerprints are of course able to tell you how close your structure is
> to the target fingerprint in a way that can drive an optimisation
> algorithm. Chemistry adds strong constraints to what structures are
> possible, which reduces the search space dramatically and if you know it’s
> a “drug-like” molecule you’re looking for, even more so.
> People forget that Daylight originally developed fingerprints to speed up
> substructural searching of databases. A structure can only be a
> substructure of another molecule if all the bits it sets are also in the
> other molecule. They are specifically designed to encode the molecular
> structure, and that’s why a GA can be successful. As Peter says, the same
> fingerprint can be generated for different molecules, but this will be rare
> if the fingerprint is well designed. Try it on Chembl with an RDKit
> fingerprint and I’ll be surprised if you get more than 10 pairs that aren’t
> isomers of each other or something trivial like that.
> Regards,
> Dave
>
> On Fri, 20 Apr 2018 at 18:49, Peter S. Shenkin  wrote:
>
>> Well, @jeff, there's no law saying that hashes must collide, and in fact
>> some are designed to make collision extremely unlikely (can you say
>> "SHA-2"?). But the ones in question here do collide relatively frequently,
>> for at least some molecular fingerprint types.
>>
>> An interesting question (maybe only to me :-) ) would be how similar, in
>> general, the structures are that exhibit identical fingerprints, for the
>> well-known fingerprint types, for various fingerprint lengths. A
>> sufficiently complicated molecule will give lots of on bits, and for (say)
>> a 64-bit fingerprint, there can only be 64 possible fingerprints with all
>> but one bit turned on.
>>
>> I realize that most fingerprints in common use today are longer than
>> this, but still, looking back at 64- and 32-bit fingerprints with all but
>> one bits on might give some insight. How short does a fingerprint of some
>> particular type have to be for, say, 10% of CHEMBL molecules to exhibit an
>> all-on pattern? How short does it have to be for, say, 10% of CHEMBL
>> molecules to have an exact fingerprint match with some other molecule?
>>
>> -P
>>
>> On Fri, Apr 20, 2018 at 1:03 PM, jeff godden  wrote:
>>
>>> Long ago molecular fingerprints were referred to in the literature as
>>> molecular hash functions. (y'know, those crazy mathematical algorithms
>>> which permitted rapid lookup of some string in a lookup table)  As such, we
>>> expected for their to be the associated hash collisions  (
>>> https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ).  All
>>> this by way of saying that to go from fingerprint to the molecular
>>> structure which produced it is traditionally impossible unless the
>>> fingerprint no longer amounts to a hash(ing) function.
>>> --
>>> j
>>>
>>>
>>> On Fri, Apr 20, 2018 at 9:56 AM, Peter S. Shenkin 
>>> wrote:
>>>
 Isn't it the case that more than one molecule can share an identical
 fingerprint? (Depending on the specific fingerprint.) Think p-biphenyl,
 extended to triphenyl, tetraphenyl, etc. Still, a GA or SA method could
 keep going and come up with multiple matches, plus multiple near-misses.

 -P.

 On Fri, Apr 20, 2018 at 10:58 AM, David Cosgrove <
 davidacosgrov...@gmail.com> wrote:

> Hi Brian,
> Dave Weininger once showed a fairly simple GA that could generally
> deduce a structure from a daylight fingerprint by using SMILES strings as
> the chromosomes and tanimoto distance to the target fingerprint as the
> fitness function.  He may have done a talk about it for MUG or conceivably
> written it up. It’d be in JCICS if so, I expect.
>
> You could probably knock up a script to do that in a couple of hours I
> would think using a GA library to do the mechanics. If you’re not worried
> about high efficien

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread David Cosgrove
Hi Jeff,
What you say is theoretically correct, in that it is probably not possible
to go from the fingerprint directly to a structure. However, it is possible
to generate structures and rapidly compare them to the target fingerprint.
The fingerprints are of course able to tell you how close your structure is
to the target fingerprint in a way that can drive an optimisation
algorithm. Chemistry adds strong constraints to what structures are
possible, which reduces the search space dramatically and if you know it’s
a “drug-like” molecule you’re looking for, even more so.
People forget that Daylight originally developed fingerprints to speed up
substructural searching of databases. A structure can only be a
substructure of another molecule if all the bits it sets are also in the
other molecule. They are specifically designed to encode the molecular
structure, and that’s why a GA can be successful. As Peter says, the same
fingerprint can be generated for different molecules, but this will be rare
if the fingerprint is well designed. Try it on Chembl with an RDKit
fingerprint and I’ll be surprised if you get more than 10 pairs that aren’t
isomers of each other or something trivial like that.
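
A rough sketch of the scoring step such an optimiser would need. The fingerprinting function is passed in by the caller and is purely hypothetical here: it is assumed to return a set of on-bits, or None when the candidate string cannot be parsed.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two sets of on-bits (0.0 if both are empty)."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0

def fitness(candidate, target_fp, fingerprint_fn):
    """Score a candidate string against the target fingerprint.

    fingerprint_fn is a caller-supplied (hypothetical) function that returns a
    set of on-bits, or None when the candidate cannot be parsed.
    """
    fp = fingerprint_fn(candidate)
    if fp is None:
        return 0.0  # unparseable candidates simply get the worst score
    return tanimoto(fp, target_fp)

def best_candidate(candidates, target_fp, fingerprint_fn):
    """Pick the highest-scoring member of a population."""
    return max(candidates, key=lambda c: fitness(c, target_fp, fingerprint_fn))
```

A GA would wrap mutation and crossover around `best_candidate`; the score alone is enough to say whether a change moved the candidate closer to the target.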
Regards,
Dave

On Fri, 20 Apr 2018 at 18:49, Peter S. Shenkin  wrote:

> Well, @jeff, there's no law saying that hashes must collide, and in fact
> some are designed to make collision extremely unlikely (can you say
> "SHA-2"?). But the ones in question here do collide relatively frequently,
> for at least some molecular fingerprint types.
>
> An interesting question (maybe only to me :-) ) would be how similar, in
> general, the structures are that exhibit identical fingerprints, for the
> well-known fingerprint types, for various fingerprint lengths. A
> sufficiently complicated molecule will give lots of on bits, and for (say)
> a 64-bit fingerprint, there can only be 64 possible fingerprints with all
> but one bit turned on.
>
> I realize that most fingerprints in common use today are longer than this,
> but still, looking back at 64- and 32-bit fingerprints with all but one
> bits on might give some insight. How short does a fingerprint of some
> particular type have to be for, say, 10% of CHEMBL molecules to exhibit an
> all-on pattern? How short does it have to be for, say, 10% of CHEMBL
> molecules to have an exact fingerprint match with some other molecule?
>
> -P
>
> On Fri, Apr 20, 2018 at 1:03 PM, jeff godden  wrote:
>
>> Long ago molecular fingerprints were referred to in the literature as
>> molecular hash functions. (y'know, those crazy mathematical algorithms
>> which permitted rapid lookup of some string in a lookup table)  As such, we
>> expected for their to be the associated hash collisions  (
>> https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ).  All
>> this by way of saying that to go from fingerprint to the molecular
>> structure which produced it is traditionally impossible unless the
>> fingerprint no longer amounts to a hash(ing) function.
>> --
>> j
>>
>>
>> On Fri, Apr 20, 2018 at 9:56 AM, Peter S. Shenkin 
>> wrote:
>>
>>> Isn't it the case that more than one molecule can share an identical
>>> fingerprint? (Depending on the specific fingerprint.) Think p-biphenyl,
>>> extended to triphenyl, tetraphenyl, etc. Still, a GA or SA method could
>>> keep going and come up with multiple matches, plus multiple near-misses.
>>>
>>> -P.
>>>
>>> On Fri, Apr 20, 2018 at 10:58 AM, David Cosgrove <
>>> davidacosgrov...@gmail.com> wrote:
>>>
 Hi Brian,
 Dave Weininger once showed a fairly simple GA that could generally
 deduce a structure from a daylight fingerprint by using SMILES strings as
 the chromosomes and tanimoto distance to the target fingerprint as the
 fitness function.  He may have done a talk about it for MUG or conceivably
 written it up. It’d be in JCICS if so, I expect.

 You could probably knock up a script to do that in a couple of hours I
 would think using a GA library to do the mechanics. If you’re not worried
 about high efficiency, you don’t need to do anything fancy with mutation
 and crossover of the SMILES strings to ensure you always get a valid
 molecule, you can just give a fitness of 0 if the SMILES parser doesn’t
 like what you give it.
 HTH,
 Dave


 On Fri, 20 Apr 2018 at 14:45, Nils Weskamp 
 wrote:

> Hi Brian,
>
> in general, it might be difficult to come up with a deterministic
> algorithm that generates exactly one structure for a given fingerprint due
> to many ambiguities in the process. If you are happy with a more "fuzzy"
> (approximate / probabilistic) approach, you might want to take a look at
>
> https://pubs.acs.org/doi/abs/10.1021/ci600383v
> https://link.springer.com/article/10.1007/s10822-005-9020-4
>
> Given this task, I would probably start with a large database of known
> compounds (PubChem, UniChem, GDB17), calculate

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread jeff godden
(getting dangerously old fart chatty here but) we crafted an in-house
molecular fingerprint once which was designed to hash out whether a
compound would've pissed off the high-throughput/organic-chemists or not.
(essentially anything with "exotic atoms" (like Boron?) or strained bonds
(like less than 120 degrees)).  so that fingerprint collided all of
chemical space into two bins.  "That's not a fingerprint!"...yeah, but it
fell right into the code alongside the fingerprints and was used as such.

Now, any bets are off on whether we used the HTS-fingerprint to _find_ or
exclude molecules [wink]

(ok returning to lurker-mode now)
-- 
j

On Fri, Apr 20, 2018 at 10:49 AM, Peter S. Shenkin 
wrote:

> Well, @jeff, there's no law saying that hashes must collide, and in fact
> some are designed to make collision extremely unlikely (can you say
> "SHA-2"?). But the ones in question here do collide relatively frequently,
> for at least some molecular fingerprint types.
>
> An interesting question (maybe only to me :-) ) would be how similar, in
> general, the structures are that exhibit identical fingerprints, for the
> well-known fingerprint types, for various fingerprint lengths. A
> sufficiently complicated molecule will give lots of on bits, and for (say)
> a 64-bit fingerprint, there can only be 64 possible fingerprints with all
> but one bit turned on.
>
> I realize that most fingerprints in common use today are longer than this,
> but still, looking back at 64- and 32-bit fingerprints with all but one
> bits on might give some insight. How short does a fingerprint of some
> particular type have to be for, say, 10% of CHEMBL molecules to exhibit an
> all-on pattern? How short does it have to be for, say, 10% of CHEMBL
> molecules to have an exact fingerprint match with some other molecule?
>
> -P
>
> On Fri, Apr 20, 2018 at 1:03 PM, jeff godden  wrote:
>
>> Long ago molecular fingerprints were referred to in the literature as
>> molecular hash functions. (y'know, those crazy mathematical algorithms
>> which permitted rapid lookup of some string in a lookup table)  As such, we
>> expected for their to be the associated hash collisions  (
>> https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ).  All
>> this by way of saying that to go from fingerprint to the molecular
>> structure which produced it is traditionally impossible unless the
>> fingerprint no longer amounts to a hash(ing) function.
>> --
>> j
>>
>>
>> On Fri, Apr 20, 2018 at 9:56 AM, Peter S. Shenkin 
>> wrote:
>>
>>> Isn't it the case that more than one molecule can share an identical
>>> fingerprint? (Depending on the specific fingerprint.) Think p-biphenyl,
>>> extended to triphenyl, tetraphenyl, etc. Still, a GA or SA method could
>>> keep going and come up with multiple matches, plus multiple near-misses.
>>>
>>> -P.
>>>
>>> On Fri, Apr 20, 2018 at 10:58 AM, David Cosgrove <
>>> davidacosgrov...@gmail.com> wrote:
>>>
 Hi Brian,
 Dave Weininger once showed a fairly simple GA that could generally
 deduce a structure from a daylight fingerprint by using SMILES strings as
 the chromosomes and tanimoto distance to the target fingerprint as the
 fitness function.  He may have done a talk about it for MUG or conceivably
 written it up. It’d be in JCICS if so, I expect.

 You could probably knock up a script to do that in a couple of hours I
 would think using a GA library to do the mechanics. If you’re not worried
 about high efficiency, you don’t need to do anything fancy with mutation
 and crossover of the SMILES strings to ensure you always get a valid
 molecule, you can just give a fitness of 0 if the SMILES parser doesn’t
 like what you give it.
 HTH,
 Dave


 On Fri, 20 Apr 2018 at 14:45, Nils Weskamp 
 wrote:

> Hi Brian,
>
> in general, it might be difficult to come up with a deterministic
> algorithm that generates exactly one structure for a given fingerprint due
> to many ambiguities in the process. If you are happy with a more "fuzzy"
> (approximate / probabilistic) approach, you might want to take a look at
>
> https://pubs.acs.org/doi/abs/10.1021/ci600383v
> https://link.springer.com/article/10.1007/s10822-005-9020-4
>
> Given this task, I would probably start with a large database of known
> compounds (PubChem, UniChem, GDB17), calculate fingerprints and then do a
> similarity search with my query fingerprint.
>
> Hope this helps,
> Nils
>
>
> On Fri, Apr 20, 2018 at 3:13 PM, Brian Cole  wrote:
>
>> Hi Chem-informaticians:
>>
>> I know it has been talked about in the community that fingerprints
>> are not a way to obfuscate molecules for security, but I don't recall a
>> paper actually demonstrating actual reverse engineering a fingerprint 
>> into
>> a chemical structure. Does anyone know if such a paper exists?
>>
>> Code using RDK

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread Peter S. Shenkin
Well, @jeff, there's no law saying that hashes must collide, and in fact
some are designed to make collision extremely unlikely (can you say
"SHA-2"?). But the ones in question here do collide relatively frequently,
for at least some molecular fingerprint types.

An interesting question (maybe only to me :-) ) would be how similar, in
general, the structures are that exhibit identical fingerprints, for the
well-known fingerprint types, for various fingerprint lengths. A
sufficiently complicated molecule will give lots of on bits, and for (say)
a 64-bit fingerprint, there can only be 64 possible fingerprints with all
but one bit turned on.
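
The count is just a binomial coefficient, which a couple of lines of Python make explicit (the function name is made up for this illustration):

```python
from math import comb

def patterns_with_k_bits_off(num_bits, k):
    """Number of distinct num_bits-long fingerprints with exactly k bits off."""
    return comb(num_bits, k)

print(patterns_with_k_bits_off(64, 1))  # 64
print(patterns_with_k_bits_off(64, 2))  # 2016
print(patterns_with_k_bits_off(32, 1))  # 32
```

So once molecules are complex enough to saturate a short fingerprint, very many of them have to share very few patterns.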

I realize that most fingerprints in common use today are longer than this,
but still, looking back at 64- and 32-bit fingerprints with all but one
bit on might give some insight. How short does a fingerprint of some
particular type have to be for, say, 10% of CHEMBL molecules to exhibit an
all-on pattern? How short does it have to be for, say, 10% of CHEMBL
molecules to have an exact fingerprint match with some other molecule?

-P

On Fri, Apr 20, 2018 at 1:03 PM, jeff godden  wrote:

> Long ago molecular fingerprints were referred to in the literature as
> molecular hash functions. (y'know, those crazy mathematical algorithms
> which permitted rapid lookup of some string in a lookup table)  As such, we
> expected for their to be the associated hash collisions  (
> https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ).  All
> this by way of saying that to go from fingerprint to the molecular
> structure which produced it is traditionally impossible unless the
> fingerprint no longer amounts to a hash(ing) function.
> --
> j
>
>
> On Fri, Apr 20, 2018 at 9:56 AM, Peter S. Shenkin 
> wrote:
>
>> Isn't it the case that more than one molecule can share an identical
>> fingerprint? (Depending on the specific fingerprint.) Think p-biphenyl,
>> extended to triphenyl, tetraphenyl, etc. Still, a GA or SA method could
>> keep going and come up with multiple matches, plus multiple near-misses.
>>
>> -P.
>>
>> On Fri, Apr 20, 2018 at 10:58 AM, David Cosgrove <
>> davidacosgrov...@gmail.com> wrote:
>>
>>> Hi Brian,
>>> Dave Weininger once showed a fairly simple GA that could generally
>>> deduce a structure from a daylight fingerprint by using SMILES strings as
>>> the chromosomes and tanimoto distance to the target fingerprint as the
>>> fitness function.  He may have done a talk about it for MUG or conceivably
>>> written it up. It’d be in JCICS if so, I expect.
>>>
>>> You could probably knock up a script to do that in a couple of hours I
>>> would think using a GA library to do the mechanics. If you’re not worried
>>> about high efficiency, you don’t need to do anything fancy with mutation
>>> and crossover of the SMILES strings to ensure you always get a valid
>>> molecule, you can just give a fitness of 0 if the SMILES parser doesn’t
>>> like what you give it.
>>> HTH,
>>> Dave
>>>
>>>
>>> On Fri, 20 Apr 2018 at 14:45, Nils Weskamp 
>>> wrote:
>>>
>>>> Hi Brian,
>>>>
>>>> in general, it might be difficult to come up with a deterministic
>>>> algorithm that generates exactly one structure for a given fingerprint due
>>>> to many ambiguities in the process. If you are happy with a more "fuzzy"
>>>> (approximate / probabilistic) approach, you might want to take a look at
>>>>
>>>> https://pubs.acs.org/doi/abs/10.1021/ci600383v
>>>> https://link.springer.com/article/10.1007/s10822-005-9020-4
>>>>
>>>> Given this task, I would probably start with a large database of known
>>>> compounds (PubChem, UniChem, GDB17), calculate fingerprints and then do a
>>>> similarity search with my query fingerprint.
>>>>
>>>> Hope this helps,
>>>> Nils
>>>>
>>>>
>>>> On Fri, Apr 20, 2018 at 3:13 PM, Brian Cole wrote:
>>>>
>>>>> Hi Chem-informaticians:
>>>>>
>>>>> I know it has been talked about in the community that fingerprints are
>>>>> not a way to obfuscate molecules for security, but I don't recall a paper
>>>>> actually demonstrating actual reverse engineering a fingerprint into a
>>>>> chemical structure. Does anyone know if such a paper exists?
>>>>>
>>>>> Code using RDKit to demonstrate the functionality would be an obvious
>>>>> bonus as well. :-)
>>>>>
>>>>> Thanks,
>>>>> Brian
>>>>>
>>>>>
>>>>> --
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>>> ___
>>>>> Rdkit-discuss mailing list
>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>>
>>>>>
 

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread jeff godden
Long ago, molecular fingerprints were referred to in the literature as
molecular hash functions (y'know, those crazy mathematical algorithms
that permit rapid lookup of some string in a lookup table). As such, we
expected there to be the associated hash collisions (
https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ). All this
by way of saying that going from a fingerprint to the molecular structure
which produced it is traditionally impossible unless the fingerprint no
longer amounts to a hash(ing) function.
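
To make the collision point concrete, here is a minimal sketch (not taken
from any real fingerprint implementation). It folds feature strings into a
hypothetical 8-bit fingerprint and uses the pigeonhole principle to find two
different inputs that set the same bit. The feature names, the bit count and
the use of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib


def fold(features, n_bits=8):
    """Hash each feature string to one of n_bits positions; return the on-bits."""
    return frozenset(zlib.crc32(f.encode()) % n_bits for f in features)


def find_collision(n_bits=8):
    """Return two distinct single-feature inputs with the same fingerprint."""
    seen = {}
    # n_bits + 1 distinct inputs into n_bits slots must repeat (pigeonhole).
    for i in range(n_bits + 1):
        feature = f"fragment-{i}"  # hypothetical feature name
        fp = fold([feature], n_bits)
        if fp in seen:
            return seen[fp], feature, fp
        seen[fp] = feature
    raise AssertionError("unreachable: pigeonhole principle")


a, b, fp = find_collision()
print(a, b, sorted(fp))
```

Given only the shared bit, nothing distinguishes the two inputs, so the best
an inversion can do is return a set of candidates.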
-- 
j


On Fri, Apr 20, 2018 at 9:56 AM, Peter S. Shenkin  wrote:

> Isn't it the case that more than one molecule can share an identical
> fingerprint? (Depending on the specific fingerprint.) Think p-biphenyl,
> extended to triphenyl, tetraphenyl, etc. Still, a GA or SA method could
> keep going and come up with multiple matches, plus multiple near-misses.
>
> -P.
>
> On Fri, Apr 20, 2018 at 10:58 AM, David Cosgrove <
> davidacosgrov...@gmail.com> wrote:
>
>> Hi Brian,
>> Dave Weininger once showed a fairly simple GA that could generally deduce
>> a structure from a daylight fingerprint by using SMILES strings as the
>> chromosomes and tanimoto distance to the target fingerprint as the fitness
>> function.  He may have done a talk about it for MUG or conceivably written
>> it up. It’d be in JCICS if so, I expect.
>>
>> You could probably knock up a script to do that in a couple of hours I
>> would think using a GA library to do the mechanics. If you’re not worried
>> about high efficiency, you don’t need to do anything fancy with mutation
>> and crossover of the SMILES strings to ensure you always get a valid
>> molecule, you can just give a fitness of 0 if the SMILES parser doesn’t
>> like what you give it.
>> HTH,
>> Dave
>>
>>
>> On Fri, 20 Apr 2018 at 14:45, Nils Weskamp 
>> wrote:
>>
>>> Hi Brian,
>>>
>>> in general, it might be difficult to come up with a deterministic
>>> algorithm that generates exactly one structure for a given fingerprint due
>>> to many ambiguities in the process. If you are happy with a more "fuzzy"
>>> (approximate / probabilistic) approach, you might want to take a look at
>>>
>>> https://pubs.acs.org/doi/abs/10.1021/ci600383v
>>> https://link.springer.com/article/10.1007/s10822-005-9020-4
>>>
>>> Given this task, I would probably start with a large database of known
>>> compounds (PubChem, UniChem, GDB17), calculate fingerprints and then do a
>>> similarity search with my query fingerprint.
>>>
>>> Hope this helps,
>>> Nils
>>>
>>>
>>> On Fri, Apr 20, 2018 at 3:13 PM, Brian Cole  wrote:
>>>
>>>> Hi Chem-informaticians:
>>>>
>>>> I know it has been talked about in the community that fingerprints are
>>>> not a way to obfuscate molecules for security, but I don't recall a paper
>>>> actually demonstrating actual reverse engineering a fingerprint into a
>>>> chemical structure. Does anyone know if such a paper exists?
>>>>
>>>> Code using RDKit to demonstrate the functionality would be an obvious
>>>> bonus as well. :-)
>>>>
>>>> Thanks,
>>>> Brian

 
>> --
>> David Cosgrove
>> Freelance computational chemistry and chemoinformatics developer
>> http://cozchemix.co.uk
>>

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread Peter S. Shenkin
Isn't it the case that more than one molecule can share an identical
fingerprint? (Depending on the specific fingerprint.) Think p-biphenyl,
extended to triphenyl, tetraphenyl, etc. Still, a GA or SA method could
keep going and come up with multiple matches, plus multiple near-misses.
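
A toy illustration of that degeneracy, under stated assumptions: the
"fingerprint" below is just the set of contiguous atom paths of up to three
atoms in a linear chain, recording presence only and not counts. This is a
stand-in for a path-based fingerprint, not any toolkit's actual algorithm,
and the plain carbon chains stand in for the repeated phenyl units.

```python
def path_features(chain, max_atoms=3):
    """Set of all contiguous atom paths of up to max_atoms atoms in a chain."""
    return frozenset(
        chain[i:i + n]
        for n in range(1, max_atoms + 1)
        for i in range(len(chain) - n + 1)
    )


short, longer = "C" * 4, "C" * 5  # two different chain lengths
print(path_features(short) == path_features(longer))  # True
```

Once the chain is longer than the maximum path length, adding another
repeat unit introduces no new feature, so the two fingerprints coincide.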

-P.

On Fri, Apr 20, 2018 at 10:58 AM, David Cosgrove  wrote:

> Hi Brian,
> Dave Weininger once showed a fairly simple GA that could generally deduce
> a structure from a daylight fingerprint by using SMILES strings as the
> chromosomes and tanimoto distance to the target fingerprint as the fitness
> function.  He may have done a talk about it for MUG or conceivably written
> it up. It’d be in JCICS if so, I expect.
>
> You could probably knock up a script to do that in a couple of hours I
> would think using a GA library to do the mechanics. If you’re not worried
> about high efficiency, you don’t need to do anything fancy with mutation
> and crossover of the SMILES strings to ensure you always get a valid
> molecule, you can just give a fitness of 0 if the SMILES parser doesn’t
> like what you give it.
> HTH,
> Dave
>
>
> On Fri, 20 Apr 2018 at 14:45, Nils Weskamp  wrote:
>
>> Hi Brian,
>>
>> in general, it might be difficult to come up with a deterministic
>> algorithm that generates exactly one structure for a given fingerprint due
>> to many ambiguities in the process. If you are happy with a more "fuzzy"
>> (approximate / probabilistic) approach, you might want to take a look at
>>
>> https://pubs.acs.org/doi/abs/10.1021/ci600383v
>> https://link.springer.com/article/10.1007/s10822-005-9020-4
>>
>> Given this task, I would probably start with a large database of known
>> compounds (PubChem, UniChem, GDB17), calculate fingerprints and then do a
>> similarity search with my query fingerprint.
>>
>> Hope this helps,
>> Nils
>>
>>
>> On Fri, Apr 20, 2018 at 3:13 PM, Brian Cole  wrote:
>>
>>> Hi Chem-informaticians:
>>>
>>> I know it has been talked about in the community that fingerprints are
>>> not a way to obfuscate molecules for security, but I don't recall a paper
>>> actually demonstrating actual reverse engineering a fingerprint into a
>>> chemical structure. Does anyone know if such a paper exists?
>>>
>>> Code using RDKit to demonstrate the functionality would be an obvious
>>> bonus as well. :-)
>>>
>>> Thanks,
>>> Brian
>>>
>>> 
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>
> 


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread David Cosgrove
Hi Brian,
Dave Weininger once showed a fairly simple GA that could generally deduce a
structure from a Daylight fingerprint by using SMILES strings as the
chromosomes and Tanimoto distance to the target fingerprint as the fitness
function. He may have done a talk about it for MUG or conceivably written
it up. It’d be in JCICS if so, I expect.

You could probably knock up a script to do that in a couple of hours, I
would think, using a GA library to do the mechanics. If you’re not worried
about high efficiency, you don’t need to do anything fancy with mutation
and crossover of the SMILES strings to ensure you always get a valid
molecule; you can just give a fitness of 0 if the SMILES parser doesn’t
like what you give it.
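
A rough sketch of that kind of script, with loud caveats: this is not
Weininger's GA, and it does not use a GA library or a real SMILES parser.
toy_fingerprint() is a hypothetical stand-in that rejects strings with
unbalanced parentheses and hashes character n-grams into 64 bits; with
RDKit one would presumably swap in Chem.MolFromSmiles plus a real
fingerprint there. Fitness is Tanimoto similarity (higher is better), and
strings the "parser" rejects simply score 0.

```python
import random
import zlib

ALPHABET = "CNO=()"  # assumed toy alphabet, not full SMILES


def toy_fingerprint(smiles, n_bits=64):
    """Hypothetical stand-in for 'parse SMILES, then fingerprint'.

    Returns None when the string is empty or has unbalanced parentheses,
    mimicking a parser that rejects the input.
    """
    depth = 0
    for ch in smiles:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return None
    if not smiles or depth != 0:
        return None
    grams = {smiles[i:i + n] for n in (1, 2, 3) for i in range(len(smiles) - n + 1)}
    return frozenset(zlib.crc32(g.encode()) % n_bits for g in grams)


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fitness(smiles, target_fp):
    """Similarity to the target, or 0 if the string does not 'parse'."""
    fp = toy_fingerprint(smiles)
    return 0.0 if fp is None else tanimoto(fp, target_fp)


def mutate(s, rng):
    """Insert, delete or replace one character at a random position."""
    i = rng.randrange(len(s) + 1)
    op = rng.choice(("insert", "delete", "replace"))
    if op == "insert" or not s:
        return s[:i] + rng.choice(ALPHABET) + s[i:]
    i = min(i, len(s) - 1)
    if op == "delete":
        return s[:i] + s[i + 1:]
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]


def crossover(a, b, rng):
    """Single-point crossover: a prefix of a joined to a suffix of b."""
    return a[:rng.randrange(len(a) + 1)] + b[rng.randrange(len(b) + 1):]


def evolve(target_fp, generations=200, pop_size=60, seed=0):
    """Evolve strings toward target_fp; returns (best, score, history)."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice(ALPHABET) for _ in range(8)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda s: fitness(s, target_fp), reverse=True)
        history.append(fitness(pop[0], target_fp))
        parents = pop[: pop_size // 2]  # the fitter half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    best = max(pop, key=lambda s: fitness(s, target_fp))
    return best, fitness(best, target_fp), history


target = toy_fingerprint("CC(=O)NCCO")  # pretend only this fingerprint is known
best, score, history = evolve(target)
print(best, round(score, 3))
```

Because the fitter half is carried over unchanged each generation, the best
score never decreases; whether it reaches an exact match depends on the
fingerprint and the run.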
HTH,
Dave


On Fri, 20 Apr 2018 at 14:45, Nils Weskamp  wrote:

> Hi Brian,
>
> in general, it might be difficult to come up with a deterministic
> algorithm that generates exactly one structure for a given fingerprint due
> to many ambiguities in the process. If you are happy with a more "fuzzy"
> (approximate / probabilistic) approach, you might want to take a look at
>
> https://pubs.acs.org/doi/abs/10.1021/ci600383v
> https://link.springer.com/article/10.1007/s10822-005-9020-4
>
> Given this task, I would probably start with a large database of known
> compounds (PubChem, UniChem, GDB17), calculate fingerprints and then do a
> similarity search with my query fingerprint.
>
> Hope this helps,
> Nils
>
>
> On Fri, Apr 20, 2018 at 3:13 PM, Brian Cole  wrote:
>
>> Hi Chem-informaticians:
>>
>> I know it has been talked about in the community that fingerprints are
>> not a way to obfuscate molecules for security, but I don't recall a paper
>> actually demonstrating actual reverse engineering a fingerprint into a
>> chemical structure. Does anyone know if such a paper exists?
>>
>> Code using RDKit to demonstrate the functionality would be an obvious
>> bonus as well. :-)
>>
>> Thanks,
>> Brian
>>
>>
-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread Nils Weskamp
Hi Brian,

In general, it might be difficult to come up with a deterministic algorithm
that generates exactly one structure for a given fingerprint due to many
ambiguities in the process. If you are happy with a more "fuzzy"
(approximate / probabilistic) approach, you might want to take a look at

https://pubs.acs.org/doi/abs/10.1021/ci600383v
https://link.springer.com/article/10.1007/s10822-005-9020-4

Given this task, I would probably start with a large database of known
compounds (PubChem, UniChem, GDB17), calculate fingerprints and then do a
similarity search with my query fingerprint.
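
A minimal sketch of that lookup, assuming the fingerprints are already
available as sets of on-bit positions. The compound names and bit sets below
are made up for illustration; in practice they would be computed from a
database such as those listed above.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest(query_fp, library, k=3):
    """Rank library entries (name -> on-bits) by similarity to the query."""
    scored = sorted(
        ((tanimoto(query_fp, fp), name) for name, fp in library.items()),
        reverse=True,
    )
    return scored[:k]


# Hypothetical precomputed fingerprints for three compounds.
library = {
    "compound_A": {1, 4, 9, 16},
    "compound_B": {1, 4, 9, 25},
    "compound_C": {2, 3, 5, 7},
}
query = {1, 4, 9, 16}
hits = nearest(query, library, k=2)
print(hits)
```

An exact hit (similarity 1.0) recovers a candidate structure directly, and
the near-misses give starting points for a search.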

Hope this helps,
Nils


On Fri, Apr 20, 2018 at 3:13 PM, Brian Cole  wrote:

> Hi Chem-informaticians:
>
> I know it has been talked about in the community that fingerprints are not
> a way to obfuscate molecules for security, but I don't recall a paper
> actually demonstrating actual reverse engineering a fingerprint into a
> chemical structure. Does anyone know if such a paper exists?
>
> Code using RDKit to demonstrate the functionality would be an obvious
> bonus as well. :-)
>
> Thanks,
> Brian
>
> 


[Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread Brian Cole
Hi Chem-informaticians:

I know it has been talked about in the community that fingerprints are not
a way to obfuscate molecules for security, but I don't recall a paper
actually demonstrating the reverse engineering of a fingerprint into a
chemical structure. Does anyone know if such a paper exists?

Code using RDKit to demonstrate the functionality would be an obvious bonus
as well. :-)

Thanks,
Brian