Re: [Rdkit-discuss] About the original order algorithm of GetMorganFingerprintAsBitVect function

2022-07-25 Thread Nils Weskamp
Dear Yuzhi, a high-level discussion of the various fingerprints in the RDKit can be found at https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf If you want to know more, I would suggest taking a look at Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J.

Re: [Rdkit-discuss] about SMILES

2022-06-13 Thread Nils Weskamp
Dear Jean-Marc, I am not entirely sure I understand what you mean by "insert atom data inside of a chain". There are a number of proprietary extensions of SMILES, such as CXSMILES https://docs.chemaxon.com/display/docs/chemaxon-extended-smiles-and-smarts-cxsmiles-and-cxsmarts.md that

[Rdkit-discuss] Post Doc - Machine Learning in Drug Discovery

2022-05-23 Thread Nils Weskamp
Dear All, at Boehringer Ingelheim, we are currently searching for enthusiastic researchers interested in applying Machine Learning to Drug Discovery, in particular by making use of multi-task learning approaches and by integrating meta-information into the process. Obviously, RDKit will play an

Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?

2022-04-13 Thread Nils Weskamp
Hello Giovanni, have you tried using the ForwardSDMolSupplier with sanitize=False and/or strictParsing=False? This should at least reduce the number of cases where molecules are not accepted. You would then have to sanitize the structures yourself afterwards and handle possible

Re: [Rdkit-discuss] What is the most efficient way to check for exact match with RDKit?

2021-10-05 Thread Nils Weskamp
Dear Theo, it might be useful to describe your specific application scenario a bit more to provide some context. What do you want to do and what would "efficient" look like? One advantage of using InChIKeys is that they have a fixed length and can therefore be stored and indexed efficiently
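
As a rough illustration of the fixed-length point, here is a minimal sketch of an exact-match lookup keyed on InChIKeys. The dict-based index and the helper names (`is_wellformed`, `find_exact`) are assumptions made for this example, not part of RDKit; in practice the keys would be generated from the structures first.

```python
import re

# Illustrative sketch only: the index layout and helper names are assumptions.
# A standard InChIKey has 27 characters: 14 + hyphen + 10 + hyphen + 1.
INCHIKEY_LENGTH = 27
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

# Standard InChIKeys of three well-known compounds, used as dictionary keys.
index = {
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": "ethanol",
    "UHOVQNZJYSORNB-UHFFFAOYSA-N": "benzene",
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "water",
}


def is_wellformed(key: str) -> bool:
    """Check the fixed length and block layout of a candidate key."""
    return len(key) == INCHIKEY_LENGTH and bool(INCHIKEY_PATTERN.match(key))


def find_exact(key: str):
    """Return the stored compound name for an exact key match, else None."""
    if not is_wellformed(key):
        return None
    return index.get(key)


print(find_exact("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
print(find_exact("AAAAAAAAAAAAAA-UHFFFAOYSA-N"))
```

Because every key has the same length, the same idea carries over to a fixed-width, indexed database column.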

Re: [Rdkit-discuss] Explaining bits from Morgan Fingerprints

2021-07-15 Thread Nils Weskamp
Hi gyro, if I understand you correctly, you would like to generate a "fingerprint" completely independent of a molecule (i.e., "out of thin air") and then find out what a corresponding molecule would have to look like? If you are really only interested in a specific bit, I would probably

Re: [Rdkit-discuss] RDKit: generate fingerprints from ZINC database for cluster analysis

2021-06-29 Thread Nils Weskamp
Hi Francesca, technically, it should be possible to read MOL2 files with RDKit (and to convert the structures into SDF, SMILES etc.). I found https://chem-workflows.com/articles/2020/03/23/building-a-multi-molecule-mol2-reader-for-rdkit-v2/ as one example. Having said that, I'm wondering

Re: [Rdkit-discuss] RDKfingerprint function

2021-05-18 Thread Nils Weskamp
d description. Best, Nils On 18.05.2021 at 12:27, דין עזרא wrote: Hi Nils, Thanks for your mail! So if I understand correctly, the function of the fingerprint is related to the topological fingerprint in the document? Thanks, Din On 18 May 2021, 13:10 +0300, Nils Weskamp wrote: There may be

Re: [Rdkit-discuss] RDKfingerprint function

2021-05-18 Thread Nils Weskamp
Hi Din, this (old) presentation from a UGM might be a starting point: Landrum_RDKit_UGM.Fingerprints.Final.pptx There may be more recent sources available. Hope this helps, Nils On Tue, May 18, 2021 at 11:44 AM

Re: [Rdkit-discuss] Some basic questions about binary fingerprints

2021-01-09 Thread Nils Weskamp
Dear Jan, you are probably right. If you have about 2/3 of your 10k bits set to one, doesn't that imply the probability of a collision for any new fragment is roughly 2/3 (which fits the 5 of 7 you observe in your example)? Concerning your second question: Just as any other descriptor,
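
The collision estimate above can be checked with a few lines of arithmetic. This is only a sketch under two assumptions that are not stated in the thread: each new fragment sets exactly one bit, and the hash spreads fragments uniformly over the fingerprint.

```python
def collision_probability(bits_set: int, fp_size: int) -> float:
    """Chance that one new, uniformly hashed fragment hits an already-set bit."""
    return bits_set / fp_size


fp_size = 10_000                    # "10k bits"
bits_set = round(fp_size * 2 / 3)   # "about 2/3" of the bits are set to one

p = collision_probability(bits_set, fp_size)
new_fragments = 7
expected_collisions = new_fragments * p

print(f"collision probability per fragment: {p:.3f}")
print(f"expected collisions among {new_fragments} fragments: {expected_collisions:.2f}")
```

Under these assumptions the expected number of collisions among 7 new fragments is a little under 5, which is in line with the 5 of 7 mentioned in the message.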

Re: [Rdkit-discuss] Polar surface area unit

2021-01-08 Thread Nils Weskamp
Dear Navid, (T)PSA is typically measured in square ångströms (Å^2). You may also want to have a look at https://peter-ertl.com/reprints/Ertl-JMC-43-3714-2000.pdf Best, Nils On 08.01.2021 at 15:46, Navid Shervani-Tabar wrote: Dear RDKiters, I was wondering what is the unit for the

Re: [Rdkit-discuss] From MW to structure

2021-01-07 Thread Nils Weskamp
Dear Stephane, you may want to take a look at this older thread: https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/thread/8b455d6f-7817-5046-1f72-449954132621%40gmx.net/#msg37036275 Starting from just a molecular weight instead of a molecular formula probably does not make things easier. Hope

Re: [Rdkit-discuss] Dragon fingerprints?

2020-11-10 Thread Nils Weskamp
Hello Michal, you are probably referring to the Dragon Descriptors (https://chm.kode-solutions.net/products_dragon.php, now called alvaDesc)? That is a pretty comprehensive set of more than 5,000 descriptors and I would be surprised if someone had (re-)implemented all of them. The closest

Re: [Rdkit-discuss] c++ atomic lifetime

2020-08-27 Thread Nils Weskamp
To add to this: you are looking at the wonderful concept of "undefined behavior" in C/C++. There is no guarantee that your example program will always show the same behavior. In more recent versions of C++, you have access to "smart pointers" like std::shared_ptr, which basically implement

Re: [Rdkit-discuss] Random structure generator based on chemical formula?

2020-06-13 Thread Nils Weskamp
Hi Theo, this kind of structure elucidation is unfortunately not as trivial as it may sound. Not based on RDKit, but maybe still worth looking at: https://pubmed.ncbi.nlm.nih.gov/22985496/ https://sourceforge.net/projects/openmg/ Hope this helps, Nils On 13.06.2020 at 10:54, theozh wrote: >

Re: [Rdkit-discuss] Synthetic Accessibility (SA) score

2020-04-01 Thread Nils Weskamp
Hi Ganesh, I would like to challenge your premise. Why do you think that synthetic accessibility should add up like that? Theoretically, I would expect that combining A, B and C into ABC will require some synthetic effort, so it should be SA(A) + SA(B) + SA(C) < SA(ABC). Technically, the

Re: [Rdkit-discuss] Doing substructure search as quickly as possible...

2020-02-10 Thread Nils Weskamp
Hi Alexis, if you go down that route and calculate artificial skeletons, you could also go all the way and use an algorithm like HierS [1] or the scaffold tree [2] to perform a recursive fragmentation of your queries and molecules into their various rings and ring systems. If a query contains a

Re: [Rdkit-discuss] citation for the daylight fingerprint

2019-02-15 Thread Nils Weskamp
Dear Mario, for the original Daylight fingerprints, I would cite the "Daylight Theory Manual" at http://www.daylight.com/dayhtml/doc/theory/ Best, Nils On Fri, Feb 15, 2019 at 11:48 AM Mario Lovrić wrote: > Dear all, > > I am looking for the original citation of the RDKit fingerprint. > >

Re: [Rdkit-discuss] RDK5 fingerprint

2018-10-04 Thread Nils Weskamp
On 04.10.2018 at 20:53, Thomas Evangelidis wrote: > not sure if significantly longer path lengths (e.g. 12) actually > "increase the amount of information" since they also increase the risk > of bit collisions in folded fingerprints. > > If you increase the fpSize to 8192, won't you

Re: [Rdkit-discuss] RDK5 fingerprint

2018-10-04 Thread Nils Weskamp
swer to my original question. > > Thomas > > > > > On Thu, 4 Oct 2018 at 11:28, Nils Weskamp <nils.wesk...@gmail.com> wrote: > > Hi Thomas, > > is there a particular reason why you want to use the > RDK5-fingerprints? My impres

Re: [Rdkit-discuss] RDK5 fingerprint

2018-10-04 Thread Nils Weskamp
Hi Thomas, is there a particular reason why you want to use the RDK5-fingerprints? My impression was always that circular (Morgan) fingerprints generate better results than the path-oriented RDK-fingerprints. Best, Nils On Thu, Oct 4, 2018 at 11:22 AM Thomas Evangelidis wrote: > Dear RDKit

Re: [Rdkit-discuss] Are atom and bond indexes deterministic?

2018-10-02 Thread Nils Weskamp
Hi Peter, to the best of my knowledge: for a given SMILES string, you should always end up with the same molecule object. On the other hand, generation of (canonical / unique) SMILES often reorders atoms and bonds (to ensure that the SMILES is unique for a given structure). A conversion Molecule

Re: [Rdkit-discuss] optimizing substructure search

2018-08-18 Thread Nils Weskamp
Dear Alexis, the concept you are describing is pretty much exactly the reason why molecular keys / fingerprints were invented in the first place. I would suggest taking a look at the RDKit database cartridge (https://www.rdkit.org/docs/Cartridge.html) since that should basically do what you want

Re: [Rdkit-discuss] Tanimoto Similarity

2018-07-04 Thread Nils Weskamp
Dear Phuong, unfortunately, there is no generic answer to this question since it is highly dependent on the fingerprint, the type of compounds, your specific application and also your chemical intuition. I can only recommend testing a range of different cutoff values and seeing how happy you are
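
Testing a range of cutoffs can be sketched as follows. The fingerprints here are toy sets of "on" bit indices and the names (`query`, `library`, `hits_at`) are hypothetical; real fingerprints would come from RDKit or a similar toolkit.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy data (assumed for illustration): a query and a small library.
query = {1, 4, 7, 9}
library = {
    "mol_a": {1, 4, 7, 9},      # identical to the query
    "mol_b": {1, 4, 7, 12},     # shares 3 of 5 distinct bits
    "mol_c": {1, 20, 21, 22},   # shares 1 of 7 distinct bits
}


def hits_at(cutoff: float) -> list:
    """Names of library entries whose similarity to the query is >= cutoff."""
    return sorted(name for name, fp in library.items()
                  if tanimoto(query, fp) >= cutoff)


# Scan several cutoffs and inspect how the hit list changes.
for cutoff in (0.1, 0.5, 0.7, 0.9):
    print(cutoff, hits_at(cutoff))
```

Looking at how the hit list shrinks as the cutoff rises is one way to judge which value suits a given fingerprint and compound set.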

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Nils Weskamp
Hi Andrew, On 22.04.2018 at 19:35, Andrew Dalke wrote: > I think of what I did here as a bit more elegant than that. ;) I should have looked at the code more carefully before commenting. ;) Nevertheless, you will probably still need many steps for complex structures - although not as many

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Nils Weskamp
On 22.04.2018 at 03:04, Andrew Dalke wrote: > Here's an implementation of that sketch, applied to the RDKit hash > fingerprint: Nice work. If brute-force approaches like this (or methods based on genetic algorithms etc.) are the only way to reverse a fingerprint, one could probably come up with

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread Nils Weskamp
Hi Brian, in general, it might be difficult to come up with a deterministic algorithm that generates exactly one structure for a given fingerprint due to many ambiguities in the process. If you are happy with a more "fuzzy" (approximate / probabilistic) approach, you might want to take a look at

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Nils Weskamp
Hi Michal, I have done this a couple of times for compound sets up to 10M+ using a simplified variant of the Taylor-Butina algorithm. The overall run time was in the range of hours to a few days (which could probably be optimized, but was fast enough for me). As you correctly mentioned, getting
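
For readers unfamiliar with the Taylor-Butina idea, here is a minimal pure-Python sketch. It is not the variant used in the message above (which is not shown), and its all-pairs comparison is quadratic, so it only illustrates the logic and would not scale to millions of compounds. Fingerprints are assumed to be sets of on-bit indices; the function name `butina_like` is hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def butina_like(fps: list, cutoff: float) -> list:
    """Cluster fingerprints; returns lists of indices with the centroid first."""
    n = len(fps)
    # Step 1: neighbour lists (all pairs with similarity >= cutoff).
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= cutoff:
                neighbours[i].add(j)
                neighbours[j].add(i)
    # Step 2: visit candidates by decreasing neighbour count (computed once).
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        # Step 3: the centroid claims all of its still-unassigned neighbours.
        members = [i] + sorted(neighbours[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
       {10, 11, 12}, {10, 11, 13}, {20}]
print(butina_like(fps, cutoff=0.5))
```

Compounds without any neighbour above the cutoff end up as singleton clusters, which is the usual behaviour of this family of algorithms.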

Re: [Rdkit-discuss] RDKit-fingerprints set all bits for complex molecules?

2017-06-02 Thread Nils Weskamp
> 2048).GetNumOffBits() > Out[13]: 786 > > In [14]: Chem.RDKFingerprint(m,maxPath=6,branchedPaths=False,fpSize= > 2048).GetNumOffBits() > Out[14]: 1145 > > In [15]: Chem.RDKFingerprint(m,maxPath=5,branchedPaths=False,fpSize= > 2048).GetNumOffBits() > Out[15]: 1460 >

Re: [Rdkit-discuss] RDKit-fingerprints set all bits for complex molecules?

2017-06-01 Thread Nils Weskamp
and...@gmail.com> wrote: > Hi Nils, > > Can you please send me the SMILES for those structures (or point me to an > easy way to lookup a SCHEMBL id)? > > I will take a look at these, but I don't currently have a convenient copy > of SCHEMBL. > > -greg

Re: [Rdkit-discuss] RDKit-fingerprints set all bits for complex molecules?

2017-06-01 Thread Nils Weskamp
higher > value than the default 2048 to see if you can get one with 0's? > > Cheers, > Bruce > > > > Message: 2 > > Date: Thu, 1 Jun 2017 16:28:40 +0200 > > From: Nils Weskamp <nils.wesk...@gmail.com> > > To: Rdkit-discuss@lists.sourceforge.n

[Rdkit-discuss] RDKit-fingerprints set all bits for complex molecules?

2017-06-01 Thread Nils Weskamp
Dear RDKitters, I just calculated RDKit "Daylight-like" fingerprints for a number of public compound databases and found quite a number of examples where the resulting fingerprints have *all* bits set to 1. This happens in both KNIME 3.2.1 (1024/1/7) and also via the command line (2048/1/7/4) for

Re: [Rdkit-discuss] Fast similarity search

2017-05-18 Thread Nils Weskamp
Hi Tim, according to https://www.knime.org/files/01_greg_landrum.pdf, the PostgreSQL cartridge can compare ~1 million compounds/sec on a single CPU (and this talk is from 2011). ChemFP is much faster if you pre-load all your FPs into main memory. Hope this helps, Nils On 18.05.2017 at 23:15

Re: [Rdkit-discuss] official Tripos MOL2 file format PDF document

2017-04-13 Thread Nils Weskamp
Dear All, does this link help: https://www.yumpu.com/en/document/view/15425101/tripos-mol2-file-format Cheers, Nils On Thu, Apr 13, 2017 at 4:21 PM, Hannes Loeffler wrote: > On Tue, 11 Apr 2017 08:35:53 -0500 > Francois BERENGER