Re: [Rdkit-discuss] use cases for weighted sampling of a compound library

2022-12-12 Thread Stephen Pickett via Rdkit-discuss
The combinatorial chemistry design literature has a lot on this topic, i.e. 
designing the library to match a desired profile.
Here is one to get started: https://pubs.acs.org/doi/full/10.1021/ci980332b
Gillet et al. J. Chem. Inf. Comput. Sci. 1999, 39, 1, 169-177. In this case 
the optimisation target is the RMS difference to the desired profile(s).
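
To make the idea of matching a desired profile concrete, here is a minimal sketch. It is not the method from the paper cited above; it is simple importance weighting with replacement, and the bin edges, property values, and target profile are all made up for illustration. It weights each library compound by (target fraction / library fraction) for its bin, draws a weighted sample, and reports the RMS difference to the target before and after.

```python
import math
import random


def bin_index(value, edges):
    """Index of the half-open bin [edges[i], edges[i+1]) holding value, else None."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return None


def profile(values, edges):
    """Fraction of values in each bin (values outside all bins are ignored)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bin_index(v, edges)
        if i is not None:
            counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def rms(observed, target):
    """Root-mean-square difference between two profiles of equal length."""
    return math.sqrt(
        sum((o - t) ** 2 for o, t in zip(observed, target)) / len(target)
    )


# Hypothetical data: one property value (e.g. molecular weight) per compound.
edges = [100, 200, 300, 400, 500, 600]
library = [150] * 100 + [250] * 200 + [350] * 300 + [450] * 300 + [550] * 100
target = [0.1, 0.3, 0.4, 0.2, 0.0]  # desired fraction per bin (assumed)

lib_profile = profile(library, edges)

# Weight of a compound = target fraction / library fraction of its bin.
weights = [
    target[bin_index(v, edges)] / lib_profile[bin_index(v, edges)]
    for v in library
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.choices(library, weights=weights, k=500)  # with replacement

rms_library = rms(lib_profile, target)
rms_sample = rms(profile(sample, edges), target)
print("RMS to target, whole library: {:.4f}".format(rms_library))
print("RMS to target, weighted sample: {:.4f}".format(rms_sample))
```

For a real library you would probably want sampling without replacement and more than one property, at which point an optimiser over the selected subset becomes the more natural tool.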

Stephen

-Original Message-

Message: 3
Date: Sun, 11 Dec 2022 13:58:41 -0600
From: Rocco Moretti 
To: Christopher Mayer-Bacon 
Cc: RDKit Discuss 
Subject: Re: [Rdkit-discuss] use cases for weighted sampling of a
compound library

The use case for this sort of thing which immediately springs to mind would be 
decoy selection. That is, you have a known set of "positives" and want to find 
a set of "negatives"/"background" which match those compounds in some set of 
properties. DUD-E 
is probably the most well-known example of this, but it's an approach which 
has been tried numerous times on both a formal and an ad hoc basis.
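
As a rough illustration of property-matched decoy selection, here is a small sketch. It is not the DUD-E protocol (which, among other things, also enforces topological dissimilarity to the actives); the compound names, property values, and tolerances are hypothetical. For each active it picks the closest unused candidates whose properties all fall within a tolerance window.

```python
def pick_decoys(actives, pool, tolerances, per_active=2):
    """For each active, pick up to `per_active` unused pool compounds whose
    properties all lie within `tolerances` of the active's properties."""
    used = set()
    chosen = {}
    for name, props in actives.items():
        matches = []
        for cand, cand_props in pool.items():
            if cand in used:
                continue
            diffs = {k: abs(cand_props[k] - props[k]) for k in tolerances}
            if all(diffs[k] <= tolerances[k] for k in tolerances):
                # Rank candidates by the sum of tolerance-scaled differences.
                score = sum(diffs[k] / tolerances[k] for k in tolerances)
                matches.append((score, cand))
        matches.sort()
        picked = [cand for _, cand in matches[:per_active]]
        used.update(picked)
        chosen[name] = picked
    return chosen


# Hypothetical precomputed properties (molecular weight, logP).
actives = {
    "active_1": {"mw": 300.0, "logp": 2.0},
    "active_2": {"mw": 450.0, "logp": 4.0},
}
pool = {
    "cand_a": {"mw": 310.0, "logp": 2.2},
    "cand_b": {"mw": 295.0, "logp": 1.5},
    "cand_c": {"mw": 460.0, "logp": 3.8},
    "cand_d": {"mw": 200.0, "logp": 0.5},
    "cand_e": {"mw": 320.0, "logp": 3.5},
}
tolerances = {"mw": 25.0, "logp": 0.5}

decoys = pick_decoys(actives, pool, tolerances)
print(decoys)
```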

These days you're most likely to see people attempting it in machine 
learning, to try to get around the "positive/unlabeled" nature of most small 
molecule datasets. That is, often it's easy to get sets of "positive"
compounds from the literature/etc., but trying to get a set of known negative 
compounds is sometimes difficult. People attempt to find synthetic negatives by 
using matched-property decoy selection from an external compound set. However, 
the literature on this is ... less than flattering.
It turns out that even if you're careful, your negative selection method can 
still be biased and the ML method can pick up on this (see, e.g., 
https://doi.org/10.1021/acs.jcim.8b00712). This is similar to the tales of 
image recognition software which uses the 
presence of grass to tell if something is a cow or not, or which fails because 
all the pictures of one kind of tank were taken on a cloudy day. ML is very 
good at picking up such small differences, even if you don't know what those 
actually are.

On Sun, Dec 11, 2022 at 11:25 AM Christopher Mayer-Bacon 
wrote:

> Hello all,
>
> I'm starting a project that explores the sampling of a large compound
> library.  My question is not so much about how to do something, but
> rather the specific use cases for weighted sampling from a compound library.
>
> Given a large compound library and a smaller, reference library, I
> want to take random samples from the large library such that the
> samples resemble the reference library in some way.  At the moment I'm
> focused on element composition (% of carbon atoms, % of oxygen atoms,
> etc.), but I'm open to using other features in the future.
>
> I have an idea of how to perform this sampling; my question for this
> community concerns a possible use case.  What would be the benefit of
> sampling from a compound library such that the samples resemble
> another library in some way?  I can think of a use case for my
> specific research niche (adaptive properties of the canonical amino
> acid alphabet), but I can't think of another potential use case.  I
> know the RDKit community has a wide variety of backgrounds and
> expertise, hence why I wanted to pose this question to you all.
>
> -Chris
>
> --
> -Christopher Mayer-Bacon (*he/him/his*) PhD student Department of
> Biological Sciences University of Maryland, Baltimore County
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

End of Rdkit-discuss Digest, Vol 182, Issue 6
GSK monitors email communications sent to and from GSK in order to protect GSK, 
our employees, customers, suppliers and business partners, from cyber threats 
and loss of GSK Information. GSK monitoring is conducted with appropriate 
confidentiality controls and in accordance with local laws and after 
appropriate consultation.


[Rdkit-discuss] DeleteSubstructs issue

2021-06-03 Thread Stephen Pickett via Rdkit-discuss
Hi

There appears to be an issue with the DeleteSubstructs method when deleting 
groups from an aromatic N. The H count is not reset properly, leading to a 
kekulization error when the result is sanitized.
The workaround is to kekulize the molecule first. Of course, this means any 
SMARTS-based substructure queries that touch the ring have to be written 
against the Kekule form.

Here is an example

import rdkit
from rdkit import Chem

print(rdkit.__version__)
smiles = 'c1cccn1C'  # N-methylpyrrole

# Case 1: kekulize first, then delete the methyl group; sanitization succeeds
mol = Chem.MolFromSmiles(smiles)
Chem.Kekulize(mol, clearAromaticFlags=True)
sub = Chem.MolFromSmarts('[CH3]')
newmol = Chem.rdmolops.DeleteSubstructs(mol, sub)
Chem.SanitizeMol(newmol)
print("1: {}".format(Chem.MolToSmiles(newmol)))

# Case 2: delete from the aromatic form; the ring N is left without an H
mol = Chem.MolFromSmiles(smiles)
sub = Chem.MolFromSmarts('[CH3]')
newmol = Chem.rdmolops.DeleteSubstructs(mol, sub)
print("2: {}".format(Chem.MolToSmiles(newmol)))
Chem.SanitizeMol(newmol)  # raises KekulizeException
print("3: {}".format(Chem.MolToSmiles(newmol)))

With output

2021.03.2
1: c1cc[nH]c1
2: c1ccnc1
[09:50:41] Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 4

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    Chem.SanitizeMol(newmol)
rdkit.Chem.rdchem.KekulizeException: Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 4
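
A possible alternative to kekulizing first, sketched under assumptions: remove the matched atoms by hand on an RWMol and give each aromatic N that loses a substituent one explicit H before sanitizing. The helper name delete_and_fix_aromatic_n is hypothetical (it is not part of RDKit), and the sketch only handles aromatic nitrogen, replacing each deleted substituent with a hydrogen.

```python
from rdkit import Chem


def delete_and_fix_aromatic_n(mol, query):
    """Delete all atoms matching `query`; any aromatic N that loses a
    neighbour gets one explicit H so the ring can still be kekulized.
    Hypothetical helper, not an RDKit function."""
    rw = Chem.RWMol(mol)

    # Collect every atom index covered by a match of the query.
    to_remove = set()
    for match in rw.GetSubstructMatches(query):
        to_remove.update(match)

    # Before removing anything, patch the H count on surviving aromatic N
    # neighbours (one extra H per deleted neighbour).
    for idx in to_remove:
        for nbr in rw.GetAtomWithIdx(idx).GetNeighbors():
            if nbr.GetIdx() in to_remove:
                continue
            if nbr.GetIsAromatic() and nbr.GetAtomicNum() == 7:
                nbr.SetNumExplicitHs(nbr.GetNumExplicitHs() + 1)
                nbr.SetNoImplicit(True)

    # Remove from the highest index down so remaining indices stay valid.
    for idx in sorted(to_remove, reverse=True):
        rw.RemoveAtom(idx)

    Chem.SanitizeMol(rw)
    return rw.GetMol()


mol = Chem.MolFromSmiles('c1cccn1C')
sub = Chem.MolFromSmarts('[CH3]')
fixed = delete_and_fix_aromatic_n(mol, sub)
print(Chem.MolToSmiles(fixed))
```

The advantage over kekulizing first is that the SMARTS query can stay in aromatic form; the cost is that the H bookkeeping is done manually.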


Thanks

Stephen


This e-mail was sent by GlaxoSmithKline Services Unlimited
(registered in England and Wales No. 1047315), which is a
member of the GlaxoSmithKline group of companies. The
registered address of GlaxoSmithKline Services Unlimited
is 980 Great West Road, Brentford, Middlesex TW8 9GS.