Re: [Rdkit-discuss] RDKit molecule standardization/normalization protocol

2021-06-28 Thread Paolo Tosco
Hi JP,

You are welcome, and thanks a lot for reporting the problem with a
reproducible example!
No need to bother filing a GitHub issue; I have already done that and also
submitted a fix:
https://github.com/rdkit/rdkit/pull/4282

Reionizing is good to make sure that charges are shuffled around if needed
and localized on the most appropriate groups based on their
acidity/basicity.
I normally run the Reionizer as part of the standardization pipeline, even
though in most cases it will not actually do anything to the molecule.

Cheers,
p.

On Mon, Jun 28, 2021 at 10:43 AM JP Ebejer  wrote:

> Hi Paolo!
>
> Nice to hear from you -- and thanks for the lightning-fix+working
> example.  Very helpful as usual.  (I don't imagine you need me to open a
> github issue on this, but I'd be happy to if you think that is helpful/want
> to keep a record).
>
> Any thoughts on whether it is useful to reionize after neutralizing
> charges in the pipeline above?
>
> Many thanks,
>
> On Thu, 24 Jun 2021 at 18:58, Paolo Tosco 
> wrote:
>
>> Hi JP,
>>
>> the problem is caused by the reaction SMARTS that standardizes pyridine
>> *N*-oxides being not very specific and also hitting your molecule, which
>> is not actually an *N*-oxide but rather a *N*-hydroxypyridinium ion.
>> I will submit a PR to fix the reaction pattern; in the meantime you can
>> fix the problem by loading a custom list of normalization reaction SMARTS
>> as shown in this gist:
>>
>> https://gist.github.com/ptosco/2b19142ff8fd6afdfee12836cec73d4f
>>
>> HTH, cheers
>> p.
>>
>> On Thu, Jun 24, 2021 at 11:40 AM JP Ebejer 
>> wrote:
>>
>>> Apologies I took my sweet time to reply, I went down the standardization
>>> rabbit-hole and went through most of the material (thanks Matthew and
>>> Francois, but also links from other notebooks).  The recording of the
>>> OpenScience session is excellent and crystal clear as usual Greg.  I
>>> enjoyed that.
>>>
>>> I have collated code to do the standardization as follows (I am putting
>>> this here, for when my future self searches this list for the same thing in
>>> 6 years time*):
>>>
>>> 0. Cleanup
>>> 1. FragmentParent
>>> 2. Uncharge
>>> 3. Canonicalize Tautomer
>>>
>>> My only question left, is whether I should reionize between steps 2 and
>>> 3.  What do you think?  My opinion is, probably, that there is no harm in
>>> doing so (so I should do it).  Earlier, Greg said that cleanup does
>>> reionization, but perhaps it is worth redoing after the uncharge step?  Or
>>> is this just a waste of CPU cycles?  Any thoughts?
>>>
>>> Also, there is something slightly weird going on.  A (successfully)
>>> sanitized mol from SMILES "Cn1c(=O)c2nc[nH][n+](=O)c2n(C)c1=O", which when
>>> passed to Cleanup(...) starts spitting out can't kekulize errors.  I have
>>> created a jupyter notebook to highlight this;
>>> https://nbviewer.jupyter.org/gist/jp-um/7cd80faa794b3545e8aedf838a1e7f6b.
>>> Any ideas what is going on?  IMHO cleanup should not choke on sanitized
>>> (correct) molecules.  Is there a way to catch when these errors happen?  As
>>> a bonus, FragmentParent(...) on the original sanitized molecule also
>>> exhibits this unexpected behaviour (not shown in the notebook). Could this
>>> be because it's doing an internal cleanup?
>>>
>>> * The exact code is here:
>>> https://bitsilla.com/blog/2021/06/standardizing-a-molecule-using-rdkit/
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 15:08, Greg Landrum 
>>> wrote:
>>>
>>>> Hi JP,
>>>>
>>>> On Thu, Jun 17, 2021 at 8:37 PM JP Ebejer
>>>> wrote:
>>>>
>>>>>
>>>>> I am trying to standardize(/normalize?) some molecules from different
>>>>> sources, to generate a set of descriptors for them.  I have done this a
>>>>> number of times, and each time I find the process slightly confusing.  I
>>>>> have the following questions please, if you don't mind:
>>>>>
>>>>>
>>>> As a starting point in case you want more information about this topic.
>>>> I did a webinar/presentation on this topic earlier this year as part of
>>>> the RSC Open Science series.
>>>>
>>>> My materials for that are in github:
>>>> https://github.com/greglandrum/RSC_OpenScience_Standardization_202104
>>>> and there's a youtube recording:
>>>> https://www.youtube.com/watch?v=eWTApNX8dJQ
>>>>
>>>>
>>>>
>>>>> 1.  What is the relation between molvs and rdkit (I remember there was
>>>>> an integration project between the two a while back).  When I call
>>>>> rdMolStandardize does rdkit code or molvs code get called?  The github repo
>>>>> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
>>>>>
>>>>
>>>> When you call operations from rdMolStandardize it invokes RDKit code.
>>>> That code was started by Susan Leung as a Google Summer of Code project and
>>>> we have continued to improve and expand that code since then.
>>>>
>>>>
>>>>> 2.  What is the difference between standardization and normalization
>>>>> of a molecule?  Does one automatically imply the other or should these two
>>>>> processes be

Re: [Rdkit-discuss] RDKit molecule standardization/normalization protocol

2021-06-28 Thread JP Ebejer
Hi Paolo!

Nice to hear from you -- and thanks for the lightning-fix+working example.
Very helpful as usual.  (I don't imagine you need me to open a github issue
on this, but I'd be happy to if you think that is helpful/want to keep
a record).

Any thoughts on whether it is useful to reionize after neutralizing charges
in the pipeline above?

Many thanks,

On Thu, 24 Jun 2021 at 18:58, Paolo Tosco 
wrote:

> Hi JP,
>
> the problem is caused by the reaction SMARTS that standardizes pyridine
> *N*-oxides being not very specific and also hitting your molecule, which
> is not actually an *N*-oxide but rather a *N*-hydroxypyridinium ion.
> I will submit a PR to fix the reaction pattern; in the meantime you can
> fix the problem by loading a custom list of normalization reaction SMARTS
> as shown in this gist:
>
> https://gist.github.com/ptosco/2b19142ff8fd6afdfee12836cec73d4f
>
> HTH, cheers
> p.
>
> On Thu, Jun 24, 2021 at 11:40 AM JP Ebejer 
> wrote:
>
>> Apologies I took my sweet time to reply, I went down the standardization
>> rabbit-hole and went through most of the material (thanks Matthew and
>> Francois, but also links from other notebooks).  The recording of the
>> OpenScience session is excellent and crystal clear as usual Greg.  I
>> enjoyed that.
>>
>> I have collated code to do the standardization as follows (I am putting
>> this here, for when my future self searches this list for the same thing in
>> 6 years time*):
>>
>> 0. Cleanup
>> 1. FragmentParent
>> 2. Uncharge
>> 3. Canonicalize Tautomer
>>
>> My only question left, is whether I should reionize between steps 2 and
>> 3.  What do you think?  My opinion is, probably, that there is no harm in
>> doing so (so I should do it).  Earlier, Greg said that cleanup does
>> reionization, but perhaps it is worth redoing after the uncharge step?  Or
>> is this just a waste of CPU cycles?  Any thoughts?
>>
>> Also, there is something slightly weird going on.  A (successfully)
>> sanitized mol from SMILES "Cn1c(=O)c2nc[nH][n+](=O)c2n(C)c1=O", which when
>> passed to Cleanup(...) starts spitting out can't kekulize errors.  I have
>> created a jupyter notebook to highlight this;
>> https://nbviewer.jupyter.org/gist/jp-um/7cd80faa794b3545e8aedf838a1e7f6b.
>> Any ideas what is going on?  IMHO cleanup should not choke on sanitized
>> (correct) molecules.  Is there a way to catch when these errors happen?  As
>> a bonus, FragmentParent(...) on the original sanitized molecule also
>> exhibits this unexpected behaviour (not shown in the notebook). Could this
>> be because it's doing an internal cleanup?
>>
>> * The exact code is here:
>> https://bitsilla.com/blog/2021/06/standardizing-a-molecule-using-rdkit/
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 15:08, Greg Landrum 
>> wrote:
>>
>>> Hi JP,
>>>
>>> On Thu, Jun 17, 2021 at 8:37 PM JP Ebejer 
>>> wrote:
>>>

>>>> I am trying to standardize(/normalize?) some molecules from different
>>>> sources, to generate a set of descriptors for them.  I have done this a
>>>> number of times, and each time I find the process slightly confusing.  I
>>>> have the following questions please, if you don't mind:
>>>>
>>>>
>>> As a starting point in case you want more information about this topic.
>>> I did a webinar/presentation on this topic earlier this year as part of
>>> the RSC Open Science series.
>>>
>>> My materials for that are in github:
>>> https://github.com/greglandrum/RSC_OpenScience_Standardization_202104
>>> and there's a youtube recording:
>>> https://www.youtube.com/watch?v=eWTApNX8dJQ
>>>
>>>
>>>
>>>> 1.  What is the relation between molvs and rdkit (I remember there was
>>>> an integration project between the two a while back).  When I call
>>>> rdMolStandardize does rdkit code or molvs code get called?  The github repo
>>>> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
>>>>
>>>
>>> When you call operations from rdMolStandardize it invokes RDKit code.
>>> That code was started by Susan Leung as a Google Summer of Code project and
>>> we have continued to improve and expand that code since then.
>>>
>>>
>>>> 2.  What is the difference between standardization and normalization of
>>>> a molecule?  Does one automatically imply the other or should these two
>>>> processes be both run on a molecule?
>>>>
>>>
>>> I would be surprised if there were universal agreement about this, but
>>> when I use the terms normalization typically refers to making changes to
>>> molecules to get "functional groups" (loosely defined) into a normal form,
>>> while standardization is getting the molecules into a standard form in
>>> preparation for doing something with them. Normalization is often part of
>>> standardization, standardization can also include things like stripping
>>> salts, neutralizing molecules, etc.
>>> Normalization involves applying transformations like converting -N(=O)=O
>>> to -[N+](=O)[O-] and converting -[S+2]([O-])[O-] to -S(=O)=O;
>>>
>>>
>>>> 3.  Specifically, what is the

Re: [Rdkit-discuss] RDKit molecule standardization/normalization protocol

2021-06-24 Thread Paolo Tosco
Hi JP,

The problem is caused by the reaction SMARTS that standardizes pyridine
*N*-oxides: it is not very specific and also hits your molecule, which is not
actually an *N*-oxide but rather an *N*-hydroxypyridinium ion.
I will submit a PR to fix the reaction pattern; in the meantime you can fix
the problem by loading a custom list of normalization reaction SMARTS as
shown in this gist:

https://gist.github.com/ptosco/2b19142ff8fd6afdfee12836cec73d4f
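
For readers who cannot open the gist, the general idea of supplying your own
transformation list can be sketched roughly as below. This is not the content
of the gist. The single rule is the sulfone transformation quoted elsewhere in
this thread, chosen purely for illustration, and the sketch assumes an RDKit
release in which rdMolStandardize.NormalizerFromData is available and accepts
tab-separated "name, reaction SMARTS" lines. A real workaround would list every
default rule you want to keep and leave out (or tighten) the pyridine
N-oxide one.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Illustrative transformation list (an assumption, not the gist's list):
# one "name<TAB>reaction SMARTS" entry per line; lines starting with // are
# treated as comments.
CUSTOM_TRANSFORMS = (
    "//\tName\tSMIRKS\n"
    "Sulfone to S(=O)(=O)\t"
    "[S+2:1]([O-:2])([O-:3])>>[S+0:1](=[O-0:2])(=[O-0:3])\n"
)

# Build a Normalizer that only knows about the rules listed above.
params = rdMolStandardize.CleanupParameters()
normalizer = rdMolStandardize.NormalizerFromData(CUSTOM_TRANSFORMS, params)

# A sulfone drawn in its charge-separated form.
mol = Chem.MolFromSmiles("C[S+2]([O-])([O-])C")
normalized = normalizer.normalize(mol)
print(Chem.MolToSmiles(mol), "->", Chem.MolToSmiles(normalized))
```

Because the normalizer is built only from the supplied text, any default rule
that is not listed is simply never applied.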

HTH, cheers
p.

On Thu, Jun 24, 2021 at 11:40 AM JP Ebejer  wrote:

> Apologies I took my sweet time to reply, I went down the standardization
> rabbit-hole and went through most of the material (thanks Matthew and
> Francois, but also links from other notebooks).  The recording of the
> OpenScience session is excellent and crystal clear as usual Greg.  I
> enjoyed that.
>
> I have collated code to do the standardization as follows (I am putting
> this here, for when my future self searches this list for the same thing in
> 6 years time*):
>
> 0. Cleanup
> 1. FragmentParent
> 2. Uncharge
> 3. Canonicalize Tautomer
>
> My only question left, is whether I should reionize between steps 2 and
> 3.  What do you think?  My opinion is, probably, that there is no harm in
> doing so (so I should do it).  Earlier, Greg said that cleanup does
> reionization, but perhaps it is worth redoing after the uncharge step?  Or
> is this just a waste of CPU cycles?  Any thoughts?
>
> Also, there is something slightly weird going on.  A (successfully)
> sanitized mol from SMILES "Cn1c(=O)c2nc[nH][n+](=O)c2n(C)c1=O", which when
> passed to Cleanup(...) starts spitting out can't kekulize errors.  I have
> created a jupyter notebook to highlight this;
> https://nbviewer.jupyter.org/gist/jp-um/7cd80faa794b3545e8aedf838a1e7f6b.
> Any ideas what is going on?  IMHO cleanup should not choke on sanitized
> (correct) molecules.  Is there a way to catch when these errors happen?  As
> a bonus, FragmentParent(...) on the original sanitized molecule also
> exhibits this unexpected behaviour (not shown in the notebook). Could this
> be because it's doing an internal cleanup?
>
> * The exact code is here:
> https://bitsilla.com/blog/2021/06/standardizing-a-molecule-using-rdkit/
>
>
>
>
> On Fri, 18 Jun 2021 at 15:08, Greg Landrum  wrote:
>
>> Hi JP,
>>
>> On Thu, Jun 17, 2021 at 8:37 PM JP Ebejer 
>> wrote:
>>
>>>
>>> I am trying to standardize(/normalize?) some molecules from different
>>> sources, to generate a set of descriptors for them.  I have done this a
>>> number of times, and each time I find the process slightly confusing.  I
>>> have the following questions please, if you don't mind:
>>>
>>>
>> As a starting point in case you want more information about this topic.
>> I did a webinar/presentation on this topic earlier this year as part of
>> the RSC Open Science series.
>>
>> My materials for that are in github:
>> https://github.com/greglandrum/RSC_OpenScience_Standardization_202104
>> and there's a youtube recording:
>> https://www.youtube.com/watch?v=eWTApNX8dJQ
>>
>>
>>
>>> 1.  What is the relation between molvs and rdkit (I remember there was
>>> an integration project between the two a while back).  When I call
>>> rdMolStandardize does rdkit code or molvs code get called?  The github repo
>>> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
>>>
>>
>> When you call operations from rdMolStandardize it invokes RDKit code.
>> That code was started by Susan Leung as a Google Summer of Code project and
>> we have continued to improve and expand that code since then.
>>
>>
>>> 2.  What is the difference between standardization and normalization of
>>> a molecule?  Does one automatically imply the other or should these two
>>> processes be both run on a molecule?
>>>
>>
>> I would be surprised if there were universal agreement about this, but
>> when I use the terms normalization typically refers to making changes to
>> molecules to get "functional groups" (loosely defined) into a normal form,
>> while standardization is getting the molecules into a standard form in
>> preparation for doing something with them. Normalization is often part of
>> standardization, standardization can also include things like stripping
>> salts, neutralizing molecules, etc.
>> Normalization involves applying transformations like converting -N(=O)=O
>> to -[N+](=O)[O-] and converting -[S+2]([O-])[O-] to -S(=O)=O;
>>
>>
>>> 3.  Specifically, what is the difference between
>>> rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol),
>>> rdMolStandardize.Normalize(mol).  Should I call any of these manually three
>>> after I run "standardization/cleaning operations" such as uncharging,
>>> reionizing, etc?
>>>
>>
>> SanitizeMol() is different from the others: it does a small amount of
>> normalization - fixing groups like nitro which are commonly drawn in a
>> hypervalent state but which can be represented in a charge-separated form
>> without needing weird valences - and some 

Re: [Rdkit-discuss] RDKit molecule standardization/normalization protocol

2021-06-24 Thread JP Ebejer
Apologies I took my sweet time to reply, I went down the standardization
rabbit-hole and went through most of the material (thanks Matthew and
Francois, but also links from other notebooks).  The recording of the
OpenScience session is excellent and crystal clear as usual Greg.  I
enjoyed that.

I have collated code to do the standardization as follows (I am putting
this here, for when my future self searches this list for the same thing in
6 years time*):

0. Cleanup
1. FragmentParent
2. Uncharge
3. Canonicalize Tautomer

My only remaining question is whether I should reionize between steps 2 and 3.
What do you think?  My opinion is that there is probably no harm in doing
so (so I should do it).  Earlier, Greg said that cleanup does reionization,
but perhaps it is worth redoing after the uncharge step?  Or is this just a
waste of CPU cycles?  Any thoughts?
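
A minimal sketch of the four steps listed above, with the reionization in
question as an optional extra step. It is not the code from the blog post
linked below; the function name standardize, the reionize flag and the example
salt are hypothetical and only serve the illustration.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize(mol, reionize=False):
    """Cleanup -> FragmentParent -> Uncharge -> (Reionize) -> canonical tautomer."""
    mol = rdMolStandardize.Cleanup(mol)               # step 0
    mol = rdMolStandardize.FragmentParent(mol)        # step 1
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # step 2
    if reionize:
        # the optional step between 2 and 3 that the question is about
        mol = rdMolStandardize.Reionizer().reionize(mol)
    return rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # step 3


# Hypothetical input: an amine hydrochloride salt.
salt = Chem.MolFromSmiles("C[NH+](C)CCc1ccccc1.[Cl-]")
plain = standardize(salt)
with_reionize = standardize(salt, reionize=True)
print(Chem.MolToSmiles(plain), Chem.MolToSmiles(with_reionize))
```

For a simple input like this one the extra step changes nothing; it only
costs a little time.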

Also, there is something slightly weird going on.  A (successfully)
sanitized mol from the SMILES "Cn1c(=O)c2nc[nH][n+](=O)c2n(C)c1=O" starts
spitting out "can't kekulize" errors when passed to Cleanup(...).  I have
created a jupyter notebook to highlight this:
https://nbviewer.jupyter.org/gist/jp-um/7cd80faa794b3545e8aedf838a1e7f6b
Any ideas what is going on?  IMHO cleanup should not choke on sanitized
(correct) molecules.  Is there a way to catch when these errors happen?  As
a bonus, FragmentParent(...) on the original sanitized molecule also
exhibits this unexpected behaviour (not shown in the notebook). Could this
be because it's doing an internal cleanup?

* The exact code is here:
https://bitsilla.com/blog/2021/06/standardizing-a-molecule-using-rdkit/




On Fri, 18 Jun 2021 at 15:08, Greg Landrum  wrote:

> Hi JP,
>
> On Thu, Jun 17, 2021 at 8:37 PM JP Ebejer  wrote:
>
>>
>> I am trying to standardize(/normalize?) some molecules from different
>> sources, to generate a set of descriptors for them.  I have done this a
>> number of times, and each time I find the process slightly confusing.  I
>> have the following questions please, if you don't mind:
>>
>>
> As a starting point in case you want more information about this topic.
> I did a webinar/presentation on this topic earlier this year as part of
> the RSC Open Science series.
>
> My materials for that are in github:
> https://github.com/greglandrum/RSC_OpenScience_Standardization_202104
> and there's a youtube recording:
> https://www.youtube.com/watch?v=eWTApNX8dJQ
>
>
>
>> 1.  What is the relation between molvs and rdkit (I remember there was an
>> integration project between the two a while back).  When I call
>> rdMolStandardize does rdkit code or molvs code get called?  The github repo
>> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
>>
>
> When you call operations from rdMolStandardize it invokes RDKit code. That
> code was started by Susan Leung as a Google Summer of Code project and we
> have continued to improve and expand that code since then.
>
>
>> 2.  What is the difference between standardization and normalization of a
>> molecule?  Does one automatically imply the other or should these two
>> processes be both run on a molecule?
>>
>
> I would be surprised if there were universal agreement about this, but
> when I use the terms normalization typically refers to making changes to
> molecules to get "functional groups" (loosely defined) into a normal form,
> while standardization is getting the molecules into a standard form in
> preparation for doing something with them. Normalization is often part of
> standardization, standardization can also include things like stripping
> salts, neutralizing molecules, etc.
> Normalization involves applying transformations like converting -N(=O)=O
> to -[N+](=O)[O-] and converting -[S+2]([O-])[O-] to -S(=O)=O;
>
>
>> 3.  Specifically, what is the difference between
>> rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol),
>> rdMolStandardize.Normalize(mol).  Should I call any of these manually three
>> after I run "standardization/cleaning operations" such as uncharging,
>> reionizing, etc?
>>
>
> SanitizeMol() is different from the others: it does a small amount of
> normalization - fixing groups like nitro which are commonly drawn in a
> hypervalent state but which can be represented in a charge-separated form
> without needing weird valences - and some validation - rejecting molecules
> with atoms that have non-physical valences, rejecting molecules that cannot
> be kekulized - and a bunch of chemistry perception - ring finding,
> calculating valences, finding aromatic systems, etc.
>
> rdMolStandardize.Normalize() applies a bunch of standard transformations
> to a molecule.
>
> rdMolStandardize.Cleanup() does a number of standardization operations:
> - removeHs
> - disconnect metal atoms
> - normalize the molecule
> - reionize the molecule
>
>> 4.  I understand what uncharge does, but what does reionizer do?
>>
>
> Reionizing does two things:
> 1. adds a charge to a small set of free atoms which are likely

Re: [Rdkit-discuss] RDKit molecule standardization/normalization protocol

2021-06-22 Thread Francois Berenger

Dear JP,

To confuse you even more, you can also have a look at the ChEMBL 
open-source molecular standardizer:


https://github.com/chembl/ChEMBL_Structure_Pipeline/blob/master/chembl_structure_pipeline/standardizer.py

No need to thank me. :D

On 18/06/2021 03:12, JP Ebejer wrote:

Dear all,

I am trying to standardize(/normalize?) some molecules from different
sources, to generate a set of descriptors for them.  I have done this
a number of times, and each time I find the process slightly
confusing.  I have the following questions please, if you don't mind:

1.  What is the relation between molvs and rdkit (I remember there was
an integration project between the two a while back).  When I call
rdMolStandardize does rdkit code or molvs code get called?  The github
repo for molvs hasn't been updated in a while (2 yrs), but
rdMolStandardize has.
2.  What is the difference between standardization and normalization
of a molecule?  Does one automatically imply the other or should these
two processes be both run on a molecule?
3.  Specifically, what is the difference between
rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol),
rdMolStandardize.Normalize(mol).  Should I call any of these manually
three after I run "standardization/cleaning operations" such as
uncharging, reionizing, etc?
4.  I understand what uncharge does, but what does reionizer do?
5.  Is there a way to chain operations together
standardize+ChooseLargestFragment+uncharge+normalize (am not sure the
order makes sense here), other than creating a class instance for each
calling the method, returning a new mol and using this mol in the next
operation?

Apologies for the many questions.  Have I missed the documentation
about this?  I have found some excellent examples here:
https://github.com/susanhleung/rdkit/blob/dev/GSOC2018_MolVS_Integration/rdkit/Chem/MolStandardize/tutorial/MolStandardize.ipynb
(thanks!).  This is not exactly a cleaning pipeline, but still quite
helpful to understand these methods.

Many thanks,
JP
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss





Re: [Rdkit-discuss] RDKit molecule standardization/normalization protocol

2021-06-18 Thread Greg Landrum
Hi JP,

On Thu, Jun 17, 2021 at 8:37 PM JP Ebejer  wrote:

>
> I am trying to standardize(/normalize?) some molecules from different
> sources, to generate a set of descriptors for them.  I have done this a
> number of times, and each time I find the process slightly confusing.  I
> have the following questions please, if you don't mind:
>
>
As a starting point, in case you want more information about this topic:
I did a webinar/presentation on it earlier this year as part of the
RSC Open Science series.

My materials for that are in github:
https://github.com/greglandrum/RSC_OpenScience_Standardization_202104
and there's a youtube recording:
https://www.youtube.com/watch?v=eWTApNX8dJQ



> 1.  What is the relation between molvs and rdkit (I remember there was an
> integration project between the two a while back).  When I call
> rdMolStandardize does rdkit code or molvs code get called?  The github repo
> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
>

When you call operations from rdMolStandardize it invokes RDKit code. That
code was started by Susan Leung as a Google Summer of Code project and we
have continued to improve and expand that code since then.


> 2.  What is the difference between standardization and normalization of a
> molecule?  Does one automatically imply the other or should these two
> processes be both run on a molecule?
>

I would be surprised if there were universal agreement about this, but when
I use the terms, normalization typically refers to making changes to
molecules to get "functional groups" (loosely defined) into a normal form,
while standardization is getting the molecules into a standard form in
preparation for doing something with them. Normalization is often part of
standardization; standardization can also include things like stripping
salts, neutralizing molecules, etc.
Normalization involves applying transformations like converting -N(=O)=O to
-[N+](=O)[O-] and converting -[S+2]([O-])[O-] to -S(=O)=O.
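
The two transformations just mentioned can be tried out with a small sketch
like the following. The example SMILES are made up for the illustration; note
that the nitro case is already handled by the default sanitization that runs
when a SMILES is parsed, while the sulfone case needs an explicit Normalize().

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Nitro group drawn with a pentavalent N: sanitization during parsing
# rewrites it to the charge-separated form.
nitro = Chem.MolFromSmiles("CN(=O)=O")
print(Chem.MolToSmiles(nitro))

# Charge-separated sulfone: parsing leaves it as drawn, Normalize()
# converts it to the S(=O)=O form.
sulfone = Chem.MolFromSmiles("C[S+2]([O-])([O-])C")
normalized_sulfone = rdMolStandardize.Normalize(sulfone)
print(Chem.MolToSmiles(sulfone), "->", Chem.MolToSmiles(normalized_sulfone))
```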


> 3.  Specifically, what is the difference between
> rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol),
> rdMolStandardize.Normalize(mol).  Should I call any of these manually three
> after I run "standardization/cleaning operations" such as uncharging,
> reionizing, etc?
>

SanitizeMol() is different from the others: it does a small amount of
normalization - fixing groups like nitro which are commonly drawn in a
hypervalent state but which can be represented in a charge-separated form
without needing weird valences - and some validation - rejecting molecules
with atoms that have non-physical valences, rejecting molecules that cannot
be kekulized - and a bunch of chemistry perception - ring finding,
calculating valences, finding aromatic systems, etc.

rdMolStandardize.Normalize() applies a bunch of standard transformations to
a molecule.

rdMolStandardize.Cleanup() does a number of standardization operations:
- removeHs
- disconnect metal atoms
- normalize the molecule
- reionize the molecule
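
As a rough illustration of the difference, the sketch below shows
sanitization acting as validation and Cleanup() acting as a standardization
step. The two input SMILES are made-up examples, not molecules from this
thread.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Validation: a carbon with five bonds fails sanitization, so
# MolFromSmiles() logs an error and returns None.
bad = Chem.MolFromSmiles("CC(C)(C)(C)C")
print(bad is None)

# Standardization: sodium benzoate drawn with a covalent Na-O bond.
# Cleanup() disconnects the metal, leaving two charged fragments.
mol = Chem.MolFromSmiles("[Na]OC(=O)c1ccccc1")
cleaned = rdMolStandardize.Cleanup(mol)
print(Chem.MolToSmiles(mol), "->", Chem.MolToSmiles(cleaned))
print(len(Chem.GetMolFrags(mol)), "->", len(Chem.GetMolFrags(cleaned)))
```

Cleanup() returns a new molecule and leaves its argument untouched, which is
why mol can still be printed afterwards.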

> 4.  I understand what uncharge does, but what does reionizer do?
>

Reionizing does two things:
1. Adds a charge to a small set of free atoms which are likely counterions.
These include Na, Mg, Cl, etc.
1a. If the above added a positive charge: removes an H from an acidic group
to neutralize the positive charge that was added.
2. Moves negative charges from less acidic groups to more acidic groups.
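
Point 2 can be seen in a small sketch like this one. The input is an
illustrative molecule carrying a sulfinate anion next to a neutral sulfonic
acid, so the negative charge starts on the less acidic group; the SMARTS
pattern is only a helper for checking where the charge ends up.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Negative charge on the sulfinate, sulfonic acid still protonated.
mol = Chem.MolFromSmiles("C1=C(C=CC(=C1)[S]([O-])=O)[S](O)(=O)=O")
reionized = rdMolStandardize.Reionizer().reionize(mol)

# Helper pattern: a deprotonated sulfonate group.
sulfonate = Chem.MolFromSmarts("S(=O)(=O)[O-]")
print(mol.HasSubstructMatch(sulfonate), reionized.HasSubstructMatch(sulfonate))
print(Chem.MolToSmiles(mol), "->", Chem.MolToSmiles(reionized))
```

The overall charge of the molecule is unchanged; only the position of the
negative charge moves.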

> 5.  Is there a way to chain operations together
> standardize+ChooseLargestFragment+uncharge+normalize (am not sure the order
> makes sense here), other than creating a class instance for each calling
> the method, returning a new mol and using this mol in the next operation?
>

The easy "pipeline" type functions in rdMolStandardize are the xxxParent
functions.
- fragmentParent: cleanup(), pick largest fragment
- chargeParent: fragmentParent(); uncharge()

Note that this list will be more complete in the 2021.09 release.
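
A short sketch of the two functions listed above; from Python they are
spelled FragmentParent and ChargeParent. The sodium benzoate input is just an
illustrative example.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

mol = Chem.MolFromSmiles("[Na+].[O-]C(=O)c1ccccc1")

# Cleanup plus largest fragment: the benzoate anion, still charged.
frag_parent = rdMolStandardize.FragmentParent(mol)

# Fragment parent plus uncharging: neutral benzoic acid.
charge_parent = rdMolStandardize.ChargeParent(mol)

print(Chem.MolToSmiles(frag_parent), Chem.MolToSmiles(charge_parent))
print(Chem.GetFormalCharge(frag_parent), Chem.GetFormalCharge(charge_parent))
```

Each call starts from the original molecule, so there is no need to chain
the intermediate results by hand.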


>
> Apologies for the many questions.  Have I missed the documentation about
> this?  I have found some excellent examples here:
> https://github.com/susanhleung/rdkit/blob/dev/GSOC2018_MolVS_Integration/rdkit/Chem/MolStandardize/tutorial/MolStandardize.ipynb
> (thanks!).  This is not exactly a cleaning pipeline, but still quite
> helpful to understand these methods.
>
>
The github link I provided above has some more up-to-date information about
what the code currently does.
This all needs to land in the RDKit documentation.

-greg


Re: [Rdkit-discuss] RDKit molecule standardization/normalization protocol

2021-06-17 Thread Matthew Robinson
Hi JP,

Lots of good questions, and it is quite an involved topic.

I'll let others who are more knowledgeable about the background answer
questions on the history and relationship between the tools.

One resource that may be helpful is the
https://github.com/chembl/ChEMBL_Structure_Pipeline repo, which calls many
of the functions you mentioned. Looking into the code explains the order of
steps quite well. It also has an open access article linked in the README
that explains at least how one group (ChEMBL) handles the process:
https://doi.org/10.1186/s13321-020-00456-1

Best,
Matt

On Thu, Jun 17, 2021 at 2:37 PM JP Ebejer  wrote:

> Dear all,
>
> I am trying to standardize(/normalize?) some molecules from different
> sources, to generate a set of descriptors for them.  I have done this a
> number of times, and each time I find the process slightly confusing.  I
> have the following questions please, if you don't mind:
>
> 1.  What is the relation between molvs and rdkit (I remember there was an
> integration project between the two a while back).  When I call
> rdMolStandardize does rdkit code or molvs code get called?  The github repo
> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
> 2.  What is the difference between standardization and normalization of a
> molecule?  Does one automatically imply the other or should these two
> processes be both run on a molecule?
> 3.  Specifically, what is the difference between
> rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol),
> rdMolStandardize.Normalize(mol).  Should I call any of these manually three
> after I run "standardization/cleaning operations" such as uncharging,
> reionizing, etc?
> 4.  I understand what uncharge does, but what does reionizer do?
> 5.  Is there a way to chain operations together
> standardize+ChooseLargestFragment+uncharge+normalize (am not sure the order
> makes sense here), other than creating a class instance for each calling
> the method, returning a new mol and using this mol in the next operation?
>
> Apologies for the many questions.  Have I missed the documentation about
> this?  I have found some excellent examples here:
> https://github.com/susanhleung/rdkit/blob/dev/GSOC2018_MolVS_Integration/rdkit/Chem/MolStandardize/tutorial/MolStandardize.ipynb
> (thanks!).  This is not exactly a cleaning pipeline, but still quite
> helpful to understand these methods.
>
> Many thanks,
> JP


[Rdkit-discuss] RDKit molecule standardization/normalization protocol

2021-06-17 Thread JP Ebejer
Dear all,

I am trying to standardize(/normalize?) some molecules from different
sources, to generate a set of descriptors for them.  I have done this a
number of times, and each time I find the process slightly confusing.  I
have the following questions please, if you don't mind:

1.  What is the relation between molvs and rdkit (I remember there was an
integration project between the two a while back).  When I call
rdMolStandardize does rdkit code or molvs code get called?  The github repo
for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
2.  What is the difference between standardization and normalization of a
molecule?  Does one automatically imply the other or should these two
processes be both run on a molecule?
3.  Specifically, what is the difference between
rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol), and
rdMolStandardize.Normalize(mol)?  Should I call any of these three manually
after I run "standardization/cleaning operations" such as uncharging,
reionizing, etc?
4.  I understand what uncharge does, but what does reionizer do?
5.  Is there a way to chain operations together
standardize+ChooseLargestFragment+uncharge+normalize (am not sure the order
makes sense here), other than creating a class instance for each, calling
the method, returning a new mol and using this mol in the next operation?

Apologies for the many questions.  Have I missed the documentation about
this?  I have found some excellent examples here:
https://github.com/susanhleung/rdkit/blob/dev/GSOC2018_MolVS_Integration/rdkit/Chem/MolStandardize/tutorial/MolStandardize.ipynb
(thanks!).  This is not exactly a cleaning pipeline, but still quite
helpful to understand these methods.

Many thanks,
JP