Re: [Rdkit-discuss] Folding count vectors

2019-12-02 Thread Benjamin Datko
Hi Francois,

I agree with your suggestion. I am also CCing Greg on this response.

I have tried to look around on google for viewing the source code of the
CreateDifferenceFingerprintForReaction method but the most relevant pages I
can find describing what the code does are [here](
https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html)
and [here](
https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction
)

I don't mind if the source is only in C++ but where can I find it? If I can
view the source code I could understand how folding a count vector was
implemented. As of right now I am assuming the implementation is similar
to folding a bit vector just applying a SUM instead of a logical OR.

v/r,

Ben

On Wed, Nov 20, 2019 at 3:23 AM Francois Berenger  wrote:

> On 20/11/2019 02:00, Benjamin Datko wrote:
> > Hello Francois,
> >
> > I am trying to replicate some of the functionality of
> > CreateDifferenceFingerprintForReaction [Ref 1] for my own
> > understanding on how the code works. The function
> > CreateDifferenceFingerprintForReaction allows for three difference
> > fingerprint representation of the molecules: AtomPair, Morgan, and
> > TopologicalTorsion [Ref 2]. All three are count vectors [Ref 3], and
> > the function allows for variable fingerprint size output.
>
> Personally, I wouldn't try to fold a count vector.
> They are sparse vectors, so they don't take a lot of memory.
> Also, they are less information lossy than binary fingerprints.
>
> But, maybe Greg has some hack around, if you are really forced to do
> this.
>
> > I was following this post [Ref 4] describing how to create reaction
> > difference fingerprints using different fingerprints representation.
> > Using the code from the post I can create reaction difference
> > fingerprints using either Morgan or AtomPair, but comparing the output
> > from the post [Ref 4] to CreateDifferenceFingerprintForReaction
> > results in different size fingerprints, with different values within
> > the fingerprint, and different densities. I am assuming this due to
> > folding the count vector down to the default fingerprint size of 2048.
> >
> >
> > Example code snippet:
> >
> > # The below defs are from the post
> > https://sourceforge.net/p/rdkit/mailman/message/35240736/
> >
> > from rdkit import Chem
> > from rdkit.Chem import AllChem
> > from rdkit import DataStructs
> > import copy
> >
> > def _createFP(mol,maxSize,fpType='AP'):
> > mol.UpdatePropertyCache(False)
> > if fpType == 'AP':
> > return AllChem.GetAtomPairFingerprint(mol, minLength=1,
> > maxLength=maxSize)
> > else:
> > Chem.GetSSSR(mol)
> > rinfo = mol.GetRingInfo()
> > return AllChem.GetMorganFingerprint(mol, radius=maxSize)
> >
> > def getSumFps(fps):
> > summedFP = copy.deepcopy(fps[0])
> > for fp in fps[1:]:
> > summedFP += fp
> > return summedFP
> >
> > def buildReactionFP(rxn, maxSize=3, fpType='AP'):
> > reactants = rxn.GetReactants()
> > products = rxn.GetProducts()
> > rFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in
> > reactants])
> > pFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in
> > products])
> > return pFP-rFP
> >
> >>>> rxn1 = AllChem.ReactionFromSmarts( '[C:1]C1C1>>[N:1]C1C1'
> > , useSmiles=True)
> >
> >>>> rxfp1 = buildReactionFP(rxn1,maxSize=2)
> >
> >>>> rxfp1.GetNonzeroElements()
> > {558114: -2, 574497: -1, 1066050: 2, 1066081: 1}
> >
> >>>> rxfp1.GetLength()
> > 8388608
> >
> > # Same reaction now using CreateDifferenceFingerprintForReaction
> >>>> rxn1_fp = AllChem.CreateDifferenceFingerprintForReaction(rxn1)
> >
> >>>> rxn1_fp.GetNonzeroElements()
> >
> > {1048: 10,
> >  1310: -20,
> >  1325: 20,
> >  1372: -10,
> >  1390: 20,
> >  1692: -10,
> >  1757: -20,
> >  1772: 10}
> >
> >>>> print(rxn1_fp.GetLength(),rxfp1.GetLength())
> > 2048 8388608
> >
> > References
> > 1.
> >
> https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction
> > 2.
> >
> https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html
> > 3.
> >
> https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints

Re: [Rdkit-discuss] Atom pairs, pi-electrons Imidazole

2019-12-02 Thread Benjamin Datko
Hi Greg,

My goal when bringing up the example of imidazole was to determine what the
AtomPair pi-electron number was counting, and test my understanding after
reviewing the [original paper](https://pubs.acs.org/doi/10.1021/ci00046a002).
If the AtomPair implementation was counting the number of pi-electron each
atom was contributing, then I thought inputing imidazole would show one and
two for the nitrogens. Once I found that discrepancy I asked the question.

(you could consider that it's really counting the number of orbitals
> involved in pi bonding).


Ah, knowing this the convention now makes interrupting the AtomPairs and
Topological Torsions easier. Maybe this should be documented within the
module itself?

v/r,

Ben



On Mon, Dec 2, 2019 at 4:40 AM Greg Landrum  wrote:

>
> The functions in the AtomPairs module are really intended to be used when
> calculating the Atom Pairs and Topological Torsions fingerprints. For those
> fingerprints the RDKit considers atoms in aromatic rings to have a single
> pi electron (you could consider that it's really counting the number of
> orbitals involved in pi bonding).
>
> Are you looking to calculate the number of electrons in a pi system or
> something like that? Since that's not currently immediately possible
> (though that's fixable), I'm curious why you want to do that.
>
> -greg
>
>
>
> On Fri, Nov 22, 2019 at 5:08 AM Benjamin Datko <
> benjamin.datko@gmail.com> wrote:
>
>> Hi all,
>>
>> When counting the number of pi-electrons in Imidazole, I would expect a
>> total of 6 pi-electrons, and the contribution of pi-electrons from the
>> nitrogen atoms showing 2 and 1.
>>
>> But both pyScorePair and NumPiElectrons within AtomPairs only show 5
>> pi-electrons, missing the pi-electron from one of the nitrogens. What am
>> I missing?
>>
>> >>> from rdkit import rdBase
>> >>> rdBase.rdkitVersion,rdBase.boostVersion
>>
>> ('2019.03.4', '1_70')
>>
>> >>> from rdkit import Chem
>> >>> from rdkit.Chem.AtomPairs import Pairs
>>
>> >>> m = Chem.MolFromSmiles('C1=CN=CN1')
>> >>> score = Pairs.pyScorePair(m.GetAtomWithIdx(2),m.GetAtomWithIdx(4),3)
>> >>> Pairs.ExplainPairScore(score)
>> (('N', 2, 1), 3, ('N', 2, 1))
>>
>> >>> [Chem.AtomPairs.Utils.NumPiElectrons(atom) for atom in m.GetAtoms()]
>>  [1, 1, 1, 1, 1]
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Atom pairs, pi-electrons Imidazole

2019-11-24 Thread Benjamin Datko
Hello all,

Digging around within the documentation the imidazole example I shared is
yet another example on how RDKit defines aromaticity. I know there might be
a lot of nuance here since RDKit already makes it clear aromaticity is
difficult, but here is what I don't understand.

RDKit gives examples of their aromaticity model, found [here](
https://www.rdkit.org/docs/RDKit_Book.html#the-rdkit-aromaticity-model).
Applying the examples given in the table possibly would explain the
behavior I am observing, but the nomenclature is going over my head.

For example the table shows an atomic environment 'c(a)a' contributing one
pi-electron. Breaking down the terms, 'a' is any aromatic atom (this is
given in the footnote of the table), and I assume 'c' is a carbon atom.
Using the daylight nomenclature I would read the parenthesized expression
as a branch containing any aromatic atom. Is that correct?

Also, if the atomic environment of 'c(a)a' is showing at most three atoms
(one carbon, one branching aromatic atom, and then another aromatic atom),
which atom is contributing the single pi-electron.

If anyone had any thoughts on how to read the table, it would be
appreciated.

I didn't have any responses on my last email, but I thought I would add
what I dug up. Hopping it might help.

I thought the AtomParis might be using another flavor of aromaticity, such
as [MDL](
https://www.rdkit.org/docs/RDKit_Book.html#the-mdl-aromaticity-model),
where five-member rings are never considered aromatic. This doesn't seem to
be the case since every atom is aromatic, and every bond in the ring is
aromatic. I found the isRingAromatic from [Greg](
https://sourceforge.net/p/rdkit/mailman/message/23801106/)

>>> from rdkit import Chem

>>> def isRingAromatic(mol,bondRing):

...  for id in bondRing:

... if not mol.GetBondWithIdx(id).GetIsAromatic():

... return False

...  return True

...

>>> m = Chem.MolFromSmiles('C1=CN=CN1')

>>> ri = m.GetRingInfo()

>>> isRingAromatic(m,ri.BondRings()[0])

True

>>> [m.GetBondWithIdx(id).GetIsAromatic() for id in range(0,5)]

[True, True, True, True, True]



v/r,

Ben

On Thu, Nov 21, 2019 at 11:06 PM Benjamin Datko <
benjamin.datko@gmail.com> wrote:

> Hi all,
>
> When counting the number of pi-electrons in Imidazole, I would expect a
> total of 6 pi-electrons, and the contribution of pi-electrons from the
> nitrogen atoms showing 2 and 1.
>
> But both pyScorePair and NumPiElectrons within AtomPairs only show 5
> pi-electrons, missing the pi-electron from one of the nitrogens. What am
> I missing?
>
> >>> from rdkit import rdBase
> >>> rdBase.rdkitVersion,rdBase.boostVersion
>
> ('2019.03.4', '1_70')
>
> >>> from rdkit import Chem
> >>> from rdkit.Chem.AtomPairs import Pairs
>
> >>> m = Chem.MolFromSmiles('C1=CN=CN1')
> >>> score = Pairs.pyScorePair(m.GetAtomWithIdx(2),m.GetAtomWithIdx(4),3)
> >>> Pairs.ExplainPairScore(score)
> (('N', 2, 1), 3, ('N', 2, 1))
>
> >>> [Chem.AtomPairs.Utils.NumPiElectrons(atom) for atom in m.GetAtoms()]
>  [1, 1, 1, 1, 1]
>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Atom pairs, pi-electrons Imidazole

2019-11-21 Thread Benjamin Datko
Hi all,

When counting the number of pi-electrons in Imidazole, I would expect a
total of 6 pi-electrons, and the contribution of pi-electrons from the
nitrogen atoms showing 2 and 1.

But both pyScorePair and NumPiElectrons within AtomPairs only show 5
pi-electrons, missing the pi-electron from one of the nitrogens. What am
I missing?

>>> from rdkit import rdBase
>>> rdBase.rdkitVersion,rdBase.boostVersion

('2019.03.4', '1_70')

>>> from rdkit import Chem
>>> from rdkit.Chem.AtomPairs import Pairs

>>> m = Chem.MolFromSmiles('C1=CN=CN1')
>>> score = Pairs.pyScorePair(m.GetAtomWithIdx(2),m.GetAtomWithIdx(4),3)
>>> Pairs.ExplainPairScore(score)
(('N', 2, 1), 3, ('N', 2, 1))

>>> [Chem.AtomPairs.Utils.NumPiElectrons(atom) for atom in m.GetAtoms()]
 [1, 1, 1, 1, 1]
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Folding count vectors

2019-11-19 Thread Benjamin Datko
Hello Francois,

I am trying to replicate some of the functionality of
CreateDifferenceFingerprintForReaction [Ref 1] for my own understanding on
how the code works. The function CreateDifferenceFingerprintForReaction
allows for three difference fingerprint representation of the molecules:
AtomPair, Morgan, and TopologicalTorsion [Ref 2]. All three are count
vectors [Ref 3], and the function allows for variable fingerprint size
output.

I was following this post [Ref 4] describing how to create reaction
difference fingerprints using different fingerprints representation. Using
the code from the post I can create reaction difference fingerprints using
either Morgan or AtomPair, but comparing the output from the post [Ref 4]
to CreateDifferenceFingerprintForReaction results in different size
fingerprints, with different values within the fingerprint, and different
densities. I am assuming this due to folding the count vector down to
the default fingerprint size of 2048.

Example code snippet:

# The below defs are from the post
https://sourceforge.net/p/rdkit/mailman/message/35240736/
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
import copy

def _createFP(mol,maxSize,fpType='AP'):
mol.UpdatePropertyCache(False)
if fpType == 'AP':
return AllChem.GetAtomPairFingerprint(mol, minLength=1,
maxLength=maxSize)
else:
Chem.GetSSSR(mol)
rinfo = mol.GetRingInfo()
return AllChem.GetMorganFingerprint(mol, radius=maxSize)

def getSumFps(fps):
summedFP = copy.deepcopy(fps[0])
for fp in fps[1:]:
summedFP += fp
return summedFP

def buildReactionFP(rxn, maxSize=3, fpType='AP'):
reactants = rxn.GetReactants()
products = rxn.GetProducts()
rFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in
reactants])
pFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in
products])
return pFP-rFP

>>> rxn1 = AllChem.ReactionFromSmarts( '[C:1]C1C1>>[N:1]C1C1' ,
useSmiles=True)
>>> rxfp1 = buildReactionFP(rxn1,maxSize=2)

>>> rxfp1.GetNonzeroElements()
{558114: -2, 574497: -1, 1066050: 2, 1066081: 1}

>>> rxfp1.GetLength()
8388608


# Same reaction now using CreateDifferenceFingerprintForReaction
>>> rxn1_fp = AllChem.CreateDifferenceFingerprintForReaction(rxn1)

>>> rxn1_fp.GetNonzeroElements()
{1048: 10,
 1310: -20,
 1325: 20,
 1372: -10,
 1390: 20,
 1692: -10,
 1757: -20,
 1772: 10}

>>> print(rxn1_fp.GetLength(),rxfp1.GetLength())
2048 8388608

References
1.
https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction
2.
https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html
3.
https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints
4. https://sourceforge.net/p/rdkit/mailman/message/35240736/

v/r,

Ben

On Mon, Nov 18, 2019 at 10:13 PM Francois Berenger  wrote:

> On 19/11/2019 03:34, Benjamin Datko wrote:
> > Hello all,
> >
> > I am curious on how to fold a count vector fingerprint. I understand
> > when folding bit vectors the most common way is to split the vector in
> > half, and apply a bitwise OR operation. I think this is how the
> > function rdkit.DataStructs.FoldFingerprint works in RDKit, correct me
> > if I am wrong.
> >
> > How does RDKit and or what is the appropriate way to fold count
> > vectors such as AtomPair, Morgan, and Topological torsion?
>
> Can you give us some context? Why do you want to do that?
>
> Maybe, you can use the following in order to create
> shorter "fingerprints" for which the Tanimoto distance is
> still computable (despite becoming approximate then):
>
> ---
> Shrivastava, A. (2016).
> Simple and efficient weighted minwise hashing.
> In Advances in Neural Information Processing Systems (pp. 1498-1506).
>
>
> https://papers.nips.cc/paper/6472-simple-and-efficient-weighted-minwise-hashing.pdf
> ---
>
> Regards,
> F.
>
> > I thought about turning the fingerprint into a bit vector using their
> > respected "AsBitVect" method then folding using
> > rdkit.DataStructs.FoldFingerprint, but topological torsion doesn't
> > have a "AsBitVect" method
> > [https://www.rdkit.org/docs/GettingStartedInPython.html].
> >
> > For an explicit example using AtomPair fingerprint we can see the
> > fingerprint is extremely sparse. Could this AtomPair fingerprint be
> > folded to increase the density?
> >
> >>>> from rdkit import Chem
> >
> >>>> from rdkit.Chem import AllChem
> >
> >>>> mol = Chem.MolFromSmiles('CC1C1')
> >>>> ap_fp = AllChem.GetAtomPairFingerprint(mol, minLength=1,
>

[Rdkit-discuss] Folding count vectors

2019-11-18 Thread Benjamin Datko
Hello all,

I am curious on how to fold a count vector fingerprint. I understand when
folding bit vectors the most common way is to split the vector in half, and
apply a bitwise OR operation. I think this is how the function
rdkit.DataStructs.FoldFingerprint works in RDKit, correct me if I am wrong.

How does RDKit and or what is the appropriate way to fold count vectors
such as AtomPair, Morgan, and Topological torsion?

I thought about turning the fingerprint into a bit vector using their
respected "AsBitVect" method then folding using
rdkit.DataStructs.FoldFingerprint, but topological torsion doesn't have a "
AsBitVect" method [https://www.rdkit.org/docs/GettingStartedInPython.html].

For an explicit example using AtomPair fingerprint we can see the
fingerprint is extremely sparse. Could this AtomPair fingerprint be folded
to increase the density?

>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem

>>> mol = Chem.MolFromSmiles('CC1C1')
>>> ap_fp = AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=3)

>>> number_of_nonzero_elements = len(ap_fp.GetNonzeroElements().values())

>>> print((ap_fp.GetLength(),number_of_nonzero_elements))
(8388608,9)

Very Respectfully,

Ben
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] parsing reactions for reactants, agents, products

2019-10-24 Thread Benjamin Datko
Hi Greg,

Thanks for all the info! After Hongbin's email, I found a DayLight page on
reaction agents (
https://www.daylight.com/meetings/summerschool01/course/basics/smirks.html)
but I did not realize there was a full manual on the standardization.

The method RemoveUnmappedReactantTemplates() on the chemical reaction
> object is there for a few reasons. The primary one is that "real world"
> reaction data often isn't 100% clean and can include solvents/reagents in
> the reactants section (be that in SMILES or RXN files).
> RemoveUnmappedReactantTemplates() solves that using a simple heuristic:
> "reactants" that contain more than a threshold percentage of unmapped atoms
> are either completely removed or marked as agents


Yes, thank you for explaining the reasoning of the implementation. I saw
this idea, first, from your paper (
https://pubs.acs.org/doi/10.1021/ci5006614). After playing with your
Jupyter notebooks and going the objects and methods stepwise I stumbled
upon the RemoveUnmappedReactantTemplates method. Gotta say, thank you for
supplying the notebooks in your SI. Walking through the notebooks really
expedites my learning of RDKit. =)

Very Respectfully,

Ben

On Wed, Oct 23, 2019 at 2:37 AM Greg Landrum  wrote:

> Hi Benjamin,
>
> On Tue, Oct 22, 2019 at 4:32 PM Benjamin Datko <
> benjamin.datko@gmail.com> wrote:
>
>> Hi Hongbin,
>>
>> Thank you for breaking the code down. I am still new to python and all
>> its pythonic ways. I did not notice that reactants, agents, and products
>> were delimited by '>'. After you break it down, the code does lose its
>> magic. =P
>>
>
> Yeah, as long as you have clean reaction data that works great (see below).
>
>
>>
>> Do happen to have any good references on hand that describes some of the
>> standards used in RDKit or the Cheminformatics on molecular and reaction
>> repersentation?
>>
>
> The SMiLES/SMARTS based formats are documented here:
> https://www.daylight.com/dayhtml/doc/theory/
> That's a great place to start.
>
>
>> The second question corresponds to the discussion I found in this thread (
>> https://sourceforge.net/p/rdkit/mailman/message/36316849/). I believe
>> this parameters correspond to the PgSQL RDKit implementation, but I am not
>> sure. Below I show a recursive search from the downloaded source of RDKit
>> from GitHub (https://github.com/greglandrum/rdkit).
>>
>
> The method RemoveUnmappedReactantTemplates() on the chemical reaction
> object is there for a few reasons. The primary one is that "real world"
> reaction data often isn't 100% clean and can include solvents/reagents in
> the reactants section (be that in SMILES or RXN files).
> RemoveUnmappedReactantTemplates() solves that using a simple heuristic:
> "reactants" that contain more than a threshold percentage of unmapped atoms
> are either completely removed or marked as agents
>
> Hope this helps.
> -greg
>
>
>> $ pwd
>> Downloads/rdkit-master
>>
>> $ grep -r move_unmmapped_reactants_to_agents .
>> ./Code/PgSQL/rdkit/guc.c:static bool
>> *rdkit_move_unmmapped_reactants_to_agents* = true;
>> ./Code/PgSQL/rdkit/guc.c:   "*rdkit.move_unmmapped_reactants_to_agents*",
>>  ./Code/PgSQL/rdkit/guc.c:   &*rdkit_move_unmmapped_reactants_to_agents*
>> ,
>> ./Code/PgSQL/rdkit/guc.c:  return
>> *rdkit_move_unmmapped_reactants_to_agents*;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=true;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=false;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=true;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=false;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=true;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=true;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=false;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=true;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=false;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET
>> *rdkit.move_unmmapped_reactants_to_agents*=true;
>>
>> $ grep -r threshold_unmapped_reactant_atoms .
>> ./Code/PgSQL/rdkit/guc.c:static double rdkit_
>> *threshold_unmapped_reactant_atom*s = 0.2;
>> ./Code/PgSQL/rdkit/guc.c:   "rdkit.*threshold_unma

Re: [Rdkit-discuss] parsing reactions for reactants, agents, products

2019-10-22 Thread Benjamin Datko
Hi Hongbin,

Thank you for breaking the code down. I am still new to python and all its
pythonic ways. I did not notice that reactants, agents, and products were
delimited by '>'. After you break it down, the code does lose its magic. =P

Do happen to have any good references on hand that describes some of the
standards used in RDKit or the Cheminformatics on molecular and reaction
repersentation?

The second question corresponds to the discussion I found in this thread (
https://sourceforge.net/p/rdkit/mailman/message/36316849/). I believe this
parameters correspond to the PgSQL RDKit implementation, but I am not
sure. Below I show a recursive search from the downloaded source of RDKit
from GitHub (https://github.com/greglandrum/rdkit).

$ pwd
Downloads/rdkit-master

$ grep -r move_unmmapped_reactants_to_agents .
./Code/PgSQL/rdkit/guc.c:static bool
*rdkit_move_unmmapped_reactants_to_agents* = true;
./Code/PgSQL/rdkit/guc.c:   "*rdkit.move_unmmapped_reactants_to_agents*",
 ./Code/PgSQL/rdkit/guc.c:   &*rdkit_move_unmmapped_reactants_to_agents*,
./Code/PgSQL/rdkit/guc.c:  return *rdkit_move_unmmapped_reactants_to_agents*
;
./Code/PgSQL/rdkit/expected/reaction.out:SET
*rdkit.move_unmmapped_reactants_to_agents*=true;
./Code/PgSQL/rdkit/expected/reaction.out:SET
*rdkit.move_unmmapped_reactants_to_agents*=false;
./Code/PgSQL/rdkit/expected/reaction.out:SET
*rdkit.move_unmmapped_reactants_to_agents*=true;
./Code/PgSQL/rdkit/expected/reaction.out:SET
*rdkit.move_unmmapped_reactants_to_agents*=false;
./Code/PgSQL/rdkit/expected/reaction.out:SET
*rdkit.move_unmmapped_reactants_to_agents*=true;
./Code/PgSQL/rdkit/sql/reaction.sql:SET
*rdkit.move_unmmapped_reactants_to_agents*=true;
./Code/PgSQL/rdkit/sql/reaction.sql:SET
*rdkit.move_unmmapped_reactants_to_agents*=false;
./Code/PgSQL/rdkit/sql/reaction.sql:SET
*rdkit.move_unmmapped_reactants_to_agents*=true;
./Code/PgSQL/rdkit/sql/reaction.sql:SET
*rdkit.move_unmmapped_reactants_to_agents*=false;
./Code/PgSQL/rdkit/sql/reaction.sql:SET
*rdkit.move_unmmapped_reactants_to_agents*=true;

$ grep -r threshold_unmapped_reactant_atoms .
./Code/PgSQL/rdkit/guc.c:static double rdkit_
*threshold_unmapped_reactant_atom*s = 0.2;
./Code/PgSQL/rdkit/guc.c:   "rdkit.*threshold_unmapped_reactant_atoms*",
./Code/PgSQL/rdkit/guc.c:   &*rdkit_threshold_unmapped_reactant_atoms*,
./Code/PgSQL/rdkit/guc.c:  return rdkit_*threshold_unmapped_reactant_atoms*
;
./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.
*threshold_unmapped_reactant_atoms*=0.2;
./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.
*threshold_unmapped_reactant_atoms*=0.9;
./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.
*threshold_unmapped_reactant_atoms*=0.2;
./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.
*threshold_unmapped_reactant_atoms*=0.2;
./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.
*threshold_unmapped_reactant_atoms*=0.9;
./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.
*threshold_unmapped_reactant_atoms*=0.2;



On Tue, Oct 22, 2019 at 2:29 AM Hongbin Yang  wrote:

> Hi Benjamin,
>
> The magic code uses a feature of python named "list comprehension".
> https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
>
> It does not read the rxn string directly, but splits the string first.
> Since the reaction string should be `reactants smiles>agents smiles>product
> smiles`, we can get these SMILES strings by "rxn_string.split('>')".
> Then for each part, we can use splitter "." to get single molecules. So
> finally, [mols.split('.') for mols in rxn_string.split('>')] becomes
> [[reactant1, reactant2, ..], [agent1, agent2, ..], [product1, product2,
> ...]]. But they are all SMILES strings.
>
> mols_from_smiles_list is defined here:
> https://github.com/connorcoley/ASKCOS/blob/master/makeit/utilities/io/draw.py#L16
> It just reads the smiles strings in a list into a molecule list. The only
> API is uses is "Chem.MolFromSmiles".
>
> The magic code can be translated into:
>
> reactants_smiles, agents_smiles, product_smiles= mols in
> rxn_string.split('>')
> package_results = []
> for mols in reactants_smiles, agents_smiles, product_smiles:
>   x = mols.split('.')
>   y = mols_from_smiles_list(x)   # x is a list of SMILES, and y is a list
> of molecule objects
>   package_results.append(y)
> reactants, agents, products = package_results
>
> The code now is not cool enough.
>
> I have no idea with the second question. May I ask where the
> parameters threshold_unmapped_reactant_atoms and 
> move_unmmapped_reactants_to_agents
> are defined?
>
> Best,
>
> Hongbin Yang 杨弘宾, Ph.D.
> Research: Toxicophore and Chemoinformatics
> On 10/22/2019 13:08,Benjamin Datko
>  wrote:
>
> Hello all,
>
> While reading the source code for ASKCOS (
> https://github.com/connorcoley/ASKCOS/blob/mast

[Rdkit-discuss] parsing reactions for reactants, agents, products

2019-10-21 Thread Benjamin Datko
Hello all,

While reading the source code for ASKCOS (
https://github.com/connorcoley/ASKCOS/blob/master/makeit/utilities/io/draw.py)
I noticed this code snippet (line 216 on the GitHub):

reactants, agents, products = [mols_from_smiles_list(x) for x in
[mols.split('.') for mols in rxn_string.split('>')]]

When the above code is applied on a SMILES reaction string, the result
unpacks the reactants, agents, and products mol objects into the respected
variables, with pretty good accuracy.  The function 'mols_from_smiles'
essentially just applies Chem.MolFromSmiles over a list of smiles.

I think this code snippet is really cool but I cannot find any
documentation on how this is working. Searching this mailing list I came
across the thread (https://sourceforge.net/p/rdkit/mailman/message/36316849/)
where this operation of labeling reactants, agents, and products seems to
be determined by the threshold_unmapped_reactant_atoms explained in the
quoted text from the message (linked above)

Here's what's going on: By default the cartridge code does an extra step
> after reading a reaction from SMILES/SMARTS: it looks at all the reactants
> and moves any that don't have a sufficient fraction of mapped atoms to the
> agents. We do this by default because the reactions that we found "in the
> wild" often have agents, solvents, etc. mixed in with the reactants. The
> key parameter used there is threshold_unmapped_reactant_atoms, which
> defaults to 0.2.


The only further reading I can find is from Greg's paper (
https://pubs.acs.org/doi/10.1021/ci5006614). I have two main questions:

1. Where in the code is this atom mapping being applied? I cannot tell when
this method is being applied or where the meta data is being saved.
Applying the code snippet above to a SMILES reaction string results in a
list of rdkit.Chem.rdchem.Mol objects. I cannot seem to find any static
method or attributes specifying if it's a reactant, agent, or product when
inspecting a mol object using help in a python terminal.

2. How can I change the value of the
variables threshold_unmapped_reactant_atoms
and move_unmmapped_reactants_to_agents? I am using rdkit version 2019.03.4
in an Anaconda environment. I want to experiment changing the mapping
threshold.

Very Respectfully,

Benjamin
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss