Re: [Rdkit-discuss] Folding count vectors
Hi Francois, I agree with your suggestion. I am also CCing Greg on this response. I have tried to look around on google for viewing the source code of the CreateDifferenceFingerprintForReaction method but the most relevant pages I can find describing what the code does are [here]( https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html) and [here]( https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction ) I don't mind if the source is only in C++ but where can I find it? If I can view the source code I could understand how folding a count vector was implemented. As of right now I am assuming the implementation is similar to folding a bit vector just applying a SUM instead of a logical OR. v/r, Ben On Wed, Nov 20, 2019 at 3:23 AM Francois Berenger wrote: > On 20/11/2019 02:00, Benjamin Datko wrote: > > Hello Francois, > > > > I am trying to replicate some of the functionality of > > CreateDifferenceFingerprintForReaction [Ref 1] for my own > > understanding on how the code works. The function > > CreateDifferenceFingerprintForReaction allows for three difference > > fingerprint representation of the molecules: AtomPair, Morgan, and > > TopologicalTorsion [Ref 2]. All three are count vectors [Ref 3], and > > the function allows for variable fingerprint size output. > > Personally, I wouldn't try to fold a count vector. > They are sparse vectors, so they don't take a lot of memory. > Also, they are less information lossy than binary fingerprints. > > But, maybe Greg has some hack around, if you are really forced to do > this. > > > I was following this post [Ref 4] describing how to create reaction > > difference fingerprints using different fingerprints representation. > > Using the code from the post I can create reaction difference > > fingerprints using either Morgan or AtomPair, but comparing the output > > from the post [Ref 4] to CreateDifferenceFingerprintForReaction > > results in different size fingerprints, with different values within > > the fingerprint, and different densities. I am assuming this due to > > folding the count vector down to the default fingerprint size of 2048. > > > > > > Example code snippet: > > > > # The below defs are from the post > > https://sourceforge.net/p/rdkit/mailman/message/35240736/ > > > > from rdkit import Chem > > from rdkit.Chem import AllChem > > from rdkit import DataStructs > > import copy > > > > def _createFP(mol,maxSize,fpType='AP'): > > mol.UpdatePropertyCache(False) > > if fpType == 'AP': > > return AllChem.GetAtomPairFingerprint(mol, minLength=1, > > maxLength=maxSize) > > else: > > Chem.GetSSSR(mol) > > rinfo = mol.GetRingInfo() > > return AllChem.GetMorganFingerprint(mol, radius=maxSize) > > > > def getSumFps(fps): > > summedFP = copy.deepcopy(fps[0]) > > for fp in fps[1:]: > > summedFP += fp > > return summedFP > > > > def buildReactionFP(rxn, maxSize=3, fpType='AP'): > > reactants = rxn.GetReactants() > > products = rxn.GetProducts() > > rFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in > > reactants]) > > pFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in > > products]) > > return pFP-rFP > > > >>>> rxn1 = AllChem.ReactionFromSmarts( '[C:1]C1C1>>[N:1]C1C1' > > , useSmiles=True) > > > >>>> rxfp1 = buildReactionFP(rxn1,maxSize=2) > > > >>>> rxfp1.GetNonzeroElements() > > {558114: -2, 574497: -1, 1066050: 2, 1066081: 1} > > > >>>> rxfp1.GetLength() > > 8388608 > > > > # Same reaction now using CreateDifferenceFingerprintForReaction > >>>> rxn1_fp = AllChem.CreateDifferenceFingerprintForReaction(rxn1) > > > >>>> rxn1_fp.GetNonzeroElements() > > > > {1048: 10, > > 1310: -20, > > 1325: 20, > > 1372: -10, > > 1390: 20, > > 1692: -10, > > 1757: -20, > > 1772: 10} > > > >>>> print(rxn1_fp.GetLength(),rxfp1.GetLength()) > > 2048 8388608 > > > > References > > 1. > > > https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction > > 2. > > > https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html > > 3. > > > https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints
Re: [Rdkit-discuss] Atom pairs, pi-electrons Imidazole
Hi Greg, My goal when bringing up the example of imidazole was to determine what the AtomPair pi-electron number was counting, and test my understanding after reviewing the [original paper](https://pubs.acs.org/doi/10.1021/ci00046a002). If the AtomPair implementation was counting the number of pi-electron each atom was contributing, then I thought inputing imidazole would show one and two for the nitrogens. Once I found that discrepancy I asked the question. (you could consider that it's really counting the number of orbitals > involved in pi bonding). Ah, knowing this the convention now makes interrupting the AtomPairs and Topological Torsions easier. Maybe this should be documented within the module itself? v/r, Ben On Mon, Dec 2, 2019 at 4:40 AM Greg Landrum wrote: > > The functions in the AtomPairs module are really intended to be used when > calculating the Atom Pairs and Topological Torsions fingerprints. For those > fingerprints the RDKit considers atoms in aromatic rings to have a single > pi electron (you could consider that it's really counting the number of > orbitals involved in pi bonding). > > Are you looking to calculate the number of electrons in a pi system or > something like that? Since that's not currently immediately possible > (though that's fixable), I'm curious why you want to do that. > > -greg > > > > On Fri, Nov 22, 2019 at 5:08 AM Benjamin Datko < > benjamin.datko@gmail.com> wrote: > >> Hi all, >> >> When counting the number of pi-electrons in Imidazole, I would expect a >> total of 6 pi-electrons, and the contribution of pi-electrons from the >> nitrogen atoms showing 2 and 1. >> >> But both pyScorePair and NumPiElectrons within AtomPairs only show 5 >> pi-electrons, missing the pi-electron from one of the nitrogens. What am >> I missing? >> >> >>> from rdkit import rdBase >> >>> rdBase.rdkitVersion,rdBase.boostVersion >> >> ('2019.03.4', '1_70') >> >> >>> from rdkit import Chem >> >>> from rdkit.Chem.AtomPairs import Pairs >> >> >>> m = Chem.MolFromSmiles('C1=CN=CN1') >> >>> score = Pairs.pyScorePair(m.GetAtomWithIdx(2),m.GetAtomWithIdx(4),3) >> >>> Pairs.ExplainPairScore(score) >> (('N', 2, 1), 3, ('N', 2, 1)) >> >> >>> [Chem.AtomPairs.Utils.NumPiElectrons(atom) for atom in m.GetAtoms()] >> [1, 1, 1, 1, 1] >> >> ___ >> Rdkit-discuss mailing list >> Rdkit-discuss@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> > ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Atom pairs, pi-electrons Imidazole
Hello all, Digging around within the documentation the imidazole example I shared is yet another example on how RDKit defines aromaticity. I know there might be a lot of nuance here since RDKit already makes it clear aromaticity is difficult, but here is what I don't understand. RDKit gives examples of their aromaticity model, found [here]( https://www.rdkit.org/docs/RDKit_Book.html#the-rdkit-aromaticity-model). Applying the examples given in the table possibly would explain the behavior I am observing, but the nomenclature is going over my head. For example the table shows an atomic environment 'c(a)a' contributing one pi-electron. Breaking down the terms, 'a' is any aromatic atom (this is given in the footnote of the table), and I assume 'c' is a carbon atom. Using the daylight nomenclature I would read the parenthesized expression as a branch containing any aromatic atom. Is that correct? Also, if the atomic environment of 'c(a)a' is showing at most three atoms (one carbon, one branching aromatic atom, and then another aromatic atom), which atom is contributing the single pi-electron. If anyone had any thoughts on how to read the table, it would be appreciated. I didn't have any responses on my last email, but I thought I would add what I dug up. Hopping it might help. I thought the AtomParis might be using another flavor of aromaticity, such as [MDL]( https://www.rdkit.org/docs/RDKit_Book.html#the-mdl-aromaticity-model), where five-member rings are never considered aromatic. This doesn't seem to be the case since every atom is aromatic, and every bond in the ring is aromatic. I found the isRingAromatic from [Greg]( https://sourceforge.net/p/rdkit/mailman/message/23801106/) >>> from rdkit import Chem >>> def isRingAromatic(mol,bondRing): ... for id in bondRing: ... if not mol.GetBondWithIdx(id).GetIsAromatic(): ... return False ... return True ... >>> m = Chem.MolFromSmiles('C1=CN=CN1') >>> ri = m.GetRingInfo() >>> isRingAromatic(m,ri.BondRings()[0]) True >>> [m.GetBondWithIdx(id).GetIsAromatic() for id in range(0,5)] [True, True, True, True, True] v/r, Ben On Thu, Nov 21, 2019 at 11:06 PM Benjamin Datko < benjamin.datko@gmail.com> wrote: > Hi all, > > When counting the number of pi-electrons in Imidazole, I would expect a > total of 6 pi-electrons, and the contribution of pi-electrons from the > nitrogen atoms showing 2 and 1. > > But both pyScorePair and NumPiElectrons within AtomPairs only show 5 > pi-electrons, missing the pi-electron from one of the nitrogens. What am > I missing? > > >>> from rdkit import rdBase > >>> rdBase.rdkitVersion,rdBase.boostVersion > > ('2019.03.4', '1_70') > > >>> from rdkit import Chem > >>> from rdkit.Chem.AtomPairs import Pairs > > >>> m = Chem.MolFromSmiles('C1=CN=CN1') > >>> score = Pairs.pyScorePair(m.GetAtomWithIdx(2),m.GetAtomWithIdx(4),3) > >>> Pairs.ExplainPairScore(score) > (('N', 2, 1), 3, ('N', 2, 1)) > > >>> [Chem.AtomPairs.Utils.NumPiElectrons(atom) for atom in m.GetAtoms()] > [1, 1, 1, 1, 1] > > ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] Atom pairs, pi-electrons Imidazole
Hi all, When counting the number of pi-electrons in Imidazole, I would expect a total of 6 pi-electrons, and the contribution of pi-electrons from the nitrogen atoms showing 2 and 1. But both pyScorePair and NumPiElectrons within AtomPairs only show 5 pi-electrons, missing the pi-electron from one of the nitrogens. What am I missing? >>> from rdkit import rdBase >>> rdBase.rdkitVersion,rdBase.boostVersion ('2019.03.4', '1_70') >>> from rdkit import Chem >>> from rdkit.Chem.AtomPairs import Pairs >>> m = Chem.MolFromSmiles('C1=CN=CN1') >>> score = Pairs.pyScorePair(m.GetAtomWithIdx(2),m.GetAtomWithIdx(4),3) >>> Pairs.ExplainPairScore(score) (('N', 2, 1), 3, ('N', 2, 1)) >>> [Chem.AtomPairs.Utils.NumPiElectrons(atom) for atom in m.GetAtoms()] [1, 1, 1, 1, 1] ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Folding count vectors
Hello Francois, I am trying to replicate some of the functionality of CreateDifferenceFingerprintForReaction [Ref 1] for my own understanding on how the code works. The function CreateDifferenceFingerprintForReaction allows for three difference fingerprint representation of the molecules: AtomPair, Morgan, and TopologicalTorsion [Ref 2]. All three are count vectors [Ref 3], and the function allows for variable fingerprint size output. I was following this post [Ref 4] describing how to create reaction difference fingerprints using different fingerprints representation. Using the code from the post I can create reaction difference fingerprints using either Morgan or AtomPair, but comparing the output from the post [Ref 4] to CreateDifferenceFingerprintForReaction results in different size fingerprints, with different values within the fingerprint, and different densities. I am assuming this due to folding the count vector down to the default fingerprint size of 2048. Example code snippet: # The below defs are from the post https://sourceforge.net/p/rdkit/mailman/message/35240736/ from rdkit import Chem from rdkit.Chem import AllChem from rdkit import DataStructs import copy def _createFP(mol,maxSize,fpType='AP'): mol.UpdatePropertyCache(False) if fpType == 'AP': return AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=maxSize) else: Chem.GetSSSR(mol) rinfo = mol.GetRingInfo() return AllChem.GetMorganFingerprint(mol, radius=maxSize) def getSumFps(fps): summedFP = copy.deepcopy(fps[0]) for fp in fps[1:]: summedFP += fp return summedFP def buildReactionFP(rxn, maxSize=3, fpType='AP'): reactants = rxn.GetReactants() products = rxn.GetProducts() rFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in reactants]) pFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in products]) return pFP-rFP >>> rxn1 = AllChem.ReactionFromSmarts( '[C:1]C1C1>>[N:1]C1C1' , useSmiles=True) >>> rxfp1 = buildReactionFP(rxn1,maxSize=2) >>> rxfp1.GetNonzeroElements() {558114: -2, 574497: -1, 1066050: 2, 1066081: 1} >>> rxfp1.GetLength() 8388608 # Same reaction now using CreateDifferenceFingerprintForReaction >>> rxn1_fp = AllChem.CreateDifferenceFingerprintForReaction(rxn1) >>> rxn1_fp.GetNonzeroElements() {1048: 10, 1310: -20, 1325: 20, 1372: -10, 1390: 20, 1692: -10, 1757: -20, 1772: 10} >>> print(rxn1_fp.GetLength(),rxfp1.GetLength()) 2048 8388608 References 1. https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction 2. https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html 3. https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints 4. https://sourceforge.net/p/rdkit/mailman/message/35240736/ v/r, Ben On Mon, Nov 18, 2019 at 10:13 PM Francois Berenger wrote: > On 19/11/2019 03:34, Benjamin Datko wrote: > > Hello all, > > > > I am curious on how to fold a count vector fingerprint. I understand > > when folding bit vectors the most common way is to split the vector in > > half, and apply a bitwise OR operation. I think this is how the > > function rdkit.DataStructs.FoldFingerprint works in RDKit, correct me > > if I am wrong. > > > > How does RDKit and or what is the appropriate way to fold count > > vectors such as AtomPair, Morgan, and Topological torsion? > > Can you give us some context? Why do you want to do that? > > Maybe, you can use the following in order to create > shorter "fingerprints" for which the Tanimoto distance is > still computable (despite becoming approximate then): > > --- > Shrivastava, A. (2016). > Simple and efficient weighted minwise hashing. > In Advances in Neural Information Processing Systems (pp. 1498-1506). > > > https://papers.nips.cc/paper/6472-simple-and-efficient-weighted-minwise-hashing.pdf > --- > > Regards, > F. > > > I thought about turning the fingerprint into a bit vector using their > > respected "AsBitVect" method then folding using > > rdkit.DataStructs.FoldFingerprint, but topological torsion doesn't > > have a "AsBitVect" method > > [https://www.rdkit.org/docs/GettingStartedInPython.html]. > > > > For an explicit example using AtomPair fingerprint we can see the > > fingerprint is extremely sparse. Could this AtomPair fingerprint be > > folded to increase the density? > > > >>>> from rdkit import Chem > > > >>>> from rdkit.Chem import AllChem > > > >>>> mol = Chem.MolFromSmiles('CC1C1') > >>>> ap_fp = AllChem.GetAtomPairFingerprint(mol, minLength=1, >
[Rdkit-discuss] Folding count vectors
Hello all, I am curious on how to fold a count vector fingerprint. I understand when folding bit vectors the most common way is to split the vector in half, and apply a bitwise OR operation. I think this is how the function rdkit.DataStructs.FoldFingerprint works in RDKit, correct me if I am wrong. How does RDKit and or what is the appropriate way to fold count vectors such as AtomPair, Morgan, and Topological torsion? I thought about turning the fingerprint into a bit vector using their respected "AsBitVect" method then folding using rdkit.DataStructs.FoldFingerprint, but topological torsion doesn't have a " AsBitVect" method [https://www.rdkit.org/docs/GettingStartedInPython.html]. For an explicit example using AtomPair fingerprint we can see the fingerprint is extremely sparse. Could this AtomPair fingerprint be folded to increase the density? >>> from rdkit import Chem >>> from rdkit.Chem import AllChem >>> mol = Chem.MolFromSmiles('CC1C1') >>> ap_fp = AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=3) >>> number_of_nonzero_elements = len(ap_fp.GetNonzeroElements().values()) >>> print((ap_fp.GetLength(),number_of_nonzero_elements)) (8388608,9) Very Respectfully, Ben ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] parsing reactions for reactants, agents, products
Hi Greg, Thanks for all the info! After Hongbin's email, I found a DayLight page on reaction agents ( https://www.daylight.com/meetings/summerschool01/course/basics/smirks.html) but I did not realize there was a full manual on the standardization. The method RemoveUnmappedReactantTemplates() on the chemical reaction > object is there for a few reasons. The primary one is that "real world" > reaction data often isn't 100% clean and can include solvents/reagents in > the reactants section (be that in SMILES or RXN files). > RemoveUnmappedReactantTemplates() solves that using a simple heuristic: > "reactants" that contain more than a threshold percentage of unmapped atoms > are either completely removed or marked as agents Yes, thank you for explaining the reasoning of the implementation. I saw this idea, first, from your paper ( https://pubs.acs.org/doi/10.1021/ci5006614). After playing with your Jupyter notebooks and going the objects and methods stepwise I stumbled upon the RemoveUnmappedReactantTemplates method. Gotta say, thank you for supplying the notebooks in your SI. Walking through the notebooks really expedites my learning of RDKit. =) Very Respectfully, Ben On Wed, Oct 23, 2019 at 2:37 AM Greg Landrum wrote: > Hi Benjamin, > > On Tue, Oct 22, 2019 at 4:32 PM Benjamin Datko < > benjamin.datko@gmail.com> wrote: > >> Hi Hongbin, >> >> Thank you for breaking the code down. I am still new to python and all >> its pythonic ways. I did not notice that reactants, agents, and products >> were delimited by '>'. After you break it down, the code does lose its >> magic. =P >> > > Yeah, as long as you have clean reaction data that works great (see below). > > >> >> Do happen to have any good references on hand that describes some of the >> standards used in RDKit or the Cheminformatics on molecular and reaction >> repersentation? >> > > The SMiLES/SMARTS based formats are documented here: > https://www.daylight.com/dayhtml/doc/theory/ > That's a great place to start. > > >> The second question corresponds to the discussion I found in this thread ( >> https://sourceforge.net/p/rdkit/mailman/message/36316849/). I believe >> this parameters correspond to the PgSQL RDKit implementation, but I am not >> sure. Below I show a recursive search from the downloaded source of RDKit >> from GitHub (https://github.com/greglandrum/rdkit). >> > > The method RemoveUnmappedReactantTemplates() on the chemical reaction > object is there for a few reasons. The primary one is that "real world" > reaction data often isn't 100% clean and can include solvents/reagents in > the reactants section (be that in SMILES or RXN files). > RemoveUnmappedReactantTemplates() solves that using a simple heuristic: > "reactants" that contain more than a threshold percentage of unmapped atoms > are either completely removed or marked as agents > > Hope this helps. > -greg > > >> $ pwd >> Downloads/rdkit-master >> >> $ grep -r move_unmmapped_reactants_to_agents . >> ./Code/PgSQL/rdkit/guc.c:static bool >> *rdkit_move_unmmapped_reactants_to_agents* = true; >> ./Code/PgSQL/rdkit/guc.c: "*rdkit.move_unmmapped_reactants_to_agents*", >> ./Code/PgSQL/rdkit/guc.c: &*rdkit_move_unmmapped_reactants_to_agents* >> , >> ./Code/PgSQL/rdkit/guc.c: return >> *rdkit_move_unmmapped_reactants_to_agents*; >> ./Code/PgSQL/rdkit/expected/reaction.out:SET >> *rdkit.move_unmmapped_reactants_to_agents*=true; >> ./Code/PgSQL/rdkit/expected/reaction.out:SET >> *rdkit.move_unmmapped_reactants_to_agents*=false; >> ./Code/PgSQL/rdkit/expected/reaction.out:SET >> *rdkit.move_unmmapped_reactants_to_agents*=true; >> ./Code/PgSQL/rdkit/expected/reaction.out:SET >> *rdkit.move_unmmapped_reactants_to_agents*=false; >> ./Code/PgSQL/rdkit/expected/reaction.out:SET >> *rdkit.move_unmmapped_reactants_to_agents*=true; >> ./Code/PgSQL/rdkit/sql/reaction.sql:SET >> *rdkit.move_unmmapped_reactants_to_agents*=true; >> ./Code/PgSQL/rdkit/sql/reaction.sql:SET >> *rdkit.move_unmmapped_reactants_to_agents*=false; >> ./Code/PgSQL/rdkit/sql/reaction.sql:SET >> *rdkit.move_unmmapped_reactants_to_agents*=true; >> ./Code/PgSQL/rdkit/sql/reaction.sql:SET >> *rdkit.move_unmmapped_reactants_to_agents*=false; >> ./Code/PgSQL/rdkit/sql/reaction.sql:SET >> *rdkit.move_unmmapped_reactants_to_agents*=true; >> >> $ grep -r threshold_unmapped_reactant_atoms . >> ./Code/PgSQL/rdkit/guc.c:static double rdkit_ >> *threshold_unmapped_reactant_atom*s = 0.2; >> ./Code/PgSQL/rdkit/guc.c: "rdkit.*threshold_unma
Re: [Rdkit-discuss] parsing reactions for reactants, agents, products
Hi Hongbin, Thank you for breaking the code down. I am still new to python and all its pythonic ways. I did not notice that reactants, agents, and products were delimited by '>'. After you break it down, the code does lose its magic. =P Do happen to have any good references on hand that describes some of the standards used in RDKit or the Cheminformatics on molecular and reaction repersentation? The second question corresponds to the discussion I found in this thread ( https://sourceforge.net/p/rdkit/mailman/message/36316849/). I believe this parameters correspond to the PgSQL RDKit implementation, but I am not sure. Below I show a recursive search from the downloaded source of RDKit from GitHub (https://github.com/greglandrum/rdkit). $ pwd Downloads/rdkit-master $ grep -r move_unmmapped_reactants_to_agents . ./Code/PgSQL/rdkit/guc.c:static bool *rdkit_move_unmmapped_reactants_to_agents* = true; ./Code/PgSQL/rdkit/guc.c: "*rdkit.move_unmmapped_reactants_to_agents*", ./Code/PgSQL/rdkit/guc.c: &*rdkit_move_unmmapped_reactants_to_agents*, ./Code/PgSQL/rdkit/guc.c: return *rdkit_move_unmmapped_reactants_to_agents* ; ./Code/PgSQL/rdkit/expected/reaction.out:SET *rdkit.move_unmmapped_reactants_to_agents*=true; ./Code/PgSQL/rdkit/expected/reaction.out:SET *rdkit.move_unmmapped_reactants_to_agents*=false; ./Code/PgSQL/rdkit/expected/reaction.out:SET *rdkit.move_unmmapped_reactants_to_agents*=true; ./Code/PgSQL/rdkit/expected/reaction.out:SET *rdkit.move_unmmapped_reactants_to_agents*=false; ./Code/PgSQL/rdkit/expected/reaction.out:SET *rdkit.move_unmmapped_reactants_to_agents*=true; ./Code/PgSQL/rdkit/sql/reaction.sql:SET *rdkit.move_unmmapped_reactants_to_agents*=true; ./Code/PgSQL/rdkit/sql/reaction.sql:SET *rdkit.move_unmmapped_reactants_to_agents*=false; ./Code/PgSQL/rdkit/sql/reaction.sql:SET *rdkit.move_unmmapped_reactants_to_agents*=true; ./Code/PgSQL/rdkit/sql/reaction.sql:SET *rdkit.move_unmmapped_reactants_to_agents*=false; ./Code/PgSQL/rdkit/sql/reaction.sql:SET *rdkit.move_unmmapped_reactants_to_agents*=true; $ grep -r threshold_unmapped_reactant_atoms . ./Code/PgSQL/rdkit/guc.c:static double rdkit_ *threshold_unmapped_reactant_atom*s = 0.2; ./Code/PgSQL/rdkit/guc.c: "rdkit.*threshold_unmapped_reactant_atoms*", ./Code/PgSQL/rdkit/guc.c: &*rdkit_threshold_unmapped_reactant_atoms*, ./Code/PgSQL/rdkit/guc.c: return rdkit_*threshold_unmapped_reactant_atoms* ; ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit. *threshold_unmapped_reactant_atoms*=0.2; ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit. *threshold_unmapped_reactant_atoms*=0.9; ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit. *threshold_unmapped_reactant_atoms*=0.2; ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit. *threshold_unmapped_reactant_atoms*=0.2; ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit. *threshold_unmapped_reactant_atoms*=0.9; ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit. *threshold_unmapped_reactant_atoms*=0.2; On Tue, Oct 22, 2019 at 2:29 AM Hongbin Yang wrote: > Hi Benjamin, > > The magic code uses a feature of python named "list comprehension". > https://www.pythonforbeginners.com/basics/list-comprehensions-in-python > > It does not read the rxn string directly, but splits the string first. > Since the reaction string should be `reactants smiles>agents smiles>product > smiles`, we can get these SMILES strings by "rxn_string.split('>')". > Then for each part, we can use splitter "." to get single molecules. So > finally, [mols.split('.') for mols in rxn_string.split('>')] becomes > [[reactant1, reactant2, ..], [agent1, agent2, ..], [product1, product2, > ...]]. But they are all SMILES strings. > > mols_from_smiles_list is defined here: > https://github.com/connorcoley/ASKCOS/blob/master/makeit/utilities/io/draw.py#L16 > It just reads the smiles strings in a list into a molecule list. The only > API is uses is "Chem.MolFromSmiles". > > The magic code can be translated into: > > reactants_smiles, agents_smiles, product_smiles= mols in > rxn_string.split('>') > package_results = [] > for mols in reactants_smiles, agents_smiles, product_smiles: > x = mols.split('.') > y = mols_from_smiles_list(x) # x is a list of SMILES, and y is a list > of molecule objects > package_results.append(y) > reactants, agents, products = package_results > > The code now is not cool enough. > > I have no idea with the second question. May I ask where the > parameters threshold_unmapped_reactant_atoms and > move_unmmapped_reactants_to_agents > are defined? > > Best, > > Hongbin Yang 杨弘宾, Ph.D. > Research: Toxicophore and Chemoinformatics > On 10/22/2019 13:08,Benjamin Datko > wrote: > > Hello all, > > While reading the source code for ASKCOS ( > https://github.com/connorcoley/ASKCOS/blob/mast
[Rdkit-discuss] parsing reactions for reactants, agents, products
Hello all, While reading the source code for ASKCOS ( https://github.com/connorcoley/ASKCOS/blob/master/makeit/utilities/io/draw.py) I noticed this code snippet (line 216 on the GitHub): reactants, agents, products = [mols_from_smiles_list(x) for x in [mols.split('.') for mols in rxn_string.split('>')]] When the above code is applied on a SMILES reaction string, the result unpacks the reactants, agents, and products mol objects into the respected variables, with pretty good accuracy. The function 'mols_from_smiles' essentially just applies Chem.MolFromSmiles over a list of smiles. I think this code snippet is really cool but I cannot find any documentation on how this is working. Searching this mailing list I came across the thread (https://sourceforge.net/p/rdkit/mailman/message/36316849/) where this operation of labeling reactants, agents, and products seems to be determined by the threshold_unmapped_reactant_atoms explained in the quoted text from the message (linked above) Here's what's going on: By default the cartridge code does an extra step > after reading a reaction from SMILES/SMARTS: it looks at all the reactants > and moves any that don't have a sufficient fraction of mapped atoms to the > agents. We do this by default because the reactions that we found "in the > wild" often have agents, solvents, etc. mixed in with the reactants. The > key parameter used there is threshold_unmapped_reactant_atoms, which > defaults to 0.2. The only further reading I can find is from Greg's paper ( https://pubs.acs.org/doi/10.1021/ci5006614). I have two main questions: 1. Where in the code is this atom mapping being applied? I cannot tell when this method is being applied or where the meta data is being saved. Applying the code snippet above to a SMILES reaction string results in a list of rdkit.Chem.rdchem.Mol objects. I cannot seem to find any static method or attributes specifying if it's a reactant, agent, or product when inspecting a mol object using help in a python terminal. 2. How can I change the value of the variables threshold_unmapped_reactant_atoms and move_unmmapped_reactants_to_agents? I am using rdkit version 2019.03.4 in an Anaconda environment. I want to experiment changing the mapping threshold. Very Respectfully, Benjamin ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss