Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match
Brian, Greg, and David,

Thank you for your suggestions. I will try to respond to your questions and comments.

I am trying to reproduce results from a literature paper that used non-Python, non-RDKit code to identify certain patterns in molecules as part of a group contribution scheme for predicting thermodynamic quantities. I have a training set of molecules and the results of the calculations for that training set (individual counts of groups of atoms and the resulting energies). Hence, my first goal is to reproduce the results reported for that training set, but using Python and RDKit.

Since my goal is to reproduce the literature results as closely as possible, I am not in a position to debate the logic of the original authors in their assignments of SMARTS/SMILES matching and counts. After this initial goal is met, I might consider alternative pattern matching and counting schemes and compare those results to the literature results. In fact, that would be good science.

As I mentioned in my first email on this topic, I do think I have come up with a "rule" that will give me the correct answer (I have tried it for 8 cases using pencil and paper). My challenge is to code up the "rule" in Python. I am a beginner at Python, so I am struggling to get this idea into functional, bug-free code.

Peter Shenkin's idea/code is getting close to what needs to be done, but doesn't handle all the cases.

Regards,
Jim Metz

-----Original Message-----
From: Brian Cole
To: James T. Metz
Cc: RDKit Discuss
Sent: Tue, Nov 7, 2017 7:23 pm
Subject: Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match

You can use Chem.CanonicalRankAtoms to de-duplicate the SMARTS matches based upon the atom symmetry [...]

--
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match
Peter,

Thank you for your suggestions and accompanying code. I have modified your code slightly and have created 3 test tuples. Your code works for match1 and match2, but does not work for match3.

The code should return 2 for match3, because there are 2 sets of 3 tuples, each tuple containing 4 atom indices. Using my "rule" that "if 3 indices are the same, they are in one group, and one must form the groups of the largest possible size", one arrives at 2 groups. The merge function should then select one tuple from each group, resulting in a count of 2 (the final number of groups).

Keep in mind that I will not know how many groups of tuples will be created for any given molecule. Hence, I cannot use hard-coded array indices.

Any ideas how to modify the code below to obtain the desired result for match3, and how to deal with tuples of various sizes?

Regards,
Jim Metz

    def merge2(matches):
        if len(matches) > 1:
            d = {}
            for match in matches:
                t = (matches[0], matches[1])
                if (matches[0] < matches[1]):
                    t = (matches[0], matches[1])
                else:
                    t = (matches[1], matches[0])
                d[t] = match
            merged_match = (d[t],)
        else:
            merged_match = matches
        count = len(merged_match)
        return(count)

    match1 = ((0,2,3,4),)
    match2 = ((0,2,3,4), (1,2,3,4))
    match3 = ((0,2,4,5), (1,2,5,6), (2,3,4,5), (2,3,5,6), (0,2,5,6), (1,2,4,5))

    matches = match2  # Change the number to test different tuples
    output = merge2(matches)
    print("Output is ", output)

-----Original Message-----
From: Peter S. Shenkin
To: James T. Metz
Cc: RDKit Discuss
Sent: Tue, Nov 7, 2017 7:05 pm
Subject: Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match

I think you probably used a slightly different SMILES than the one you showed. [...]

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
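A minimal sketch of one way the "three shared indices" rule from the message above might be coded, assuming a greedy reading: repeatedly take the largest set of remaining tuples that share the same 3 atom indices and count it as one group. The name count_merged_groups is hypothetical, and each tuple is treated as an unordered set of atom indices, because the test tuples list their indices in sorted order rather than SMARTS order. It uses no hard-coded positions, but it has only been reasoned through for the three test cases here.

```python
from itertools import combinations

def count_merged_groups(matches):
    """Greedy reading of the rule: merge tuples sharing 3 indices, largest groups first."""
    remaining = {frozenset(match) for match in matches}
    groups = 0
    while remaining:
        # Bucket the remaining matches under every 3-atom subset they contain.
        buckets = {}
        for match in remaining:
            for triple in combinations(sorted(match), 3):
                buckets.setdefault(triple, set()).add(match)
        # The largest bucket becomes one merged group; drop its members.
        largest = max(buckets.values(), key=len)
        remaining -= largest
        groups += 1
    return groups

match1 = ((0, 2, 3, 4),)
match2 = ((0, 2, 3, 4), (1, 2, 3, 4))
match3 = ((0, 2, 4, 5), (1, 2, 5, 6), (2, 3, 4, 5),
          (2, 3, 5, 6), (0, 2, 5, 6), (1, 2, 4, 5))

for name, matches in (("match1", match1), ("match2", match2), ("match3", match3)):
    print(name, count_merged_groups(matches))
# match1 1
# match2 1
# match3 2
```

For match3 the two largest buckets are the tuples sharing atoms (2, 4, 5) and those sharing (2, 5, 6), three tuples each, which gives the 2 groups asked for.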
Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match
Hi Jim,

Would it not be easier to use a recursive SMARTS, so that you only count the carbon atoms? Something like

    [$([C,c]Cl)]-,=,:[$([C,c]Cl)]

or, more compactly,

    [$([#6]Cl)]~[$([#6]Cl)]

I haven't tested these, as I'm not close to a suitably equipped computer, but you should be able to get the gist at least. The Cl only defines the sort of C you're after, so you won't have to deal with multiple Cl matches on the same atom.

Dave

On Wed, Nov 8, 2017 at 7:08 AM, Greg Landrum wrote:
> Jim,
>
> I'm a bit confused by what you're trying to do.
> [...]

--
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
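To make the intent of the recursive SMARTS concrete, here is a small sketch that counts the same thing by hand: carbon-carbon bonds in which both carbons carry at least one chlorine. It deliberately does not use RDKit; the molecules are hand-built lists of element symbols and bonds, and the function name is hypothetical. The second molecule uses the SMILES as Brian wrote it elsewhere in this thread, ClC(Cl)C(Cl)(Cl)(Cl).

```python
def count_cl_bearing_carbon_pairs(symbols, bonds):
    """Count C-C bonds in which both carbons carry at least one Cl."""
    neighbours = {i: set() for i in range(len(symbols))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def is_cl_carbon(i):
        # A carbon qualifies only if at least one neighbour is a chlorine.
        return symbols[i] == "C" and any(symbols[j] == "Cl" for j in neighbours[i])

    return sum(1 for a, b in bonds if is_cl_carbon(a) and is_cl_carbon(b))

# ClC(Cl)CCl: atoms 0..4 are Cl, C, Cl, C, Cl
graph1 = (["Cl", "C", "Cl", "C", "Cl"],
          [(0, 1), (1, 2), (1, 3), (3, 4)])
# ClC(Cl)C(Cl)(Cl)(Cl): atoms 0..6 are Cl, C, Cl, C, Cl, Cl, Cl
graph2 = (["Cl", "C", "Cl", "C", "Cl", "Cl", "Cl"],
          [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (3, 6)])

print(count_cl_bearing_carbon_pairs(*graph1))  # 1
print(count_cl_bearing_carbon_pairs(*graph2))  # 1
```

Counting this way gives one hit per qualifying carbon pair no matter how many chlorines each carbon carries, so it would yield 1 for the second molecule rather than the 2 that Jim is after.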
Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match
Jim,

I'm a bit confused by what you're trying to do.

Maybe we can try simplifying. What would you like to have returned for each of these SMILES:

1) ClC=CCl
2) ClC(Cl)=CCl
3) ClC(Cl)=C(Cl)Cl

If the answer is the same for 1) and 2), but different for 3), then the next question will be: "why?"

-greg

On Wed, Nov 8, 2017 at 12:38 AM, James T. Metz via Rdkit-discuss <rdkit-discuss@lists.sourceforge.net> wrote:
> RDKit Discussion Group,
>
> I have written a SMARTS to detect vicinal chlorine groups
> using RDKit. There are 4 atoms involved in a vicinal chlorine group.
> [...]
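The three SMILES above can be compared under the counting conventions that come up in this thread. The sketch below is plain arithmetic on the number of chlorines on each carbon of the single C=C pair (read off the SMILES by hand), under two assumptions: that the matcher reports one match per distinct set of four atoms, and that Jim's "largest possible groups" rule reduces to the smaller of the two chlorine counts. The function and key names are hypothetical.

```python
# Chlorines on each carbon of the C=C pair, read off the SMILES by hand.
cases = {
    "ClC=CCl": (1, 1),
    "ClC(Cl)=CCl": (2, 1),
    "ClC(Cl)=C(Cl)Cl": (2, 2),
}

def counts(n_a, n_b):
    return {
        "all_cl_pairs": n_a * n_b,                # one match per Cl/Cl combination
        "largest_groups": min(n_a, n_b),          # assumed reading of the merge rule
        "carbon_pairs": 1 if n_a and n_b else 0,  # each C-C pair counted once
    }

for smiles, (n_a, n_b) in cases.items():
    print(smiles, counts(n_a, n_b))
# ClC=CCl {'all_cl_pairs': 1, 'largest_groups': 1, 'carbon_pairs': 1}
# ClC(Cl)=CCl {'all_cl_pairs': 2, 'largest_groups': 1, 'carbon_pairs': 1}
# ClC(Cl)=C(Cl)Cl {'all_cl_pairs': 4, 'largest_groups': 2, 'carbon_pairs': 1}
```

Under the assumed merge rule, cases 1) and 2) agree and case 3) differs, which is exactly the situation that prompts the "why?".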
Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match
You can use Chem.CanonicalRankAtoms to de-duplicate the SMARTS matches based upon the atom symmetry, like this:

    from rdkit import Chem

    def count_unique_substructures(smiles, smarts):
        mol = Chem.MolFromSmiles(smiles)
        # breakTies=False gives symmetry-equivalent atoms the same rank
        ranks = list(Chem.CanonicalRankAtoms(mol, breakTies=False))
        pattern = Chem.MolFromSmarts(smarts)
        unique_sets_of_atoms = set()
        for match in mol.GetSubstructMatches(pattern):
            # Matches whose atoms carry the same set of ranks count once
            match_ranks = frozenset([ranks[idx] for idx in match])
            unique_sets_of_atoms.add(match_ranks)
        return len(unique_sets_of_atoms)

However, this returns 1 for each of your cases. It's not clear to me why you would want your 2nd case to return 2, as all paths from a chlorine to a chlorine through 2 carbons are symmetric.

    >>> SMARTS = '[Cl]-[C,c]-,=,:[C,c]-[Cl]'
    >>> smiles1 = 'ClC(Cl)CCl'
    >>> smiles2 = 'ClC(Cl)C(Cl)(Cl)(Cl)'
    >>> count_unique_substructures(smiles1, SMARTS)
    1
    >>> count_unique_substructures(smiles2, SMARTS)
    1

-Brian

On Tue, Nov 7, 2017 at 7:38 PM, James T. Metz via Rdkit-discuss <rdkit-discuss@lists.sourceforge.net> wrote:
> RDKit Discussion Group,
>
> I have written a SMARTS to detect vicinal chlorine groups
> using RDKit. There are 4 atoms involved in a vicinal chlorine group.
> [...]
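Why the rank-based approach collapses the second case to 1 can be seen without RDKit. In the sketch below, the symmetry classes for ClC(Cl)C(Cl)(Cl)(Cl) are assigned by hand as string labels; they are illustrative stand-ins, not the rank values CanonicalRankAtoms would actually return, and the match list is built by hand rather than by a substructure search.

```python
from itertools import product

# Hand-assigned symmetry classes for ClC(Cl)C(Cl)(Cl)(Cl), atoms 0..6.
# Illustrative labels only, not real CanonicalRankAtoms output.
sym_class = {0: "Cl_a", 2: "Cl_a", 1: "C_a",
             3: "C_b", 4: "Cl_b", 5: "Cl_b", 6: "Cl_b"}

# Every Cl-C-C-Cl path: one chlorine on atom 1, one chlorine on atom 3.
matches = [(cl_a, 1, 3, cl_b) for cl_a, cl_b in product((0, 2), (4, 5, 6))]

# Replace atom indices by symmetry classes and de-duplicate.
unique = {frozenset(sym_class[idx] for idx in match) for match in matches}

print(len(matches), len(unique))  # 6 1
```

All six paths map onto the same set of classes, so any count based on symmetry alone gives 1 here; a count of 2 has to come from a different rule.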
Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match
I think you probably used a slightly different SMILES than the one you showed. The one you showed should have given ((0,1,3,4),(2,1,3,4)).

The proper merge rule would then be to consider all matches equivalent if the 2nd and 3rd atoms in the match agree, in any order; i.e., the two carbons, indices 1 and 3 in this case. So to do this, for each molecule, do something like this:

    d = {}
    for match in matches:
        # Key on the two carbons, smaller index first
        if match[1] < match[2]:
            t = (match[1], match[2])
        else:
            t = (match[2], match[1])
        d[t] = match

You will wind up with as many dictionary elements as there are unique matches.

-P.

On Tue, Nov 7, 2017 at 7:38 PM, James T. Metz via Rdkit-discuss <rdkit-discuss@lists.sourceforge.net> wrote:
> RDKit Discussion Group,
>
> I have written a SMARTS to detect vicinal chlorine groups
> using RDKit. There are 4 atoms involved in a vicinal chlorine group.
> [...]
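A slightly extended sketch of the carbon-pair idea above: besides keying on the two carbons, it also records which chlorines were seen at each end, so that a later counting rule has something to work with. It assumes every match is in SMARTS order (Cl, C, C, Cl); the function name group_by_carbon_pair is hypothetical.

```python
from collections import defaultdict

def group_by_carbon_pair(matches):
    """Map each carbon pair to the chlorines seen on either end."""
    groups = defaultdict(lambda: (set(), set()))
    for cl_1, c_1, c_2, cl_2 in matches:
        # Orient every match so the lower-index carbon comes first.
        if c_1 > c_2:
            cl_1, c_1, c_2, cl_2 = cl_2, c_2, c_1, cl_1
        groups[(c_1, c_2)][0].add(cl_1)
        groups[(c_1, c_2)][1].add(cl_2)
    return dict(groups)

# The matches expected for ClC(Cl)CCl
matches = ((0, 1, 3, 4), (2, 1, 3, 4))
groups = group_by_carbon_pair(matches)
print(groups)       # {(1, 3): ({0, 2}, {4})}
print(len(groups))  # 1
```

The number of keys is the count the carbon-pair rule gives; the two sets per key show how many chlorines sit on each carbon, in case the count should depend on them.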
[Rdkit-discuss] Python code to merge tuples from a SMARTS match
RDKit Discussion Group,

I have written a SMARTS to detect vicinal chlorine groups using RDKit. There are 4 atoms involved in a vicinal chlorine group.

    SMARTS = '[Cl]-[C,c]-,=,:[C,c]-[Cl]'

I am trying to count the number of ("unique") occurrences of this pattern. For some molecules with symmetry, this results in over-counting.

For the molecule smiles1 below, I want to obtain a count of 1, i.e., 1 tuple of 4 atoms.

    smiles1 = 'ClC(Cl)CCl'

However, using the SMARTS above, I obtain 2 tuples of 4 atoms. Beginning with a MOL file representation of smiles1, I get

    ((1,2,4,3), (0,2,4,3))

One possible solution is to somehow merge the two tuples according to a "rule." One rule that works is "if 3 of the atom indices are the same, then combine into one tuple."

However, the rule needs a bit of modification for more complicated cases (higher symmetry). Consider

    smiles2 = 'ClC(Cl)CCl(Cl)(Cl)'

My goal is to get 2 tuples of 4 atoms for smiles2.

smiles2 is somewhat tricky because there are either 2 groups of 3 (4-atom) tuples, or 3 groups of 2 (4-atom) tuples, depending on how you choose your 3 atom indices. Again, if my goal is to get 2 tuples, then I need to somehow pick the largest group, i.e., 2 groups of 3 tuples, to do the merge operation, which will give me 2 remaining groups (desired).

I have already checked Stack Overflow and a few other places for Python code to do the necessary merging, but I could not find anything specific and appropriate.

I would be most grateful if anyone has ideas how to do this. I suspect the answer is a few lines of well-written Python code, and not modifying the SMARTS (I could be mistaken!).

Thank you.

Regards,
Jim Metz