Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-08 Thread Greg Landrum
On Thu, Nov 9, 2017 at 6:32 AM, Brian Cole  wrote:

> Hi Cheminformaticians,
>
> This is an extreme subtlety in the interpretation of SMILES atom
> stereochemistry and I think a bug in RDKit. Specifically, I think the
> following SMILES should be the same molecule:
>
> >>> rdkit.__version__
> '2017.09.1'
> >>> Chem.CanonSmiles('F[C@@]1(C)CCO1')
> 'C[C@]1(F)CCO1'
> >>> Chem.CanonSmiles('[C@@](F)1(C)CCO1')
> 'C[C@@]1(F)CCO1'
>

As was discussed in the comments of
https://github.com/rdkit/rdkit/issues/786, I think it's pretty gross that
the second syntax is even legal. But that's a side point.

> Since there is no hydrogen inside the stereo carbon atom block the bond
> being 'looked down' should be the first atom encountered. In both cases
> above, that should be the fluorine, therefore the molecules should be
> equivalent.
>

Agreed, and this is a view that's further supported by this behavior:

In [2]: Chem.CanonSmiles('F[C@@]1(C)CCO1')
Out[2]: 'C[C@]1(F)CCO1'

In [3]: Chem.CanonSmiles('F[C@@](C)1CCO1')
Out[3]: 'C[C@@]1(F)CCO1'
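
The neighbor-ordering argument can be checked by hand. The following is a
minimal sketch, not RDKit code: the neighbor lists are written out manually
under the assumption that neighbors are taken in the order their bonds appear
in the SMILES string, and permutation_parity is a hypothetical helper name.
Two orderings carrying the same @/@@ tag describe the same stereocenter only
when they differ by an even permutation.

```python
def permutation_parity(reference, other):
    """Return 0 if `other` is an even permutation of `reference`, 1 if odd."""
    position = {label: i for i, label in enumerate(reference)}
    perm = [position[label] for label in other]
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2


# Hand-derived neighbor orders around the stereocenter (assumption: written order).
# "O" is the atom reached through ring-closure digit 1.
first = ["F", "O", "CH3", "CH2"]    # F[C@@]1(C)CCO1
second = ["F", "O", "CH3", "CH2"]   # [C@@](F)1(C)CCO1
third = ["F", "CH3", "O", "CH2"]    # F[C@@](C)1CCO1

print(permutation_parity(first, second))  # 0: same handedness expected
print(permutation_parity(first, third))   # 1: opposite handedness expected
```

On this reading the first two inputs should canonicalize identically, while
the third differs from the first by one swap, which is consistent with Out[2]
and Out[3] above being different.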

Would you mind filing a bug for this and I'll try to track it down/fix it?

Thanks,
-greg



>
> Though it could be argued the 2nd one is not strict SMILES as Andrew
> describes here: https://github.com/rdkit/rdkit/issues/786
>
> It is useful when recombining fragments with ring closure digits for these
> to be equivalent:
> [*][C@]1(C)CCO1
> [C@]([*])1(C)CCO1
>
> Also, every other tool I can get my hands on agrees they're the same:
> OEChem, OpenBabel, indigo, and ChemAxon. (CDK lacks a simple enough
> canonicalization example for me to work from.)
>
> Sure wish there was a SMILES validation test suite we could all run
> against. And so I'm attaching the examples I used to verify the above so
> whatever poor soul assigned that task later can find this on Google. (I'm
> hopeful :-)
>
> Thanks,
> Brian
>
> PS: the current output from the script:
>
> $ python stereo_handling_first_atom.py
> RDKit = 2017.09.1
> OEChem = 2.1.2
> OpenBabel = 2.4.1
> indigo = 1.2.3.r0-g98188eb mac10.7
> RDKit failed to recognize these as the same:
> [*:1][C@]1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@]1([*:1])[*:2]
> [C@]([*:1])1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@@]1([*:1])[*:2]
> OpenBabel failed to recognize these as the same:
> Cl[S@](C)=O -> C[S@](=O)Cl
> [S@](Cl)(C)=O -> C[S@@](=O)Cl
> Indigo failed to recognize these as the same:
> Cl[S@](C)=O -> C[S@](=O)Cl
> [S@](Cl)(C)=O -> C[S@@](=O)Cl
> OpenBabel failed to recognize these as the same:
> Cl[S@](C)= -> =[S@](Cl)C
> [S@](Cl)(C)= -> =[S@@](Cl)C
> Indigo failed to recognize these as the same:
> Cl[S@](C)= -> =[S@@](C)Cl
> [S@](Cl)(C)= -> =[S@](C)Cl
> RDKit failed to recognize these as the same:
> Cl[C@](F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@](F)(Cl)CC1
> [C@](Cl)(F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@@](F)(Cl)CC1
> RDKit failed to recognize these as the same:
> Cl[C@]1(c2c2)NCCCS1 -> Cl[C@]1(c2c2)NCCCS1
> [C@](Cl)1(c2c2)NCCCS1 -> Cl[C@@]1(c2c2)NCCCS1
> RDKit failed to recognize these as the same:
> Cl3.[C@]31(c2c2)NCCCS1 -> Cl[C@]1(c2c2)NCCCS1
> [C@](Cl)1(c2c2)NCCCS1 -> Cl[C@@]1(c2c2)NCCCS1
> RDKit failed to recognize these as the same:
> Cl[C@](F)1C2C(C1)CNC2 -> F[C@@]1(Cl)CC2CNCC21
> [C@](Cl)(F)1C2C(C1)CNC2 -> F[C@]1(Cl)CC2CNCC21
> RDKit failed to recognize these as the same:
> [*][C@@H]1CO1 -> [*][C@@H]1CO1
> [C@H]([*])1CO1 -> [*][C@H]1CO1
> RDKit failed to recognize these as the same:
> [*][C@@]1(C)CCO1 -> [*][C@@]1(C)CCO1
> [C@@]([*])1(C)CCO1 -> [*][C@]1(C)CCO1
> RDKit failed to recognize these as the same:
> F[C@@]1(C)CCO1 -> C[C@]1(F)CCO1
> [C@@](F)1(C)CCO1 -> C[C@@]1(F)CCO1
> RDKit failed to recognize these as the same:
> Cl[C@@H]1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@H](Cl)[C@H]1Cl
> [C@H](Cl)1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@@H](Cl)[C@H]1Cl
>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>


Re: [Rdkit-discuss] SMARTS for Joback and Reid method

2017-11-08 Thread Jason Biggs
Chenyang,
I haven't looked at your SMARTS strings yet, but I do have this list of
SMARTS strings for the Joback method I compiled myself (for use here:
https://www.wolframalpha.com/input/?i=2,3-methano-5,6-dichloroindene=3
).

Perhaps this can be of use.  If you spot any mistakes, please let me know.

Jason

$JobackSubstructures={

{"Methyl","-CH3", "[CX4H3]"},

{"SecondaryAcyclic", "-CH2-", "[!R;CX4H2]"},

{"TertiaryAcyclic",">CH-", "[!R;CX4H]"},

{"QuaternaryAcyclic", ">C<", "[!R;CX4H0]"},

{"PrimaryAlkene", "=CH2", "[CX3H2]"},

{"SecondaryAlkeneAcyclic", "=CH-", "[!R;CX3H1;!$([CX3H1](=O))]"},

{"TertiaryAlkeneAcyclic", "=C<", "[$([!R;#6X3H0]);!$([!R;#6X3H0]=[#8])]"},

{"CumulativeAlkene", "=C=", "[$([CX2H0](=*)=*)]"},

{"TerminalAlkyne", "\[Congruent]CH","[$([CX2H1]#[!#7])]"},

{"InternalAlkyne","\[Congruent]C-","[$([CX2H0]#[!#7])]"},

{"SecondaryCyclic", "-CH2- (ring)", "[R;CX4H2]"},

{"TertiaryCyclic", ">CH- (ring)", "[R;CX4H]"},

{"QuaternaryCyclic", ">C< (ring)", "[R;CX4H0]"},

{"SecondaryAlkeneCyclic", "=CH- (ring)", "[R;CX3H1,cX3H1]"},

{"TertiaryAlkeneCyclic", "=C< (ring)","[$([R;#6X3H0]);!$([R;#6X3H0]=[#8])]"},

{"Fluoro", "-F", "[F]"},

{"Chloro", "-Cl", "[Cl]"},

{"Bromo", "-Br", "[Br]"},

{"Iodo", "-I", "[I]"},

{"Alcohol","-OH", "[OX2H;!$([OX2H]-[#6]=[O]);!$([OX2H]-a)]"}, (* alcohol - not matching a carboxylic acid *)

{"Phenol","-OH", "[$([OX2H]-a)]"},

{"EtherAcyclic", "-O-", "[OX2H0;!R;!$([OX2H0]-[#6]=[#8])]"},

{"EtherCyclic", "-O- (ring)", "[#8X2H0;R;!$([#8X2H0]~[#6]=[#8])]"},

{"CarbonylAcyclic", ">C=O",
"[$([CX3H0](=[OX1]));!$([CX3](=[OX1])-[OX2]);!R]=O"},

{"CarbonylCyclic", ">C=O (ring)","[$([#6X3H0](=[OX1]));!$([#6X3](=[#8X1])~[#8X2]);R]=O"},

{"Aldehyde","O=CH-","[CX3H1](=O)"},

{"CarboxylicAcid", "COOH", "[OX2H]-[C]=O"},

{"Ester", "-C(=O)O-", "[#6X3H0;!$([#6X3H0](~O)(~O)(~O))](=[#8X1])[#8X2H0]"},

{"OxygenDoubleBondOther", "=O",
"[OX1H0;!$([OX1H0]~[#6X3]);!$([OX1H0]~[#7X3]~[#8])]"},

{"PrimaryAmino","NH2", "[NX3H2]"},

{"SecondaryAminoAcyclic",">NH", "[NX3H1;!R]"},

{"SecondaryAminoCyclic",">NH (ring)", "[#7X3H1;R]"},

{"TertiaryAmino", ">N-","[#7X3H0;!$([#7](~O)~O)]"}, (* Tertiary amine except nitro group *)

{"ImineCyclic","=N- (ring)","[#7X2H0;R]"},

{"ImineAcyclic","=N-","[#7X2H0;!R]"},

{"Aldimine", "=NH", "[#7X2H1]"},

{"Cyano", "-C\[Congruent]N","[#6X2]#[#7X1H0]"},

{"Nitro", "NO2", "[$([#7X3,#7X3+][!#8])](=[O])~[O-]"},

{"Thiol", "-SH", "[SX2H]"},

{"ThioetherAcyclic", "-S-", "[#16X2H0;!R]"},

{"ThioetherCyclic", "-S- (ring)", "[#16X2H0;R]"}

};

Jason Biggs


On Wed, Nov 8, 2017 at 4:52 PM, Chenyang Shi  wrote:

> Hi everyone,
>
> I have been recently working on a project that implements Joback method
> using RDKit (https://en.wikipedia.org/wiki/Joback_method).
>
> I believe the core to the success of this project is to make the 41
> functional groups correctly represented by SMARTS code. I have compiled my
> own codes, see attachment. I would appreciate your review of it and let me
> know if you spot errors.
>
> I think building a robust/well-tested SMARTS database (though small in my
> case) would be helpful to others and other projects.
>
> Thank you,
> Chenyang
>
> PS: The ones highlighted red in the document are robust.
>


Re: [Rdkit-discuss] Rdkit-discuss Digest, Vol 121, Issue 15

2017-11-08 Thread JW Feng via Rdkit-discuss
The Daylight website is a very good resource for SMILES, SMARTS, and
SMIRKS.

http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

JW

___
JW Feng, Ph.D.
Denali Therapeutics Inc.
151 Oyster Point Blvd, 2nd Floor, South San Francisco, CA 94080 | (650)
270-0628


Re: [Rdkit-discuss] SMARTS for =C=, #CH, #C-

2017-11-08 Thread Chenyang Shi
Dear Andy,

Thank you for a quick and thorough email. I find it very instructional,
although I need to read it a couple times more to digest it.

Cheers,
Chenyang



Re: [Rdkit-discuss] SMARTS for =C=, #CH, #C-

2017-11-08 Thread Andrew Dalke
On Nov 8, 2017, at 21:00, Chenyang Shi  wrote:
> =C= : [CH0;A;X2;!R](=[$(*)])=[$(*)] 

The recursive SMARTS notation, which is the term inside of the [$(...)], finds 
a match for the entire pattern and returns the first atom in that pattern.

> For example, if I search "C=C=O" using "[CH0;A;X2;!R](=[$(*)])=[$(*)]", 
> >>> from rdkit import Chem
> >>> m = Chem.MolFromSmiles('C=C=O')
> >>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH0;A;X2;!R](=[$(*)])=[$(*)]"))
> ((1, 0, 2),)
> 
> it prints out atomic positions 1, 0, 2--three positions. But I would expect 
> only one position for the Carbon in the middle.

The $(*) finds the pattern, which is a "*" and in this case the terminal 
carbons, and returns it. The substructure search returns 3 positions because 
the first is [CH0;A;X2;!R], the second is the first atom of "*", and the third 
is the first atom of the other "*".

If you only want the first atom of the entire pattern, then put the entire
pattern in a recursive SMARTS, as in:

  [$([CH0;A;X2;!R](=*)=*)]

>>> pat = Chem.MolFromSmarts("[$([CH0;A;X2;!R](=*)=*)]")
>>> mol = Chem.MolFromSmiles('C=C=O')
>>> mol.GetSubstructMatches(pat)
((1,),)

> Similarly, if I search "C#C" using "[CH1;A;X2;!R]#[$(*)]", 
> >>> from rdkit import Chem
> >>> m = Chem.MolFromSmiles('C#C')
> >>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH1;A;X2;!R]#[$(*)]"))
> ((0, 1),)
> I would expect two separate positions such as (0,), (1,), indicating there
> are two carbon triple bonds (with a hydrogen).

Since you are only looking for a single atom, try putting the entire pattern in 
a recursive SMARTS, as in

  [$([CH1;A;X2;!R]#*)]

>>> mol = Chem.MolFromSmiles("C#C")
>>> pat = Chem.MolFromSmarts("[$([CH1;A;X2;!R]#*)]")
>>> mol.GetSubstructMatches(pat)
((0,), (1,))


> Then if I search "CC#CC" using " [CH0;A;X2;!R]#[$(*)]",

I believe you want "[$([CH0;A;X2;!R]#*)]"

Thank you for your clear description of what you expected.

Cheers,

Andrew
da...@dalkescientific.com





[Rdkit-discuss] SMARTS for =C=, #CH, #C-

2017-11-08 Thread Chenyang Shi
Dear RDKitters,

I have a question regarding SMARTS codes for three simple functional
groups, these are =C=, #CH and #C-. I am new to SMARTS/SMILES. I indeed
tried to guess their codes. Here are my guesses:

=C= : [CH0;A;X2;!R](=[$(*)])=[$(*)]

#CH : [CH1;A;X2;!R]#[$(*)]

#C- :  [CH0;A;X2;!R]#[$(*)]

I checked these SMARTS at
http://smartsview.zbh.uni-hamburg.de/smartsview/calculate?method=get; they
all seem to make sense.

For example, the webpage prints out the following messages:

=C=: it says "aliphatic C with 0 further total connections, with 0 further
hydrogen, not in a ring".

#CH: "aliphatic C with 0 further total connections, with 1 further
hydrogen, not in a ring".

#C-: "aliphatic C with 1 further total connections, with 0 further
hydrogen, not in a ring".

However, when I search subgroups using these SMARTS, I had problems.

For example, if I search "C=C=O" using "[CH0;A;X2;!R](=[$(*)])=[$(*)]",
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('C=C=O')
>>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH0;A;X2;!R](=[$(*)])=[$(*)]"))
((1, 0, 2),)

it prints out atomic positions 1, 0, 2--three positions. But I would expect
only one position for the Carbon in the middle.

Similarly, if I search "C#C" using "[CH1;A;X2;!R]#[$(*)]",
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('C#C')
>>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH1;A;X2;!R]#[$(*)]"))
((0, 1),)
I would expect two separate positions such as (0,), (1,), indicating there
are two carbon triple bonds (with a hydrogen).


Then if I search "CC#CC" using " [CH0;A;X2;!R]#[$(*)]",
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('CC#CC')
>>> m.GetSubstructMatches(Chem.MolFromSmarts(" [CH0;A;X2;!R]#[$(*)]"))
((1, 2),)
Again, I would expect two separate positions such as (1,), (2,), indicating
two carbon triple bonds.

I think the problem might be that my SMARTS for these three groups are not
specific enough. I would appreciate everyone's help on this.

Cheers,
Chenyang


Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match

2017-11-08 Thread James T. Metz via Rdkit-discuss


Brian, Greg, and David,


Thank you for your suggestions.  I will try to respond to your questions
and comments:

I am trying to reproduce results from a literature paper that used
non-Python and non-RDKit code to identify certain patterns in molecules
as part of a group contribution scheme resulting in the prediction of
thermodynamic quantities.  I have a training set of molecules and the
results of calculations for that training set (individual counts of
groups of atoms and resulting energies).  Hence, my first goal is to
reproduce the results reported for that training set, but using Python
and RDKit.  Since my goal is to reproduce literature results as closely
as possible, I am not in a position to debate the logic of the original
authors in their assignments of SMARTS/SMILES matching and counts.

After this initial goal is met, I might consider alternative pattern
matching and counting schemes and compare those results to the
literature results.  In fact, that would be good science.

As I mentioned in my first email on this topic, I do think I have come up
with a "rule" that will give me the correct answer (I have tried it for
8 cases using pencil and paper); my challenge is to code up the "rule" in
Python.  I am a beginner at Python, so I am struggling to get this idea
into functional, bug-free code.  Peter Shenkin's idea/code is getting
close to what needs to be done, but doesn't handle all the cases.


Regards,

Jim Metz




-Original Message-
From: Brian Cole 
To: James T. Metz 
Cc: RDKit Discuss 
Sent: Tue, Nov 7, 2017 7:23 pm
Subject: Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match



You can use Chem.CanonicalRankAtoms to de-duplicate the SMARTS matches based 
upon the atom symmetry like this: 



from rdkit import Chem

def count_unique_substructures(smiles, smarts):
    mol = Chem.MolFromSmiles(smiles)
    # breakTies=False gives symmetry-equivalent atoms the same rank
    ranks = list(Chem.CanonicalRankAtoms(mol, breakTies=False))
    pattern = Chem.MolFromSmarts(smarts)

    # Matches that hit the same set of ranks are treated as duplicates
    unique_sets_of_atoms = set()
    for match in mol.GetSubstructMatches(pattern):
        match_ranks = frozenset([ranks[idx] for idx in match])
        unique_sets_of_atoms.add(match_ranks)

    return len(unique_sets_of_atoms)



However, this returns 1 for each of your cases. It's not clear to me why you 
would want your 2nd case to return 2 as all paths from a chlorine to a chlorine 
through 2 carbons are symmetric. 



>>> SMARTS = '[Cl]-[C,c]-,=,:[C,c]-[Cl]'
>>> smiles1 = 'ClC(Cl)CCl'
>>> smiles2 = 'ClC(Cl)C(Cl)(Cl)(Cl)'

>>> count_unique_substructures(smiles1, SMARTS)
1
>>> count_unique_substructures(smiles2, SMARTS)
1
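
To see why both inputs collapse to a single count, the de-duplication step
can be replayed without RDKit. This is only a sketch: the symmetry classes
below are assigned by hand (they are labels chosen for illustration, not
values returned by Chem.CanonicalRankAtoms), the match tuples are enumerated
from the atom order of each SMILES string, and count_unique is a hypothetical
helper name.

```python
from itertools import product


def count_unique(matches, classes):
    """Count matches that differ after atom indices are replaced by symmetry classes."""
    return len({frozenset(classes[idx] for idx in match) for match in matches})


# smiles1 = 'ClC(Cl)CCl' -> atoms 0:Cl 1:C 2:Cl 3:C 4:Cl
# Hand-assigned classes: the two chlorines on atom 1 are equivalent.
classes1 = ["Cl_a", "C_a", "Cl_a", "C_b", "Cl_b"]
matches1 = [(0, 1, 3, 4), (2, 1, 3, 4)]

# smiles2 = 'ClC(Cl)C(Cl)(Cl)(Cl)' -> atoms 0:Cl 1:C 2:Cl 3:C 4:Cl 5:Cl 6:Cl
classes2 = ["Cl_a", "C_a", "Cl_a", "C_b", "Cl_b", "Cl_b", "Cl_b"]
matches2 = [(a, 1, 3, b) for a, b in product((0, 2), (4, 5, 6))]

print(count_unique(matches1, classes1))  # 1
print(count_unique(matches2, classes2))  # 1
```

Every match touches one chlorine from each end plus both carbons, so all of
them map to the same set of classes; a count of 2 for smiles2 would need a
rule that is not based on symmetry alone.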


-Brian







On Tue, Nov 7, 2017 at 7:38 PM, James T. Metz via Rdkit-discuss 
 wrote:

RDkit Discussion Group,




I have written a SMARTS to detect vicinal chlorine groups
using RDKit.  There are 4 atoms involved in a vicinal chlorine group.

SMARTS = '[Cl]-[C,c]-,=,:[C,c]-[Cl]'

I am trying to count the number of ("unique") occurrences of this
pattern.

For some molecules with symmetry, this results in over-counting.

For the molecule, smiles1 below, I want to obtain
a count of 1, i.e., 1 tuple of 4 atoms.

smiles1 = 'ClC(Cl)CCl'

However, using the SMARTS above, I obtain 2 tuples of 4 atoms.
Beginning with a MOL file representation of smiles1, I get

((1,2,4,3), (0,2,4,3))

One possible solution is to somehow merge the two tuples according
to a "rule."  One rule that works is "if 3 of the atom indices are the same,
then combine into one tuple."

However, the rule needs a bit of modification for more complicated
cases (higher symmetry).

Consider

smiles2 = 'ClC(Cl)C(Cl)(Cl)(Cl)'

My goal is to get 2 tuples of 4 atoms for smiles2.

smiles2 is somewhat tricky because there are either
2 groups of 3 (4 atom) tuples, or 3 groups of 2 (4 atom)
tuples depending on how you choose your 3 atom indices.

Again, if my goal is to get 2 tuples, then I need to somehow
pick the largest group, i.e., 2 groups of 3 tuples to do the merge
operation which will give me 2 remaining groups (desired).

I have already checked stackoverflow and a few other places
for Python code to do the necessary merging, but I could not
find anything specific and appropriate.

I would be most grateful if anyone has ideas how to do this.  I
suspect the answer is a few lines of well-written Python code,
and not modifying the SMARTS (I could be mistaken!).


Thank you.



Regards,

Jim Metz






Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match

2017-11-08 Thread James T. Metz via Rdkit-discuss
Peter,


Thank you for your suggestions and accompanying code.


I have modified your code slightly and have created 3 tuples
for testing.  Your code works for tuples, match1 and match2, but
does not work for match3.  The code should return a 2 for match3,
because there are 2 sets of 3 tuples each containing 4 atom indices.
Using my "rule" that, "if 3 indices are the same, they are in one group 
and one must form the groups of the largest possible size", one arrives
at 2 groups.  The merge function should then select one tuple from
each group, resulting in a count of 2 (for the final number of groups).


Keep in mind that I will not know how many groups of tuples will be
created for any given molecule.  Hence, I cannot use hard-coded array
indices.

Any ideas how to modify the code below to obtain the desired result
for tuple, match3, and how to deal with tuples of various sizes?


Regards,

Jim Metz








def merge2(matches):
    if len(matches) > 1:
        d = {}
        for match in matches:
            t = (matches[0], matches[1])
            if (matches[0] < matches[1]):
                t = (matches[0], matches[1])
            else:
                t = (matches[1], matches[0])
            d[t] = match
        merged_match = (d[t],)
    else:
        merged_match = matches

    count = len(merged_match)
    return(count)


match1 = ((0,2,3,4),)
match2 = ((0,2,3,4), (1,2,3,4))
match3 = ((0,2,4,5), (1,2,5,6), (2,3,4,5), (2,3,5,6), (0,2,5,6), (1,2,4,5))
matches = match2   # Change the number to test different tuples


output = merge2(matches)
print("Output is   ", output)
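
One way the "3 shared indices, largest groups first" rule might be coded is
sketched below. It is a greedy approximation under stated assumptions, not a
tested solution for the literature scheme: each tuple is treated as an
unordered set of atom indices, count_merged_groups and its shared parameter
are hypothetical names, and a greedy choice is not guaranteed to give the
fewest groups in every case.

```python
from itertools import combinations


def count_merged_groups(matches, shared=3):
    """Greedily merge tuples that have `shared` atom indices in common.

    At each step the index subset covering the most unassigned tuples is
    taken as one group; the number of groups formed is returned.
    """
    remaining = [frozenset(m) for m in matches]
    groups = 0
    while remaining:
        coverage = {}
        for atoms in remaining:
            for key in combinations(sorted(atoms), shared):
                coverage.setdefault(key, []).append(atoms)
        if not coverage:
            # Tuples shorter than `shared` cannot be merged; count each one.
            groups += len(remaining)
            break
        # Largest group first; ties go to the smallest key for determinism.
        best_key = min(coverage, key=lambda k: (-len(coverage[k]), k))
        covered = coverage[best_key]
        remaining = [m for m in remaining if m not in covered]
        groups += 1
    return groups


match1 = ((0, 2, 3, 4),)
match2 = ((0, 2, 3, 4), (1, 2, 3, 4))
match3 = ((0, 2, 4, 5), (1, 2, 5, 6), (2, 3, 4, 5),
          (2, 3, 5, 6), (0, 2, 5, 6), (1, 2, 4, 5))

for matches in (match1, match2, match3):
    print(count_merged_groups(matches))  # 1, then 1, then 2
```

No array indices are hard-coded, so the same function accepts any number of
tuples; the shared parameter would need adjusting for patterns with a
different number of atoms.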










-Original Message-
From: Peter S. Shenkin 
To: James T. Metz 
Cc: RDKit Discuss 
Sent: Tue, Nov 7, 2017 7:05 pm
Subject: Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match



I think you probably used a slightly different SMILES than the one you showed. 
The one you showed should have given ((0,1,3,4),(2,1,3,4)).


The proper merge rule would then be to consider all matches equivalent if the 
2nd and 3rd atom in the match agree, in any order; i.e, the two carbons, 
indices 1 and 3 in this case. 


So to do this, for each molecule, do something like this:


d = {}
for match in matches:
    # Order the two central carbons so (a, b) and (b, a) share one key
    if match[1] < match[2]:
        t = (match[1], match[2])
    else:
        t = (match[2], match[1])
    d[t] = match


You will wind up with as many dictionary elements as there are unique
matches, i.e., one per distinct pair of central atoms.
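
The loop above can be wrapped into a small function and tried on the two
matches quoted earlier in this message. This is a minimal sketch;
group_by_central_pair is a hypothetical name, and it assumes every match
lists the two carbons in positions 1 and 2, as the SMARTS pattern does.

```python
def group_by_central_pair(matches):
    """Map each unordered (2nd atom, 3rd atom) pair to one representative match."""
    groups = {}
    for match in matches:
        key = tuple(sorted((match[1], match[2])))
        groups[key] = match  # later matches overwrite earlier ones
    return groups


matches = ((0, 1, 3, 4), (2, 1, 3, 4))
groups = group_by_central_pair(matches)
print(len(groups))  # 1
print(groups)       # {(1, 3): (2, 1, 3, 4)}
```

Because the key depends only on the two central atoms, every match across the
same carbon-carbon bond lands in the same dictionary entry.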


-P.
 




[Rdkit-discuss] RPM distros

2017-11-08 Thread Tim Dudgeon
There is mention of RPM distributions of RDKit 
(https://copr.fedorainfracloud.org/coprs/giallu/rdkit/).


But on trying these:

1. the distro is based on the 2017_03_1 release
2. it fails due to missing libinchi.so.1 dependency.

This is presumably no longer being maintained?
Anything that can be done to help with fixing this?

Tim

