Re: [Rdkit-discuss] Error parsing a MUTAG smiles

2020-03-04 Thread Brian Cole
Note, the location of the first opening parenthesis is different:

>>> 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'.find('(')
13
>>> 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'.find('(')
15
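
The same comparison can be made character by character. This is only a small sketch in plain Python; the helper name differing_positions is made up for illustration and is not part of RDKit:

```python
smiles_82 = 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'
smiles_187 = 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'


def differing_positions(a, b):
    """Return the indices at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError('strings must have the same length')
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


positions = differing_positions(smiles_82, smiles_187)
print(positions)
for i in positions:
    # Show the character each SMILES has at the differing index.
    print(i, repr(smiles_82[i]), repr(smiles_187[i]))
```

The two strings have the same length and differ only in the three characters around the branch opening, which is why they are easy to mistake for duplicates.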

So the SMILES are syntactically valid representations of 2- and
3-nitrocarbazole, though semantically odd because they mix Kekulé
and aromatic notation. I would expect something in either of these
two forms, pulled from the PubChem website:


In [3]: Chem.CanonSmiles('C1=CC=C2C(=C1)C3=C(N2)C=CC(=C3)[N+](=O)[O-]')
Out[3]: 'O=[N+]([O-])c1ccc2[nH]c3ccccc3c2c1'

In [4]: Chem.CanonSmiles('C1=CC=C2C(=C1)C3=C(N2)C=C(C=C3)[N+](=O)[O-]')
Out[4]: 'O=[N+]([O-])c1ccc2c(c1)[nH]c1ccccc12'

The original two SMILES seem to be trying to represent a state with
the nitrogen deprotonated ('=NC3='), which leads RDKit to assign one
of the ring carbons a valence of 5:


In [5]: Chem.CanonSmiles('c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]')
[18:24:50] Explicit valence for atom # 10 C, 5, is greater than permitted
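
To see where that valence of 5 comes from, the written bond orders can be tallied by hand. The following is a minimal sketch in plain Python, not RDKit's parser. The function names are hypothetical, it handles only the small SMILES subset used in this thread (organic-subset atoms, bracket atoms, '-', '=', '#', branches, single-digit ring closures), and it counts every unwritten bond, including aromatic ones, as 1:

```python
def bond_order_sums(smiles):
    """Return [(atom symbol, sum of written bond orders)] in atom order."""
    orders = {'-': 1, '=': 2, '#': 3}
    symbols, sums = [], []
    branch_stack = []   # atoms to return to when a branch closes
    ring_open = {}      # ring digit -> (atom index, bond order or None)
    prev = None         # index of the atom the next bond attaches to
    pending = None      # explicit bond order waiting for its second atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == '(':
            branch_stack.append(prev)
        elif ch == ')':
            prev = branch_stack.pop()
        elif ch in orders:
            pending = orders[ch]
        elif ch.isdigit():
            if ch in ring_open:
                # Closing a ring: add the bond to both ring atoms.
                other, opened_with = ring_open.pop(ch)
                order = pending or opened_with or 1
                sums[prev] += order
                sums[other] += order
            else:
                ring_open[ch] = (prev, pending)
            pending = None
        else:
            # Read one atom: bracket atom, two-letter halogen, or one letter.
            if ch == '[':
                end = smiles.index(']', i)
                symbol = smiles[i:end + 1]
                i = end
            elif smiles[i:i + 2] in ('Cl', 'Br'):
                symbol = smiles[i:i + 2]
                i += 1
            else:
                symbol = ch
            symbols.append(symbol)
            sums.append(0)
            idx = len(sums) - 1
            if prev is not None:
                order = pending or 1
                sums[prev] += order
                sums[idx] += order
            prev = idx
            pending = None
        i += 1
    return list(zip(symbols, sums))


def over_valent_carbons(smiles):
    """Indices of carbons whose written bond orders add up to more than 4."""
    return [i for i, (symbol, total) in enumerate(bond_order_sums(smiles))
            if symbol in ('C', 'c') and total > 4]


for smi in ('c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]',
            'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'):
    print(smi, over_valent_carbons(smi))
```

For both SMILES the only carbon above 4 is atom 10, the 'C3' in 'C=C3=c2': it has a written double bond on each side plus the single ring-closure bond back to atom 5. Atom 5 itself totals 4.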


This looks like a disagreement with another toolkit about how to parse
this SMILES:


In [6]: from openeye.oechem import *

In [7]: mol = OEMol()

In [8]: OESmilesToMol(mol, 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]')
Out[8]: True

In [9]: OEMolToSmiles(mol)
Out[9]: 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'

In [10]: for bond in mol.GetAtom(OEHasAtomIdx(5)).GetBonds():
    ...:     print(bond.GetOrder())
    ...:
1
1
2

-Brian


On Wed, Mar 4, 2020 at 6:04 PM Lewis Martin wrote:

> Hi all,
> I'm chasing up a small puzzle on parsing SMILES codes, if anyone's
> interested, though it's not directly about RDKit. I was looking at the molecules in
> the MUTAG dataset, which is commonly used in graph learning research.
> Mostly these are just shared as graphs (i.e. vertices and edges) rather
> than SMILES codes, but I did find a list of SMILES at ChemDB:
> http://cdb.ics.uci.edu/cgibin/LearningDatasetsWeb.py
>
> RDKit can't parse two of the SMILES codes - indices 82 and 187 using zero
> indexing - and you'll note that they are the same:
>
> smiles[82]
>
> 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'
>
>
> smiles[187]
>
> 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'
>
>
> I found these in the original paper* as '83 - 2-nitrocarbazole' and '188
> - 3-nitrocarbazole' (1-indexed). So the SMILES codes should be similar
> but not exactly the same.
>
> Can anyone please tell me whether those SMILES codes are legitimate or
> whether there has been a transcription error? It's easy to replace these
> with the correct codes from PubChem, but if you're familiar with the
> dataset, is it safe to trust the other codes? Why does the paper have 197
> molecules but the dataset only has 188?
>
> Thanks for your time!
>
>
> PubChem for 2-nitrocarbazole:
> https://pubchem.ncbi.nlm.nih.gov/compound/99612#section=InChI
> PubChem for 3-nitrocarbazole:
> https://pubchem.ncbi.nlm.nih.gov/compound/3-Nitrocarbazole
>
>
>
> *original paper is "Structure-activity relationship of mutagenic aromatic
> and heteroaromatic nitro compounds. Correlation with molecular orbital
> energies and hydrophobicity" https://www.ncbi.nlm.nih.gov/pubmed/1995902
>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Error parsing a MUTAG smiles

2020-03-04 Thread Lewis Martin
Hi all,
I'm chasing up a small puzzle on parsing SMILES codes, if anyone's
interested, though it's not directly about RDKit. I was looking at the molecules in
the MUTAG dataset, which is commonly used in graph learning research.
Mostly these are just shared as graphs (i.e. vertices and edges) rather
than SMILES codes, but I did find a list of SMILES at ChemDB:
http://cdb.ics.uci.edu/cgibin/LearningDatasetsWeb.py

RDKit can't parse two of the SMILES codes - indices 82 and 187 using zero
indexing - and you'll note that they are the same:

smiles[82]

'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'


smiles[187]

'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'


I found these in the original paper* as '83 - 2-nitrocarbazole' and '188 -
3-nitrocarbazole' (1-indexed). So the SMILES codes should be similar but
not exactly the same.

Can anyone please tell me whether those SMILES codes are legitimate or
whether there has been a transcription error? It's easy to replace these
with the correct codes from PubChem, but if you're familiar with the
dataset, is it safe to trust the other codes? Why does the paper have 197
molecules but the dataset only has 188?

Thanks for your time!


PubChem for 2-nitrocarbazole:
https://pubchem.ncbi.nlm.nih.gov/compound/99612#section=InChI
PubChem for 3-nitrocarbazole:
https://pubchem.ncbi.nlm.nih.gov/compound/3-Nitrocarbazole



*original paper is "Structure-activity relationship of mutagenic aromatic
and heteroaromatic nitro compounds. Correlation with molecular orbital
energies and hydrophobicity" https://www.ncbi.nlm.nih.gov/pubmed/1995902


[Rdkit-discuss] Are any of the UGM talks recorded?

2020-03-04 Thread Jennifer Wei via Rdkit-discuss
Hi All,

I was wondering if any of the UGM talks are recorded. I'm particularly
interested in this talk by Robert Sayle from the 2019 UGM.

Thanks!
Jennifer