SureChEMBL gets its structures from:
- USPTO attached molfiles (deposited structures)
- names, using tools including OPSIN, ChemAxon, Lexichem and ACD
- images, using tools including OSRA, Imago and CLiDE
As Nicolas points out, issues like this one can occur when auto generating
Currently, structures are generated automatically in SureChEMBL, so this kind of
error unfortunately happens…
My colleagues will address this issue as soon as possible.
Dr Nicolas Bosc
Data Mining and Analysis Scientist
Thanks a lot Greg! That is indeed very helpful.
Just to know that the molecule is odd is helpful too. The mol blocks appear
to be V2000 format and have names like "Mrv0541 03021215572D" which says
ChemAxon Marvin to me, but I'm still unsure why SureChEMBL would use such a
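For anyone wanting to check the origin of a mol block the same way, here is a minimal sketch that pulls the program name and timestamp out of the second header line of a V2000 molfile. It assumes the usual CTfile layout (user initials, program name, then MMDDYYHHmm and a 2D/3D flag) and that the initials field is blank, as in the string quoted above. `parse_program_line` is a hypothetical helper name, not part of the RDKit or of SureChEMBL.

```python
import re
from datetime import datetime

# Program name, then a 10-digit MMDDYYHHmm stamp, then the 2D/3D flag.
# Assumption: the two-character initials field in front is blank.
HEADER_RE = re.compile(r"^\s*(?P<program>\S+)\s+(?P<stamp>\d{10})(?P<dim>[23]D)")


def parse_program_line(line):
    """Return program name, write time and dimension flag, or None if no match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return {
        "program": match.group("program"),
        "written": datetime.strptime(match.group("stamp"), "%m%d%y%H%M"),
        "dimension": match.group("dim"),
    }


info = parse_program_line("Mrv0541 03021215572D")
print(info)
```

On the header quoted above this yields the program string `Mrv0541` and a write time in March 2012, which fits the guess that the block was written by Marvin.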
Dealing with all the strange chemical representations that show up "in the
wild" is an ongoing struggle.
Your first example is pretty clearly intended to be an azide and we can
certainly add a rule to normalize that one to what the RDKit expects it to
be (there already is a rule for
Reading molecules from a bulk download of SureChEMBL, I come across a fair
few molecules that fail to parse. Not sure whether they SHOULD parse or
Here is an example: https://www.surechembl.org/chemical/SCHEMBL386
with SMILES code: COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1
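To see why a record like this is rejected, one option is to read it without sanitization and ask the RDKit which atoms it objects to. The sketch below uses `Chem.MolFromSmiles` and `Chem.DetectChemistryProblems`; the `diagnose` wrapper is a hypothetical helper written for this illustration.

```python
from rdkit import Chem

# SMILES quoted above for SCHEMBL386
smiles = "COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1"


def diagnose(smi):
    """Return (parsed_ok, problems); problems is a list of (type, atom_idx)."""
    if Chem.MolFromSmiles(smi) is not None:
        return True, []
    # Second attempt: build the graph only, skipping all chemistry checks.
    raw = Chem.MolFromSmiles(smi, sanitize=False)
    if raw is None:
        # Not even syntactically readable as SMILES.
        return False, [("SmilesSyntaxError", None)]
    problems = []
    for problem in Chem.DetectChemistryProblems(raw):
        # Only atom-level problems carry a single atom index.
        idx = problem.GetAtomIdx() if hasattr(problem, "GetAtomIdx") else None
        problems.append((problem.GetType(), idx))
    return False, problems


ok, problems = diagnose(smiles)
print(ok, problems)
for _, idx in problems:
    if idx is not None:
        atom = Chem.MolFromSmiles(smiles, sanitize=False).GetAtomWithIdx(idx)
        print(idx, atom.GetSymbol(), atom.GetFormalCharge())
```

For this SMILES the complaint should point at the uncharged middle nitrogen of the `[N+]#[N]=[N-]` group, which carries a triple and a double bond.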