Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-16 Thread Eloy Félix
Hi Lewis,

SureChEMBL is getting its structures from:
- USPTO attached molfiles (deposited structures)
- names, using tools including OPSIN, ChemAxon, Lexichem, ACD
- images, using tools including OSRA, imago, CLiDE

As Nicolas points out, issues like this one can occur when auto generating

Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-16 Thread Nicolas Bosc
Hi Lewis, Currently structures are generated automatically in SureChEMBL so this kind of error unfortunately happens… My colleagues will address this issue as soon as possible. Cheers, Nicolas --- Dr Nicolas Bosc Data Mining and Analysis Scientist

Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-15 Thread Lewis Martin
Thanks a lot Greg! That is indeed very helpful. Just to know that the molecule is odd is helpful too. The mol blocks appear to be V2000 format and have names like "Mrv0541 03021215572D" which says ChemAxon Marvin to me, but I'm still unsure why SureChEMBL would use such a representation (it
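For readers wondering how a string like "Mrv0541 03021215572D" points to Marvin: the second header line of a V2000 molfile packs a program name and a timestamp into fixed-width fields. Below is a minimal sketch of splitting the string exactly as quoted above. It assumes the two-character user-initials field that normally precedes the program name has already been dropped, and `parse_program_stamp` is a hypothetical helper name, not part of RDKit.

```python
from datetime import datetime


def parse_program_stamp(stamp: str) -> dict:
    """Split a molfile program/timestamp string into its fixed-width fields.

    Assumption: the input starts at the 8-character program-name field,
    i.e. the 2-character user-initials field has already been removed,
    as in the string quoted in the message above.
    """
    program = stamp[0:8].strip()                             # e.g. "Mrv0541"
    written = datetime.strptime(stamp[8:18], "%m%d%y%H%M")   # MMDDYYHHmm
    dimensions = stamp[18:20]                                # "2D" or "3D"
    return {"program": program, "written": written, "dimensions": dimensions}


info = parse_program_stamp("Mrv0541 03021215572D")
print(info["program"], info["written"].isoformat(), info["dimensions"])
```

The "Mrv" prefix is what suggests ChemAxon Marvin here; the remaining digits read as a date and time followed by the coordinate dimensionality.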

Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-14 Thread Greg Landrum
Hi Lewis, Dealing with all the strange chemical representations that show up "in the wild" is an ongoing struggle. Your first example is pretty clearly intended to be an azide and we can certainly add a rule to normalize that one to what the RDKit expects it to be (there already is a rule for

[Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-14 Thread Lewis Martin
Hi All, Reading molecules from a bulk download of SureChEMBL, I come across a fair few molecules that fail to parse. Not sure whether they SHOULD parse or not. Here is an example: https://www.surechembl.org/chemical/SCHEMBL386 with SMILES code: COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1
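As a rough illustration of the azide reading suggested in the reply, here is a minimal sketch that rewrites the odd `[N+]#[N]=[N-]` fragment to the conventional `N=[N+]=[N-]` form before the SMILES is handed to a parser. This is a plain text substitution, not an RDKit normalization rule; it assumes the group is written in exactly this atom order and is attached through a single bond, and `rewrite_odd_azide` is a hypothetical helper name.

```python
# Fragment as it appears in the SureChEMBL SMILES: the middle nitrogen
# carries a triple and a double bond (five bonds) with no formal charge.
ODD_AZIDE = "[N+]#[N]=[N-]"
# Conventional charge-separated azide, attached via the first nitrogen.
STANDARD_AZIDE = "N=[N+]=[N-]"


def rewrite_odd_azide(smiles: str) -> str:
    """Replace the odd azide spelling with the conventional one.

    Assumption: a literal substring match is good enough, which only holds
    when the fragment is written exactly as in ODD_AZIDE.
    """
    return smiles.replace(ODD_AZIDE, STANDARD_AZIDE)


original = "COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1"
fixed = rewrite_odd_azide(original)
print(fixed)
```

A substring replacement like this is brittle compared with a substructure-based rule, so it is best seen as a quick pre-filter for a bulk download rather than a general fix.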