Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL
Hi Lewis,

SureChEMBL gets its structures from:
- USPTO attached molfiles (deposited structures)
- names, using tools including OPSIN, ChemAxon, Lexichem and ACD
- images, using tools including OSRA, Imago and CLiDE

As Nicolas points out, issues like this one can occur when structures are generated automatically from names and images. That is the case for the 2 structures you mention.

We plan to review all the tools we use to generate structures, as we know there are some new ones out there.

Cheers,
Eloy

On Thu, 16 Dec 2021 at 09:28, Nicolas Bosc wrote:

> Hi Lewis,
>
> Currently structures are generated automatically in SureChEMBL so this
> kind of error unfortunately happens…
>
> My colleagues will address this issue as soon as possible.
>
> Cheers,
> Nicolas
> ---
> Dr Nicolas Bosc
> Data Mining and Analysis Scientist
> ChEMBL group
> EMBL-EBI
> Wellcome Genome Campus
> Hinxton, Cambridge, CB10 1SD
> United Kingdom
>
> nb...@ebi.ac.uk
> +44 1223 492519
>
> On 15 Dec 2021, at 20:42, Lewis Martin wrote:
>
> Thanks a lot Greg! That is indeed very helpful.
>
> Just to know that the molecule is odd is helpful too. The mol blocks
> appear to be V2000 format and have names like "Mrv0541 03021215572D"
> which says ChemAxon Marvin to me, but I'm still unsure why SureChEMBL would
> use such a representation (it doesn't look like a faithful transcription
> from the source patent). Off-topic, but if anyone happens to have an
> insight or connection with SureChEMBL, please do reach out!
>
> Cheers
> Lewis
>
> On Wed, Dec 15, 2021 at 4:24 PM Greg Landrum
> wrote:
>
>> Hi Lewis,
>>
>> Dealing with all the strange chemical representations that show up "in
>> the wild" is an ongoing struggle.
>>
>> Your first example is pretty clearly intended to be an azide and we can
>> certainly add a rule to normalize that one to what the RDKit expects it to
>> be (there already is a rule for C-N=N#N, but that doesn't help here). That
>> won't happen before the next feature release though.
>>
>> I'm not really sure what the intent was for the two
>> four-valent neutral Ns in the second molecule, so I think it's unlikely
>> that we'd add a standard cleanup for that one.
>>
>> However! The good news is that there's a pretty easy (and efficient) way
>> to fix this yourself. We added a new method to chemical reactions in the
>> 2021.09 release which allows you to modify a molecule in place (subject to
>> some constraints). This is ideal for doing cleanup transformations like
>> these.
>>
>> This gist shows how to write reaction rules for your cases (I guessed
>> what the Ns are supposed to be) and then use them:
>> https://gist.github.com/greglandrum/8fd229bc6bf6c734d1c21da7f2bebebb
>>
>> Hope this helps,
>> -greg
>>
>> On Wed, Dec 15, 2021 at 12:21 AM Lewis Martin
>> wrote:
>>
>>> Hi All,
>>> Reading molecules from a bulk download of SureChEMBL, I come across a
>>> fair few molecules that fail to parse. Not sure whether they SHOULD parse
>>> or not.
>>>
>>> Here is an example: https://www.surechembl.org/chemical/SCHEMBL386
>>> SMILES: COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1
>>>
>>> Even reading the SMILES one can see that there are too many bonds
>>> in there - a nitrogen triply bonded and doubly bonded to other atoms.
>>>
>>> Another example: https://www.surechembl.org/chemical/SCHEMBL33957
>>> SMILES: NC(N)=[NH]C1=NC(CSCC[NH]=CNS(=O)(=O)C2=CC=C(Br)C=C2)=CS1
>>>
>>> Again, the valence for a nitrogen is off.
>>>
>>> Should I expect to parse these with RDKit? Might there be some way
>>> around this? It's a significant fraction of the molecules in SureChEMBL.
>>>
>>> Thanks team!
>>> Lewis
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
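
Greg's suggestion in this thread (an in-place reaction rule that repairs the malformed azide before sanitization) might look roughly like the sketch below. It is not a copy of the linked gist: the reaction SMARTS and the helper name `parse_with_azide_fix` are assumptions made for illustration, and it relies on `RunReactantInPlace`, which per the message above needs RDKit 2021.09 or later. The second molecule is left out because the intended structure of its nitrogens is unclear.

```python
from rdkit import Chem
from rdkit.Chem import rdChemReactions

# Assumed cleanup rule (illustrative, not RDKit's built-in normalization):
# rewrite C-[N+]#N=[N-] as the usual azide form C-N=[N+]=[N-].
AZIDE_FIX = rdChemReactions.ReactionFromSmarts(
    "[N+:1]#[N+0:2]=[N-:3]>>[N+0:1]=[N+:2]=[N-:3]"
)
AZIDE_FIX.Initialize()


def parse_with_azide_fix(smiles):
    """Parse a SMILES without sanitization, repair the azide, then sanitize.

    Hypothetical helper: returns an RDKit Mol, or None if the SMILES
    cannot be parsed at all. Sanitization errors are allowed to propagate.
    """
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return None
    # Compute valence information leniently so the unsanitized molecule
    # can be matched and modified.
    mol.UpdatePropertyCache(strict=False)
    # Modifies mol directly; returns False when the rule does not match.
    AZIDE_FIX.RunReactantInPlace(mol)
    Chem.SanitizeMol(mol)
    return mol


smiles = "COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1"
fixed = parse_with_azide_fix(smiles)
print(Chem.MolToSmiles(fixed))
```

Parsing with `sanitize=False` is what makes this work: the valence check that rejects the original SMILES only runs after the rule has been applied.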
Re: [Rdkit-discuss] unique chemical representation
Hi,

All the steps of ChEMBL's RDKit-based standardiser are explained in the recently published manuscript:
https://link.springer.com/article/10.1186/s13321-020-00456-1

Hope it helps!

Regards,
Eloy

On Mon, 14 Sep 2020 at 04:32, Francois Berenger wrote:

> On 12/09/2020 00:27, Mike Mazanetz wrote:
> > Dear Forum,
> >
> > I'm curious as to how the community standardizes molecules to generate
> > unique chemical representations.
> >
> > Please let me know what people's preferred means are to treat:
> >
> > * Tautomers
> > * Protomers
> > * Resonance structures
> > * Salts when the salt is larger than the ligand
>
> Here is how ChEMBL does it:
>
> https://github.com/chembl/ChEMBL_Structure_Pipeline
>
> Not sure they handle all the cases you listed, though.
>
> Regards,
> F.
>
> > Particularly when converting between chemical representations: SDF to
> > SMILES, SMARTS to SMILES, and one flavour of SMILES to another.
> >
> > And are there any caveats to consider, such as the correct assignment
> > of heterocyclic nitrogens as aromatic?
> >
> > I look forward to hearing your thoughts.
> >
> > Regards,
> >
> > mike
Re: [Rdkit-discuss] want advice for good teaching data set
Hi Andrew,

If you want to build a model, I guess what you want is experimental logP values. This should give you something to start with:

    select ACTIVITY_ID, MOLREGNO, STANDARD_VALUE, STANDARD_TYPE
    from ACTIVITIES
    where STANDARD_TYPE = 'LogP'
      and STANDARD_VALUE is not null
      and data_validity_comment is null
      and POTENTIAL_DUPLICATE = 0;

Eloy.

2018-08-29 14:51 GMT+01:00 TJ O'Donnell :

> Hi Andrew
> ChEMBL 24 has compound properties in the table compound_properties. I
> think the alogp is computed using (Crippen) atom types and the acd_logp
> uses ACD labs methods.
> TJ
>
> On Wed, Aug 29, 2018 at 5:52 AM Andrew Dalke
> wrote:
>
>> Hi all,
>>
>> I am starting to put together materials for the Python/RDKit training
>> course I'm giving just before the RDKit UGM next month.
>>
>> I would like to structure part of it around the SQLite release of the
>> ChEMBL data set. More specifically, I plan to include examples of machine
>> learning with scikit-learn, using RDKit descriptors and values from ChEMBL
>> 24 (and making sure to use the new schema).
>>
>> Two problems. First, I'm not a computational chemist and I don't know
>> what would constitute a good example to use. "Good" in this case means one
>> whose outlines are well-known to likely students. Second, I don't have much
>> experience with the ChEMBL data.
>>
>> My thought is to make a logP model. The easiest would be to base it on
>> atom types. For this option, can anyone suggest where I can find logP data
>> from ChEMBL?
>>
>> Another possibility is to use a pre-existing model, like the notebook
>> George Papadatos did for Ligand-based Target Prediction at
>> http://nbviewer.jupyter.org/gist/madgpap/10457778 .
>>
>> Perhaps someone here could point me to other existing resources along
>> similar lines?
>>
>> Best regards,
>>
>> Andrew
>> da...@dalkescientific.com
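
To see what each filter in Eloy's query does, here is a minimal sketch that runs the same query with Python's `sqlite3` module against a tiny in-memory table. The table contains only the columns the query touches, and every row is made up for illustration; none of it is real ChEMBL data. The helper names `make_toy_db` and `fetch_logp_rows` are hypothetical.

```python
import sqlite3

# Eloy's query from the message above.
LOGP_QUERY = """
select ACTIVITY_ID, MOLREGNO, STANDARD_VALUE, STANDARD_TYPE
from ACTIVITIES
where STANDARD_TYPE = 'LogP'
  and STANDARD_VALUE is not null
  and data_validity_comment is null
  and POTENTIAL_DUPLICATE = 0
"""


def make_toy_db():
    """Build an in-memory stand-in for the ACTIVITIES table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table ACTIVITIES ("
        " ACTIVITY_ID integer primary key, MOLREGNO integer,"
        " STANDARD_VALUE real, STANDARD_TYPE text,"
        " DATA_VALIDITY_COMMENT text, POTENTIAL_DUPLICATE integer)"
    )
    conn.executemany(
        "insert into ACTIVITIES values (?, ?, ?, ?, ?, ?)",
        [
            (1, 101, 2.5, "LogP", None, 0),           # kept
            (2, 102, None, "LogP", None, 0),          # dropped: no value
            (3, 103, 1.2, "LogP", "flagged", 0),      # dropped: validity comment
            (4, 104, 3.1, "LogP", None, 1),           # dropped: potential duplicate
            (5, 105, 7.0, "IC50", None, 0),           # dropped: not a logP record
        ],
    )
    return conn


def fetch_logp_rows(conn):
    """Return (ACTIVITY_ID, MOLREGNO, STANDARD_VALUE, STANDARD_TYPE) tuples."""
    return conn.execute(LOGP_QUERY).fetchall()


rows = fetch_logp_rows(make_toy_db())
for row in rows:
    print(row)
```

Against the real SQLite release, the same `fetch_logp_rows` call should work on a connection opened with `sqlite3.connect(...)` on the downloaded database file, since SQLite treats table and column names case-insensitively.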
[Rdkit-discuss] cis/trans info lost when generating SMILES from molfile
Hi RDKitters,

I'm having trouble writing SMILES that include cis/trans info for some molecules I load from a molfile using RDKit. Open Babel and Indigo generate the expected SMILES.

https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL501674

# RDKIT

    from rdkit import Chem
    mol = Chem.MolFromMolFile('CHEMBL501674.mol')
    Chem.MolToSmiles(mol, isomericSmiles=True)

'CO[C@H]1[C@H](OC[C@H]2[C@@H]3O[C@@H]3C=CC(=O)[C@H](C)CC[C@H](O[C@@H]3O[C@H](C)C[C@H](O)[C@H]3O)[C@@H](C)C=CC(=O)O[C@@H]2C)O[C@H](C)[C@@H](O)[C@H]1OC'

# OPENBABEL

    import pybel  # Open Babel 2.x; in 3.x use "from openbabel import pybel"
    # readfile() returns a generator, so take the first molecule
    mol_ob = next(pybel.readfile('mol', 'CHEMBL501674.mol'))
    mol_ob.write('can')

'CO[C@H]1[C@H](OC[C@@H]2[C@@H](C)OC(=O)/C=C/[C@H](C)[C@H](CC[C@H](C(=O)/C=C/[C@@H]3[C@H]2O3)C)O[C@@H]2O[C@H](C)C[C@@H]([C@H]2O)O)O[C@@H]([C@H]([C@H]1OC)O)C'

# INDIGO

    # indigoObject is an Indigo() instance
    mol_indigo = indigoObject.loadMoleculeFromFile('CHEMBL501674.mol')
    mol_indigo.smiles()

'C1(=O)[C@H](C)CC[C@H](O[C@@]2([H])[C@H](O)[C@@H](O)C[C@@H](C)O2)[C@@H](C)C=CC(=O)O[C@H](C)[C@@H](CO[C@]2([H])O[C@H](C)[C@@H](O)[C@@H](OC)[C@H]2OC)[C@]2([H])O[C@]2([H])C=C1 |t:21,51,&1:2,6,8,10,12,15,18,25,27,30,33,35,37,40,43,46|'

Indigo uses ChemAxon extended SMILES notation, but it also recognises the double bond stereo (t:21,51).

If I check the double bond stereo info for the RDKit molecule:

    for bond in mol.GetBonds():
        if bond.GetBondType() == Chem.BondType.DOUBLE:
            print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetStereo())

    6 7 STEREONONE
    11 12 STEREONONE
    0 17 STEREONONE
    5 43 STEREONONE

there is no E/Z info on the bonds.

InChI bond stereo layer:

generated from the molfile: /b12-10+,13-9+
from the RDKit-generated SMILES (with and without Compute2DCoords): /b12-10-,13-9-
from the obabel-generated SMILES: /b12-10+,13-9+

This might be a bug in the piece of code that detects bond stereo info from the molfile, or maybe... I'm just missing something :P

Thanks for your great work!
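
The InChI comparison in this message can be automated with a few lines of plain Python. The sketch below assumes a simple single-component `/b` layer like the ones quoted above, where each entry is `atom-atom` followed by `+` or `-`. The helper names `parse_bond_layer` and `diff_bond_layers` are hypothetical and not part of RDKit or any InChI library. The comparison is only meaningful between InChIs of the same constitution, because the atom numbers are InChI's canonical numbering.

```python
def parse_bond_layer(layer):
    """Parse an InChI /b layer, e.g. '/b12-10+,13-9+' -> {(12, 10): '+', (13, 9): '+'}."""
    body = layer[2:] if layer.startswith("/b") else layer
    parities = {}
    for item in body.split(","):
        atoms, sign = item[:-1], item[-1]
        if sign not in "+-":
            raise ValueError(f"unsupported parity in {item!r}")
        first, second = atoms.split("-")
        parities[(int(first), int(second))] = sign
    return parities


def diff_bond_layers(reference, other):
    """Return {bond: (reference parity, other parity)} for bonds that disagree."""
    ref = parse_bond_layer(reference)
    oth = parse_bond_layer(other)
    return {
        bond: (parity, oth.get(bond))
        for bond, parity in ref.items()
        if parity != oth.get(bond)
    }


# Layers quoted in the message above.
from_molfile = "/b12-10+,13-9+"
from_rdkit_smiles = "/b12-10-,13-9-"
from_obabel_smiles = "/b12-10+,13-9+"

print(diff_bond_layers(from_molfile, from_rdkit_smiles))
print(diff_bond_layers(from_molfile, from_obabel_smiles))
```

An empty result means the two layers agree on every double bond. For the layers quoted here, both double bonds differ between the molfile and the RDKit-generated SMILES, and none differ for the obabel-generated SMILES.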