Re: [Rdkit-discuss] issue during parsing a smile

2018-04-16 Thread Andrew Dalke
On Apr 16, 2018, at 16:29, Guillaume GODIN wrote: > And for this one C[C@@]12CC[C@@](C)(CC1)O2O any idea > > Cause your tool failed too. It's true that smiview failed, in the sense that it shouldn't have tried to do further analysis with a molecule that RDKit rejected. However, RDKit does rep

Re: [Rdkit-discuss] issue during parsing a smile

2018-04-16 Thread Andrew Dalke
If you try this out with my smiview package, available from https://bitbucket.org/dalke/smiview/downloads/ , it reports: % smiview 'C\(C(C)C)=N/O' Cannot parse --smiles: Unexpected term C\(C(C)C)=N/O ^ Tokenizing stopped here A bond must be followed by an atom, closure. That is, the bond

Re: [Rdkit-discuss] reassembling a molecule from R-groups

2018-04-15 Thread Andrew Dalke
On Apr 16, 2018, at 05:37, Patrick Walters wrote: > > Thanks Andrew, the SMILES approach seemed to have quite a few edge cases so I > wrote something to work directly on a molecule. That's the approach I started with, until I figured out that it doesn't preserve chirality. If I change the en

Re: [Rdkit-discuss] reassembling a molecule from R-groups

2018-04-15 Thread Andrew Dalke
Hi Pat, I wrote something like this for mmpdb, which is the MMPA code I helped develop, at https://github.com/rdkit/mmpdb . It has one restriction, which I'll get to in a moment. The general idea is to convert the attachment points to closures, join them with a ".", and canonicalize: >>> fr

Re: [Rdkit-discuss] [Rdkit-announce] [Announcement] 7th RDKit UGM in Cambridge UK

2018-04-11 Thread Andrew Dalke
On Apr 7, 2018, at 07:13, Greg Landrum wrote: > Andrew Dalke (Dalke Scientific) will offer a course on Python and the RDKit I need to finalize what I'm going to cover. I've been going between two approaches. 1) Python programming for cheminformatics This is meant for someone

[Rdkit-discuss] smiview 1.2

2018-04-03 Thread Andrew Dalke
About 10 days ago I posted a prototype program called 'smiview', which displays information about the structure of a SMILES string. Thanks to feedback from a couple of users, and a deep urge to explore the idea, I've just released smiview 1.2, available from https://bitbucket.org/dalke/smiview/

[Rdkit-discuss] smiview 1.1 - a console tool to view SMILES strings

2018-03-24 Thread Andrew Dalke
Over the last few days I've developed a command-line tool that I call "smiview". It's a SMILES viewer. It isn't a depiction tool where the input is in SMILES but rather a tool to highlight different aspects of the SMILES string. I'll put some examples at the end. If you want to try it out you ca

[Rdkit-discuss] chemfp 1.4 and RDKit

2018-03-20 Thread Andrew Dalke
Hi all, I've just released version 1.4 of chemfp, my cheminformatics fingerprint toolkit. It has several new features and bug fixes, which you can read at: http://chemfp.readthedocs.io/en/chemfp-1.4/#what-s-new-in-1-4 The new RDKit feature is the support for "fromAtoms" for those RDKit fin

Re: [Rdkit-discuss] postgres cartridge not parsing a SMILES

2018-01-13 Thread Andrew Dalke
Hi Rajarshi, Here's what RDKit says from the interactive shell: >>> from rdkit import Chem >>> Chem.MolFromSmiles("C1=CC=C(C=C1)[N]2=CC=CC3=C2C4=C(C=CC(=C4)C5=CC=CN=C5)N=C3") [23:02:36] Explicit valence for atom # 6 N, 4, is greater than permitted RDKit is pretty strict about accepting chemicall

Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules

2018-01-11 Thread Andrew Dalke
Hi Wandré, The easiest way to avoid recalculating the fingerprints is to keep the FPS file around. The rdkit2fps program calculates the AtomPair fingerprint and converts the resulting DataStructs fingerprint object into a hex-encoded fingerprint, which is stored as text in the FPS file. One
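To make the hex-encoded storage concrete, here is a minimal sketch that reads FPS-style lines (hex fingerprint, a tab, then the identifier, with header lines starting with "#") and compares two fingerprints. The sample records and the function names are made up for illustration, and this is not chemfp's own reader.

```python
def read_fps(lines):
    """Yield (identifier, fingerprint_bytes) pairs from FPS-style text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and "#" header lines
        hex_fp, fp_id = line.split("\t")[:2]
        yield fp_id, bytes.fromhex(hex_fp)


def popcount(data):
    """Count the number of 1 bits in a byte string."""
    return sum(bin(byte).count("1") for byte in data)


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte fingerprints."""
    both = popcount(bytes(a & b for a, b in zip(fp1, fp2)))
    either = popcount(bytes(a | b for a, b in zip(fp1, fp2)))
    return both / either if either else 0.0


# Hypothetical 16-bit fingerprints, only for demonstration.
sample = ["#FPS1\n", "#num_bits=16\n", "0f00\tmol1\n", "0ff0\tmol2\n"]
fps = dict(read_fps(sample))
score = tanimoto(fps["mol1"], fps["mol2"])
```

Because the fingerprints are already stored as text, rereading the file like this avoids recomputing them from the structures.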

Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules

2018-01-11 Thread Andrew Dalke
On Jan 11, 2018, at 12:04, Wandré wrote: > Thanks for the link. It is very interesting. I will read very carefully. > So, as input on ChemFP, I have to put a file with all molecules in 1 SDF? Chemfp works with fingerprint files, in your case, chemfp's text-based "FPS" format. You'll need to use

Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules

2018-01-11 Thread Andrew Dalke
Hi Wandré, You may want to look at chemfp for this sort of clustering. Last year Chris Swain reviewed a few different ways to do clustering, at https://www.macinchem.org/reviews/clustering/clustering.php . His data set had 4.4M fingerprints and it took 10 hours to cluster at 0.8 similarity th

Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Andrew Dalke
On Nov 9, 2017, at 21:49, Brian Cole wrote: > Certainly, but thousands of lines of Python doesn't fit in an email in an > easily digestible way. :-) I'll restate things since I wasn't clear. While this step may be what you need for the way you structure things, there might be a better way to st

Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Andrew Dalke
On Nov 9, 2017, at 16:09, Brian Cole wrote: > Here's an example of why this is useful at maintaining molecular > fragmentation inside your molecular representation: > > >>> from rdkit import Chem > >>> smiles = 'F9.[C@]91(C)CCO1' > >>> fluorine, core = smiles.split('.') > >>> fluorine > 'F9

Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Andrew Dalke
On Nov 9, 2017, at 08:13, Greg Landrum wrote: > As was discussed in the comments of > https://github.com/rdkit/rdkit/issues/786, I think it's pretty gross that the > second syntax is even legal. But that's a side point. To belabor that point. Neither Daylight SMILES nor OpenSMILES accept it, wh

Re: [Rdkit-discuss] SMARTS for =C=, #CH, #C-

2017-11-08 Thread Andrew Dalke
On Nov 8, 2017, at 21:00, Chenyang Shi wrote: > =C= : [CH0;A;X2;!R](=[$(*)])=[$(*)] The recursive SMARTS notation, which is the term inside of the [$(...)], finds a match for the entire pattern and returns the first atom in that pattern. > For example, if I search "C=C=O" using "[CH0;A;X2;!R](

Re: [Rdkit-discuss] Can't stifle warnings / logs

2017-09-22 Thread Andrew Dalke
Hi Cameron, While you are waiting for an answer about the proper way to silence errors, I can give you a work-around which will help with the metaphorical reams of teletype paper you are printing out. However, it is a very crude solution. Basically, close the C/C++ stderr file descriptor, an
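One way to sketch this kind of work-around is to point file descriptor 2 at the null device for a while and then restore it. This is a variation that redirects rather than permanently closes stderr, and the helper name `silence_c_stderr` is hypothetical; it may differ from what the rest of the message describes.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def silence_c_stderr():
    """Temporarily send everything written to file descriptor 2 to the null device."""
    sys.stderr.flush()                          # push out pending Python-level output
    saved_fd = os.dup(2)                        # keep a copy of the real stderr
    null_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(null_fd, 2)                     # C/C++ writes to fd 2 now vanish
        yield
    finally:
        os.dup2(saved_fd, 2)                    # restore the original stderr
        os.close(null_fd)
        os.close(saved_fd)
```

It is crude in the same way the message warns about: every message written to stderr inside the block is lost, including ones you might have wanted to see.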

Re: [Rdkit-discuss] mmpdb installation on windows using mingw

2017-09-22 Thread Andrew Dalke
On Sep 22, 2017, at 14:26, Kramer, Christian wrote: > thanks for pointing this out. The reason for that error message is that > signal.SIGPIPE is not available under windows. This seems to have slipped > below our radar, since we developed the code on Linux. And Mac. :) I put that code there
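A portable guard for the missing attribute could look like the following. It is only a sketch of the general technique, not the fix that went into mmpdb, and the function name is invented here.

```python
import signal


def restore_default_sigpipe():
    """Install the default SIGPIPE handler on platforms that define SIGPIPE.

    Returns True if the handler was set, False if the platform has no SIGPIPE.
    """
    sigpipe = getattr(signal, "SIGPIPE", None)
    if sigpipe is None:
        return False              # e.g. Windows: nothing to configure
    signal.signal(sigpipe, signal.SIG_DFL)
    return True
```

Using `getattr` with a default keeps the module importable on Windows while leaving the Unix behavior unchanged.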

[Rdkit-discuss] chemfp 1.3 released

2017-09-18 Thread Andrew Dalke
Hi all, I have just released chemfp 1.3. It is available from http://dalkescientific.com/releases/chemfp-1.3.tar.gz . Chemfp is a set of command-line tools and a Python library for working with cheminformatics fingerprints. It can use OEChem/OEGraphSim, RDKit, or Open Babel to create fingerpri

Re: [Rdkit-discuss] ImportError: No module named rdkit

2017-09-14 Thread Andrew Dalke
On Sep 14, 2017, at 19:26, Dimitri Maziuk wrote: > Just FYI: python 2.6 is the system python on (at least) RHEL-6 family of > linux distros that will be officially with us until June 30, 2024. If only Greg got as much money for long term RDKit support as Red Hat gets for long term RHEL support. :

Re: [Rdkit-discuss] Using Chem.WrapLogs()

2017-09-08 Thread Andrew Dalke
On Sep 8, 2017, at 15:51, Noel O'Boyle wrote: > > Hi all, > > I'd like to capture error messages during SMILES parsing, but am having > trouble getting this to work. ... > assert sio.read() != "" That should be a sio.getvalue(). The read() starts from the current file position, which is at
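The difference between the two calls can be seen with a plain in-memory text buffer, independent of RDKit. The message text below is a placeholder.

```python
import io

sio = io.StringIO()
sio.write("parse error message\n")   # the file position is now at the end

at_end = sio.read()          # reads from the current position: nothing left
everything = sio.getvalue()  # returns the whole buffer, ignoring the position

sio.seek(0)                  # alternatively, rewind before reading
after_seek = sio.read()
```

Either `getvalue()` or a `seek(0)` before `read()` returns the captured text.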

Re: [Rdkit-discuss] list of failed chembl ids

2017-08-08 Thread Andrew Dalke
On Aug 8, 2017, at 22:20, Peter S. Shenkin wrote: > But I would be curious to see the 51 CHEMBL SMILES that RDKit could not parse. As of ChEMBL 23, the following files are available: - the sdf.gz file - pre-computed RDKit Morgan fingerprints in fps.gz format - the database available as an S

Re: [Rdkit-discuss] Sanitize Error

2017-07-01 Thread Andrew Dalke
On Jul 1, 2017, at 17:19, Changge Ji wrote: > I want to do some substructure match using MCS. > It seems that Sanitize is needed for MCS. > I met with the over valance error when using sanitize for some molecules. > > Like the following one : > > sa = Chem.MolFrom

Re: [Rdkit-discuss] The RDKit and Python3

2017-06-19 Thread Andrew Dalke
On Jun 19, 2017, at 17:39, Dan Wandschneider wrote: > Greg- > Is the RDKit currently compatible with Python3? If not, when do you expect I > could start migrating a code base that depends on the RDKit? I'm not Greg, but I can answer that question. The RDKit has been available for both Python 2

Re: [Rdkit-discuss] Clustering

2017-06-14 Thread Andrew Dalke
Following up on myself, On Jun 6, 2017, at 04:00, Andrew Dalke wrote: > I've fleshed out that algorithm so it's a command-line program that can be > used for benchmarking purposes. It's available from > http://dalkescientific.com/writings/taylor_butina.py .
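For readers unfamiliar with the algorithm, here is a deliberately small, pure-Python sketch of Taylor-Butina clustering. It is not the taylor_butina.py program mentioned above: it uses sets of on-bits as fingerprints, an O(N^2) neighbor search, and leaves every unassigned singleton as its own cluster, all of which are simplifying assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def taylor_butina(fingerprints, threshold):
    """Return a list of (centroid_index, member_indices) clusters."""
    n = len(fingerprints)
    # Step 1: find each fingerprint's neighbors at or above the threshold.
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Step 2: visit candidates with the most neighbors first (ties by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        # Step 3: the centroid claims all of its still-unassigned neighbors.
        members = sorted(neighbors[centroid] - assigned)
        assigned.add(centroid)
        assigned.update(members)
        clusters.append((centroid, members))
    return clusters


# Toy data: three similar fingerprints and two unrelated ones.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {10, 11}, {10, 12}]
clusters = taylor_butina(fps, 0.5)
```

A production implementation would replace the nested loops with a fast threshold search, which is where most of the running time goes.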

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Andrew Dalke
On Jun 5, 2017, at 11:02, Michał Nowotka wrote: > Is there anyone who actually done this: clustered >2M compounds using > any well-known clustering algorithm and is willing to share a code and > some performance statistics? Yes. People regularly use chemfp (http://chemfp.com/) to cluster >2M comp

Re: [Rdkit-discuss] Non-standard Heavy Atoms and CHemFP Fingerprints

2017-05-19 Thread Andrew Dalke
On May 19, 2017, at 21:59, Markus Heller wrote: > [In chemfp] I get the following error: > > [11:37:55] Explicit valence for atom # 6 Te, 4, is greater than permitted > ERROR: Cannot parse the SMILES > 'CC(C)(/C(=C\\Cl)/[Te-2](c1ccc(cc1)OC)(Cl)Cl)O' at line 155850 of > chembl_23.fixed.smi. Exit

Re: [Rdkit-discuss] Fast similarity search

2017-05-19 Thread Andrew Dalke
On May 19, 2017, at 08:33, Greg Landrum wrote: > The best solution to this is to use chemfp. It's a remarkable piece of > software. Thanks, Greg. > If you aren't willing to license that, the RDKit's search brute-force > fingerprint search capabilities aren't too bad for in-memory fingerprints.

Re: [Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Andrew Dalke
On Apr 19, 2017, at 23:59, Peter S. Shenkin wrote: > One more thing. The term "Mol" in RDKit and some other tookits does not > really mean "molecule" in the sense that chemists use it. ? I don't see how this is connected to the previous emails. I believe most toolkits use that terminology in th

Re: [Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Andrew Dalke
On Apr 19, 2017, at 18:26, Curt Fischer wrote: > From chemistry stack exchange, an answer contributed by user R.M.: > > SMARTS is deliberately designed to be a superset of SMILES. That is, any > valid SMILES depiction should also be a valid SMARTS query, one that will > retrieve the very struct

Re: [Rdkit-discuss] Information contained in SMARTS and SMILES

2017-04-19 Thread Andrew Dalke
On Apr 19, 2017, at 12:03, Thilo Bauer wrote: > is converting SMARTS to SMILES a "lossless" operation, or does one loose > information on doing so? It is obviously not lossless if you include terms that cannot be represented in SMILES. >>> from rdkit import Chem >>> Chem.MolToSmiles(Chem.MolF

Re: [Rdkit-discuss] Cannot import rdBase after installed rdkit by source in a non-administrator linux cluster

2017-03-28 Thread Andrew Dalke
On Mar 28, 2017, at 17:56, 杨弘宾 wrote: > Have you tried install rdkit from source? It's ok when I installed rdkit > by conda in my PC. But when I tried installing it in a server in which I am > only a user who cannot use "sudo" and the "python" is in a read-only > directory. Yes I have, and

Re: [Rdkit-discuss] How to determine if atoms are part of the same ring?

2017-02-08 Thread Andrew Dalke
On Feb 8, 2017, at 19:22, Markus Metz wrote: > The question to you is: Is there another more elegant way of doing it? May be > I missed something from the python API? I don't quite follow what you are looking for, though I have managed to condense your code somewhat, into: updatedMapping = Non

Re: [Rdkit-discuss] isotopic SMILES

2017-02-07 Thread Andrew Dalke
On Feb 7, 2017, at 22:26, Curt Fischer wrote: > def same_implicit_valence(mol_1, mol_2, atom_idx=1): > """Returns True if mol_1 and mol_2 have the same implicit valence for the > indexed atom""" > mol_1_implicitH = mol_1.GetAtomWithIdx(atom_idx).GetImplicitValence() > mol_2_implicitH

Re: [Rdkit-discuss] isotopic SMILES

2017-02-07 Thread Andrew Dalke
On Feb 7, 2017, at 19:02, Curt Fischer wrote: > My ultimate goal is an easy way to create rdkit molecules that have isotopic > substitutions but which are otherwise exactly the same as non-substituted > variants. What's the best approach? Is it to directly call .SetIsotope() > like I do above

Re: [Rdkit-discuss] isotopic SMILES

2017-02-06 Thread Andrew Dalke
On Feb 7, 2017, at 01:17, Curt Fischer wrote: > I am confused by this behavior: > > >>> labeled_etoh = Chem.MolFromSmiles('C[13C]O') > >>> print(Chem.MolToSmiles(labeled_etoh)) > > C[C]O > > >>> print(Chem.MolToSmiles(labeled_etoh, isomericSmiles=True)) > > C[13C]O > > 1. Why are there any br

Re: [Rdkit-discuss] Rdkit atom indexing vs indexing in written pdb file

2017-02-01 Thread Andrew Dalke
Dear Susan, If I understand what's going on correctly, you have run across the difference between 0-based and 1-based indexing. See https://en.wikipedia.org/wiki/Zero-based_numbering . RDKit, like most programming libraries and languages, indexes based on an offset from the beginning, so 0 mea
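The conversion between the two numbering schemes is a one-line offset. The sketch below assumes the serial numbers in the written file are consecutive and start at 1, which is typical but not guaranteed; the atom names and helper functions are hypothetical.

```python
def to_file_serial(zero_based_index):
    """Convert a 0-based atom index to a 1-based serial number."""
    return zero_based_index + 1


def to_zero_based(file_serial):
    """Convert a 1-based serial number back to a 0-based atom index."""
    return file_serial - 1


atom_names = ["N", "CA", "C", "O"]   # made-up atoms in file order
serials = [to_file_serial(i) for i, _ in enumerate(atom_names)]
```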

Re: [Rdkit-discuss] MolToSmiles

2016-12-19 Thread Andrew Dalke
On Dec 19, 2016, at 6:22 PM, Brian Kelley wrote: > I had thought about making a CanonicalAtomOrder function that does this as > well, or perhaps making a MolToSmiles variant. I learned about this function from Noel's blog post at https://nextmovesoftware.com/blog/2013/07/01/accessing-smiles-atom

Re: [Rdkit-discuss] MolToSmiles

2016-12-18 Thread Andrew Dalke
On Dec 18, 2016, at 6:32 PM, Brian Kelley wrote: > >>> m.GetProp("_smilesAtomOutputOrder") > '[3,2,1,0,]' > > Note that this returns the list as a string which is sub-optimal. > GetPropsAsDict will convert these to proper python objects, however, this is > considered a private member so you nee
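If only the string form is available, it can be turned into a list of integers by hand. Note the trailing comma in the value, which is why a JSON parser would reject it. This is a small sketch with an invented helper name, working purely on the string shown in the quoted output.

```python
def parse_atom_order(text):
    """Parse a string such as '[3,2,1,0,]' into a list of ints."""
    fields = text.strip().strip("[]").split(",")
    # The trailing comma leaves an empty final field, which is skipped.
    return [int(field) for field in fields if field.strip()]


order = parse_atom_order("[3,2,1,0,]")
```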

Re: [Rdkit-discuss] Canonicalisation with reaction labels

2016-12-17 Thread Andrew Dalke
On Dec 16, 2016, at 3:27 PM, Andrew Dalke wrote: > 2013 RDKit didn't preserve the atom order between labeled and unlabeled atoms. ... > I no longer have an older version of RDKit installed. My memory is wrong. I have rebuilt a version from 2013 and been unable to find a failure cas

Re: [Rdkit-discuss] SDwriter

2016-12-16 Thread Andrew Dalke
On Dec 17, 2016, at 1:45 AM, Milinda Samaraweera wrote: > However at the end of each tag header I noticed there is a number (bolded): > > ... > >(1) > N1-(2-ethylbutyl)hexane-1,3,6-triamine ... > What is this number and how you avoid printing this number when SDwriter is > used? As this n

Re: [Rdkit-discuss] Canonicalisation with reaction labels

2016-12-16 Thread Andrew Dalke
On Dec 16, 2016, at 1:55 PM, Stephen Pickett wrote: > With a 2013 RDkit install we get consistent canonicalization between reaction > labelled and unlabelled atoms. > >>> mol = Chem.MolFromSmiles('C1CC([*])CCN1') > >>> Chem.MolToSmiles(mol) > '[*]C1CCNCC1' > >>> mol = Chem.MolFromSmiles('C1CC([*:1

Re: [Rdkit-discuss] Generating all stereochem possibilities from smile

2016-12-09 Thread Andrew Dalke
On Dec 9, 2016, at 9:50 PM, Brian Kelley wrote: > >>> from rdkit import Chem > >>> m = Chem.MolFromSmiles("F/C=C/F") > >>> for bond in m.GetBonds(): > ...print bond.GetStereo() > ... > STEREONONE > STEREOE > STEREONONE > > However, setting bond stereo doesn't appear to be exposed. I thought

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Andrew Dalke
On Dec 5, 2016, at 3:28 PM, Alexis Parenty wrote: > For the parenthesis issue, the difficulty is to differentiate the SMILES > formats (xxx)(xxx) from this one (xxx)… I will try and address > that using something like: I do not understand. The first one is not a SMILES format. Can y

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Andrew Dalke
On Dec 5, 2016, at 11:35 AM, Alexis Parenty wrote: > I have tested my script on: > • 7900 unique SMILES for “drug-like molecules” > • Alice’s adventure in wonderland (I never read the book but I assumed > there is no SMILES!) > • A shuffled mixture of Alice’s in wonderland and 7900 uni

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-03 Thread Andrew Dalke
On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote: > I hacked a version of RDKit's smiles parser to compute heavy atom count, > perhaps some version of this could be used to check smiles validity without > making the actual molecule. FWIW, here's my regex code for it, which makes the assumption tha

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-03 Thread Andrew Dalke
On Dec 3, 2016, at 3:02 PM, Brian Kelley wrote: > If I had to pick, I would just use the normal MolFromSmiles, if you don't > expect many actual smiles strings in your corpus, it's plenty fast. I didn't follow from your timings what you used to see if something was a SMILES candidate? Was it wo

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Andrew Dalke
On Dec 2, 2016, at 10:05 PM, Brian Kelley wrote: > Here is a very old version of Andrew's parser in code form: ... It was fairy > well tested on the sigma catalog back in the day. It might be fun to > resurrect use it in some form. There's also my OpenSMILES parser written for Ragel: https:/

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Andrew Dalke
On Dec 2, 2016, at 10:12 PM, George Papadatos wrote: > If Alexis wants to search for valid SMILES strings representing typical > organic molecules among text of plain English words, would it not be safe to > assume that any word containing more than 4 'C' or 'c' characters would only > be a SMIL

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Andrew Dalke
osures, and where the "connector" # is the possible combinations of open/close parentheses, dot disconnect, # or bond. # It does not attempt to balance parentheses, ensure matching ring # closures, or handle aromaticity. Those cannot be done with a regular # expression. # Written in 20

[Rdkit-discuss] identify chiral atoms which became achiral after fragmenting

2016-10-06 Thread Andrew Dalke
I'm trying to figure out which atoms lose chirality after breaking bonds using FragmentOnBonds(). Here's an example where a chiral carbon after fragmentation gets two "*" atoms, which makes the carbon achiral: >>> from rdkit import Chem >>> mol = Chem.MolFromSmiles("F[C@](Cl)(Br)O") >>> fragmen

Re: [Rdkit-discuss] MolFromMolBlock does not read properties

2016-10-03 Thread Andrew Dalke
On Oct 2, 2016, at 10:48 PM, Maciek Wójcikowski wrote: > Yes I get it, but obviously there is no MolFromSDBlock, so one would suspect > MolFromMolBlock to support both formats. As I understand correctly the only > way of reading SD from variable is as presented in my example? Or is there > some

Re: [Rdkit-discuss] MolWt of substructure hit?

2016-09-07 Thread Andrew Dalke
On Sep 7, 2016, at 11:53 AM, Stephen O'hagan wrote: > How would I find the molecular weight (fraction) of that substructure within > a compounds expressed as a SMILES string, e.g.: I don't know of a built-in function which does this. It's possible to write one. Here's a function which will compu

Re: [Rdkit-discuss] Chirality conservation during atom replacement

2016-06-21 Thread Andrew Dalke
On Jun 21, 2016, at 5:26 PM, Greg Landrum wrote: > Because chirality is represented relative to the ordering of the bonds around > an atom, it's pretty difficult to do this if you want to actually break and > add bonds on your own. This would probably be somewhat easier if there were > an RWMol.

Re: [Rdkit-discuss] stereochemistry of S with degree 3

2016-02-11 Thread Andrew Dalke
On Feb 10, 2016, at 6:09 AM, Greg Landrum wrote: > I agree that this is a bug. Glad to hear. I was wondering how I would get my code to handle that case otherwise. On Feb 10, 2016, at 4:19 PM, David Cosgrove wrote: > As chiralities go, this one has turned out to be quite valuable! You can tell

Re: [Rdkit-discuss] stereochemistry of S with degree 3

2016-02-08 Thread Andrew Dalke
On Feb 8, 2016, at 7:03 PM, Paolo Tosco wrote: > ... there is a "ghost" atom involved in determining the sulfur chirality, > which is the sulfur lone pair. Even if this is not in the Daylight specs, the > lone-pair is usually treated as an implicit hydrogen, and therefore > considered as the fir

Re: [Rdkit-discuss] stereochemistry of S with degree 3

2016-02-08 Thread Andrew Dalke
Thanks Paolo and Hannes for pointing me to sulfoxide. I am enlightened! I assume this is something that every chemist knows, but it's not mentioned in the Daylight SMILES documentation (or the OpenSMILES documentation), so I had no clue. I wonder how many more cases there are like that. Any ide

[Rdkit-discuss] stereochemistry of S with degree 3

2016-02-08 Thread Andrew Dalke
Hi! Could someone explain to this non-chemist what the chirality means in the following? CN[S@@](=O)C1=CC=CC=C1 It comes from PubChem id 12194260 at https://pubchem.ncbi.nlm.nih.gov/compound/12194260 . Isn't this a symmetric structure, which can't have an orientation at that point? Even

Re: [Rdkit-discuss] how to replace a bond and preserve chirality

2016-02-04 Thread Andrew Dalke
Hi Dave, Thanks for the suggestion about mutating the atom in-place then pruning the rest of the R-group away. This will work, but it's inelegant and slow. Here's why. I'm trying to construct an R-group table for each core, for up to 3 R-groups, of a data set. For that, I need to know the ca

Re: [Rdkit-discuss] confused about explicit hydrogens and canonicalization

2016-02-03 Thread Andrew Dalke
On Feb 3, 2016, at 6:42 AM, Greg Landrum wrote: > 2) If you add a call to Chem.SanitizeMol(hydrogren_mol) before any of the > calls to SMILES generation, it clears up the problem. The calls to > SetNumExplicitHs() are not necessary. I am able to fix my problem by adding a SANITIZE_ADJUSTHS : de

Re: [Rdkit-discuss] confused about explicit hydrogens and canonicalization

2016-02-03 Thread Andrew Dalke
On Feb 3, 2016, at 6:42 AM, Greg Landrum wrote: > 1) in the code you have this snippet: > # This gives: c1ccc(nc1)-n1ncc2ccc(nc21)C1CC1 > # That SMILES appears to be incorrect! > Why do you think that's true? I was incorrect in saying "incorrect". I should have said "not canonical". I expect the

[Rdkit-discuss] how to replace a bond and preserve chirality

2016-02-02 Thread Andrew Dalke
I'm working on a project where I cut a molecule along certain single bonds, to find a core structure and one or more R-groups. In yesterday's email, I mentioned a problem I have in creating a canonical SMILES for the core when the R-groups are replaced by a hydrogen. I also want to create a SMI

[Rdkit-discuss] confused about explicit hydrogens and canonicalization

2016-02-01 Thread Andrew Dalke
Hi all, I have a problem that I think is due to my not understanding how to work with explicit (or perhaps implicit) hydrogens. In my project, I want to find the core of a molecule as well as its R-groups. I use a SMARTS pattern to find the bonds to cut, then want to store two versions of

Re: [Rdkit-discuss] atom indices

2015-12-10 Thread Andrew Dalke
On Dec 9, 2015, at 1:53 PM, chris dalton wrote: > I have fragmented a molecule using GetMolFrags and want to relate the atoms > in the fragments to the original molecule. However, each fragment appears to > start at atom index 0 which prevents direct comparison with the original > atoms. One s

Re: [Rdkit-discuss] Count carbon atoms

2015-10-08 Thread Andrew Dalke
On Oct 8, 2015, at 2:38 PM, John M wrote: > This seems odd... surely you can't go faster that iterating over the atoms > and counting element 6? One is in C++, the other is in Python. > Perhaps the python iter is indeed slower than a SMARTS match but that can't > be true? The "for atom in mo

Re: [Rdkit-discuss] Count carbon atoms

2015-10-07 Thread Andrew Dalke
On Oct 7, 2015, at 11:38 PM, Ling Chan wrote: > Or you can use AllChem.CalcMolFormula() to get the chemical formula. Well spotted! It's a bit tricky because it needs to handle carbons with/without count ("CH4", "C2H6"), and structures with no carbons ("P", "Ca", "Cd"); the last two start with a
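The tricky cases listed above (a carbon with or without a count, and element symbols such as Ca and Cd that merely start with "C") can be handled with a short regular expression applied to the formula string. This is one possible sketch, not necessarily the approach the message goes on to give, and the function name is made up.

```python
import re

# "C" not followed by a lowercase letter (so not Ca, Cd, Cl, ...),
# optionally followed by a count.
_carbon_term = re.compile(r"C(?![a-z])(\d*)")


def carbon_count(formula):
    """Return the number of carbons named in a molecular formula string."""
    match = _carbon_term.search(formula)
    if match is None:
        return 0
    return int(match.group(1) or 1)   # a bare "C" means one carbon
```

For example, `carbon_count("C2H6")` gives 2 while `carbon_count("Ca")` gives 0.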

Re: [Rdkit-discuss] Count carbon atoms

2015-10-07 Thread Andrew Dalke
On Oct 7, 2015, at 11:30 AM, Christos Kannas wrote: > Yes there is an easier way, by using substructure search, i.e. do a > substructure search for [C] and then get the number of matches. > m = Chem.MolFromSmiles("c1c1") > patt= Chem.MolFromSmarts("[C]") > pm = m.GetSubstructMatches

Re: [Rdkit-discuss] trouble with SMARTs interpretation of 'not hydrogen'

2015-09-16 Thread Andrew Dalke
On Sep 16, 2015, at 9:57 PM, Bodle, Christopher R wrote: > I am having trouble with RDKit correctly interpreting the SMARTS character > [!#1], which should be interpreted as "any atom not hydrogen. I've been looking at your emails but it's difficult for me to figure out what you are doing. Can y

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 6:38 PM, Jing Lu wrote: > I hope the memory issue won't be a problem. That's up to you and your choice of threshold. > Most AgglomerativeClustering algorithms have time complexity with N^2. Will > that be a problem? You have to decide for yourself what counts as a problem.

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 3:43 AM, Jing Lu wrote: > If I want to cluster more than 1M molecules by ECFP4. How could I do it? If I > calculate the distance between every pair of molecules, the size of distance > matrix will be too big. Does RDKit support any heuristic clustering algorithm > without cal

Re: [Rdkit-discuss] about MACCSkeys

2015-08-01 Thread Andrew Dalke
Dear Takayuki, On Aug 1, 2015, at 3:54 AM, Taka Seri wrote: > Why the [MACCSkeys] bit length is not 166 bit ? > The RDKit MACCS implementation follows the MACCS key assignments, which start at 1. MACCS bit 0 is always set to 0, bit 1 corresponds to key 1, etc., so key 166 is at bit 166, givi

Re: [Rdkit-discuss] Two SMILES that (I think) should canonicalize to the same thing, but don't

2015-06-16 Thread Andrew Dalke
On Jun 16, 2015, at 10:20 PM, Peter Shenkin wrote: > [N-]=[N+]=NC(=O)N1C(=O)N([N+]([O-])=O)C2(C13C4=C56)C4=C5C2=C36 > [N-]=[N+]=NC(=O)N(C(=O)N1[N+]([O-])=O)C(c23)(c4c56)C16c3c5c24 > > rdkit canonicalizes the two to the following, respectively: > > [N-]=[N+]=NC(=O)N1C(=O)N([N+](=O)[O-])C23c4c5c2c2

Re: [Rdkit-discuss] IUPAC name

2015-06-11 Thread Andrew Dalke
On Jun 11, 2015, at 2:20 PM, Laëtitia Bomble wrote: > Is there a rdkit tool to get IUPAC name of a molecule? No, there isn't. If you only have a few names, and/or are willing to wait for a web service, you can use the NCI resolver at http://cactus.nci.nih.gov/chemical/structure For example, th
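A request URL for the resolver can be assembled as sketched below. The base address comes from the message; the `/iupac_name` suffix is an assumption about the service's URL scheme, and the helper name is invented. The sketch only builds the URL and does not contact the service.

```python
from urllib.parse import quote

RESOLVER = "http://cactus.nci.nih.gov/chemical/structure"


def iupac_name_url(identifier):
    """Build a resolver URL asking for the IUPAC name of a structure identifier."""
    # SMILES can contain characters such as '#' and '/' that must be escaped.
    return f"{RESOLVER}/{quote(identifier, safe='')}/iupac_name"


url = iupac_name_url("C#N")
```

Escaping matters because an unescaped "#" in a SMILES string would otherwise be treated as the start of a URL fragment.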

[Rdkit-discuss] propbox-0.5

2015-06-09 Thread Andrew Dalke
Hi all, I spent the last couple of weeks working on a project related to molecular property and model calculations. It's called 'propbox', and is available from https://bitbucket.org/dalke/propbox . There are two parts to it: - a (sparse) table, where the rows are structures and the columns

Re: [Rdkit-discuss] SDF properties in case of error

2015-05-01 Thread Andrew Dalke
On May 1, 2015, at 12:01 AM, Michael Reutlinger wrote: > However, in some cases this does not help. E.g. when an unknown atom (most of > the time this is X) is found in the MolBlock the import fails with an > Post-condition Violation and None is yielded. This is fine to detect the > problem BUT

Re: [Rdkit-discuss] SDF tags and "->"

2015-04-30 Thread Andrew Dalke
On Apr 30, 2015, at 6:08 AM, Greg Landrum wrote: > I still need to put some thought into patching the SDWriter so that it can > recognize things like consecutive line endings in property values. The big > question is what it should do when it encounters such a case. Is that an > error? Should it

Re: [Rdkit-discuss] SDF tags and "->"

2015-04-29 Thread Andrew Dalke
On Apr 29, 2015, at 9:19 PM, Dimitri Maziuk wrote: > There is a difference between ACM members writing network protocols and > "domain" people writing junk. I think that you are saying that the MDL connection table file formats are junk. I do not disagree. But it's something we have to deal with s

Re: [Rdkit-discuss] SDF tags and "->"

2015-04-29 Thread Andrew Dalke
On Apr 29, 2015, at 7:30 PM, Dimitri Maziuk wrote: > Based on "be liberal in what you accept and conservative in what you > produce", the writer should Postel's Robustness principle is a mistake. See RFC 3117 for elaboration, at http://tools.ietf.org/html/rfc3117#page-16 Counter-intuitively, P

Re: [Rdkit-discuss] SDF tags and "->"

2015-04-29 Thread Andrew Dalke
Riccardo Vianello: > I suppose that if the correctness of the parser is confirmed, then a change > could be suggested for the writer, consisting in raising an error if blank > lines are present inside the data item. Yes, the SD tag data is not a general purpose data field. It's not possible,
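A writer-side check for the problem case could be as small as the sketch below. It only tests for blank lines inside a data value, since a blank line is what terminates a data item in an SD file; the function name is hypothetical and this is not RDKit's SDWriter logic.

```python
def has_blank_line(value):
    """Return True if an SD data value contains an empty or whitespace-only line."""
    return any(not line.strip() for line in value.splitlines())


ok_value = "first line\nsecond line"
bad_value = "first line\n\nsecond line"
```

What a writer should do once it detects such a value (raise an error, or rewrite the value) is the open question discussed in this thread.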

Re: [Rdkit-discuss] UGM2014: interactive SDF viewer - code on GitHub

2014-11-03 Thread Andrew Dalke
On Nov 3, 2014, at 10:22 AM, Pahl, Axel wrote: > Please have fun with the program and let me know if there are any bugs > or improvement proposals (apart from those already listed in the README). It looks very nice! You might want to put a link to http://nbviewer.ipython.org/github/apahl/sdf

Re: [Rdkit-discuss] Error in reading sdf format

2014-07-23 Thread Andrew Dalke
On Jul 23, 2014, at 10:26 PM, Abhik Seal wrote: > I have a sdf file attached(2 molecules only) What you have isn't an SD file. It's missing a line in the header block. The header block is supposed to contain three lines. I quote from the specification: > Line 1: Molecule name. This line is unfo
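A quick way to spot a missing header line is to check where the counts line falls: with a complete three-line header it is the fourth line of the record. The sketch below finds it by looking for the "V2000" or "V3000" version tag, which is an assumption that holds for most modern files but not for every old one. The sample records are made up.

```python
def counts_line_index(record_text):
    """Return the 0-based index of the counts line, or None if not found."""
    for index, line in enumerate(record_text.splitlines()):
        if line.rstrip().endswith(("V2000", "V3000")):
            return index
    return None


good = "name\n  program\ncomment\n  1  0  0  0  0  0  0  0  0  0999 V2000\n"
bad = "name\n  program\n  1  0  0  0  0  0  0  0  0  0999 V2000\n"
```

An index of 3 means the header has its three lines; any smaller value means a header line is missing.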

Re: [Rdkit-discuss] MCS-based similarity in carbohydrates

2014-05-08 Thread Andrew Dalke
Hi Sushil, On May 8, 2014, at 12:26 PM, Sushil Mishra wrote: > MCS algorithm seems to me unable to handle chiral carbons and it can not > differentiate chiral changes in ligands. That's correct. The MCS algorithm in RDKit doesn't consider chirality. While in principle I think it would be possi

Re: [Rdkit-discuss] Aromatic Boron SMARTS

2014-02-18 Thread Andrew Dalke
On Feb 18, 2014, at 6:51 PM, Matthew Swain wrote: > I don't really know what's going on here, but you could try [#5!B] for you > SMARTS. > > #5 to match any boron, and !B to disallow non-aromatic. Another possibility is [#5a], since "a" means "aromatic" >>> from rdkit import Chem >>> mol = Chem

Re: [Rdkit-discuss] problems with 'Ames mutagenicity dataset analysis using RDKit and PANDAS' tutorial

2013-11-25 Thread Andrew Dalke
On Nov 24, 2013, at 11:58 PM, Nikolas Fechner wrote: > if I remember correctly 10.1 was an intermediate pandas version where the > HTML rendering in tables, that we use for rendering the structures, does not > work as we need it. In this version pandas introduced an HTML escaping, which > leads

Re: [Rdkit-discuss] Beta of Q3 2013 release available

2013-10-25 Thread Andrew Dalke
On Oct 25, 2013, at 10:11 AM, Roger Sayle wrote: > The use of an integer file format "flavor" argument allows the caller > to customize the behavior of the readers and writers. The semantics > is that a reasonable default is zero (for all bits), but that new > features may be added without chang
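As a rough illustration of the "flavor" idea quoted above, the sketch below uses plain integer bit flags where zero means "all defaults". The flag names are hypothetical and are not taken from RDKit or any other toolkit; the only point is that new bits can be added later without changing what existing callers get.

```python
# Hypothetical option bits; each new feature takes the next free bit.
NO_STEREO = 1 << 0
KEKULIZE = 1 << 1
ATOM_MAPS = 1 << 2

_FLAG_NAMES = {NO_STEREO: "NO_STEREO", KEKULIZE: "KEKULIZE", ATOM_MAPS: "ATOM_MAPS"}


def enabled_options(flavor=0):
    """Return the names of the option bits set in `flavor`.

    A flavor of 0 (the default) enables nothing, so callers that never
    pass a flavor keep the same behavior when new bits are defined.
    """
    return [name for bit, name in sorted(_FLAG_NAMES.items()) if flavor & bit]


print(enabled_options())
print(enabled_options(NO_STEREO | ATOM_MAPS))
```

Options are combined with `|`, and a reader or writer tests each bit with `&`.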

Re: [Rdkit-discuss] Substructure search paper

2013-04-11 Thread Andrew Dalke
On Apr 11, 2013, at 1:39 PM, Quentin Delettre wrote: > I was more concerned about algorithms/implementation, pitfalls that > could happen and performance. There are none. "Pretty much every cheminformatics toolkit can do what you want." The toolkits I know of use either the Ullmann algorithm or

Re: [Rdkit-discuss] Substructure search paper

2013-04-11 Thread Andrew Dalke
On Apr 11, 2013, at 10:10 AM, Quentin Delettre wrote: > I plan to use substructure search for around 1500 molecules versus 3000 small > fragments .. > I am quite new in the field and it's the occasion to compare programs and > libraries > that can do that. Can you provide me some links to papers

Re: [Rdkit-discuss] ligand MCS alignment

2013-03-19 Thread Andrew Dalke
Hi Fabian, On Mar 19, 2013, at 2:05 PM, Fabian Dey wrote: > - in order to get a 1-1 correspondence of atom ids (to get the coordinate > map) I had to search the MCS-SMARTS match again against the original files to > get the atom-ids - is there a more direct way to do this? There is no more dire

Re: [Rdkit-discuss] RDKit participation in the Google Summer of Code

2013-02-21 Thread Andrew Dalke
On Feb 22, 2013, at 6:51 AM, Greg Landrum wrote: > Please feel free to add to the list by either commenting on that page, > sending ideas here, or emailing them to me directly. Add an implementation of Noel's method to use InChI to get a canonical ordering for SMILES output. Any improvements to t

Re: [Rdkit-discuss] GetItemText for ForwardSDMolSupplier

2013-02-19 Thread Andrew Dalke
On Feb 18, 2013, at 5:50 PM, paul.czodrow...@merckgroup.com wrote: > My issue2solve: read in a sdf.gz & simply extract the SD tags. If you don't mind digging into the undocumented chemfp API (which means that it may change in the future), then you can use the simple-minded SDF reader I wrote for it
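For the "read an sdf.gz and extract the SD tags" task, here is a standard-library-only sketch. It is not the chemfp reader mentioned above and does not use RDKit; the function names are made up for illustration. It assumes each record has a molblock ending in "M  END", data headers of the form ">  <TAG>", and a blank line after each data item.

```python
import gzip


def iter_sd_tags(lines):
    """Yield one {tag: value} dict per record from an iterable of text lines."""
    tags, current, in_data = {}, None, False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "$$$$":
            # End of record: join multi-line values and reset the state.
            yield {name: "\n".join(values) for name, values in tags.items()}
            tags, current, in_data = {}, None, False
        elif not in_data:
            # Skip the molblock; data items start after "M  END".
            in_data = line.startswith("M  END")
        elif current is None:
            if line.startswith(">") and "<" in line:
                # Data header: the tag name sits between "<" and ">".
                start = line.index("<") + 1
                end = line.find(">", start)
                current = line[start:] if end == -1 else line[start:end]
                tags[current] = []
        elif line == "":
            current = None  # a blank line ends the current data item
        else:
            tags[current].append(line)


def read_sd_tags_gz(path):
    """Read every record's tags from a gzip-compressed SD file."""
    with gzip.open(path, "rt") as infile:
        return list(iter_sd_tags(infile))
```

Because the molblock is skipped rather than parsed, this only makes sense when the tags are all that is needed.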

Re: [Rdkit-discuss] how to use structure as substructure query

2012-12-20 Thread Andrew Dalke
On Dec 15, 2012, at 4:40 AM, Greg Landrum wrote: > Note that this also means that the H in C[OH] is ignored, so it's now a > substructure of C[O-]. For finer-grain control over H specifications in > queries, you will need to use either SMARTS or molecules that have Hs added. > > This look ok? Y

Re: [Rdkit-discuss] one flavor of MCS

2012-12-13 Thread Andrew Dalke
On Dec 13, 2012, at 3:32 PM, paul.czodrow...@merckgroup.com wrote: >> I think I figured out a way around that via some post-processing. > > Great! > Now let's come to another question: > How does one code the "complete-ring-only" variation? > Can your code be adapated, or shall I do some post-pr

Re: [Rdkit-discuss] one flavor of MCS

2012-12-13 Thread Andrew Dalke
On Dec 13, 2012, at 9:18 AM, paul.czodrow...@merckgroup.com wrote: > Regarding the issues you mentioned ... > > - non-canonical SMARTS > - duplicates are not filtered out I think I figured out a way around that via some post-processing. >> Or do you mean the number of molecules which contain

Re: [Rdkit-discuss] how to use structure as substructure query

2012-12-04 Thread Andrew Dalke
On Dec 4, 2012, at 11:56 AM, Greg Landrum wrote: > Bonds are matched purely using bond type, with the one exception that a bond > of unspecified type matches anything and is matched by anything. And I forgot to ask - is there any way in SMILES to produce a bond of unspecified type? > hmm, not su

Re: [Rdkit-discuss] how to use structure as substructure query

2012-12-04 Thread Andrew Dalke
I am beginning to realize the error of my ways. This is the same issue which occurred in fmcs. Suppose you have c1c1C and CC. The MCS between those two is [#6]-[#6]. Atom aromaticity is not useful when doing a comparison. On Dec 4, 2012, at 5:32 AM, Greg Landrum wrote: > Aromaticity is ignor

Re: [Rdkit-discuss] how to use structure as substructure query

2012-12-03 Thread Andrew Dalke
On Dec 3, 2012, at 4:55 PM, Greg Landrum wrote: > Yes, it's here: > http://www.rdkit.org/docs/RDKit_Book.html#atom-atom-matching-in-substructure-queries Thanks. It's incomplete though - it doesn't show how bonds are matched nor how aromaticity is handled for atoms. Does a SMILES with a "C" mean t

[Rdkit-discuss] how to use structure as substructure query

2012-12-03 Thread Andrew Dalke
What are the steps one must do to use an input structure (from a SMILES string) as a substructure query? It looks like I need to remove explicit hydrogens [* see footnote]. Is there anything else? And what is the right way to remove explicit hydrogens? I'm working again on a project to do substru

Re: [Rdkit-discuss] MCS bug when ringMatchesRingOnly=False but completeRingsOnly=True

2012-11-23 Thread Andrew Dalke
Hi Greg, > I've found some behavior in the MCS code that I would call a bug. (I'm > hedging a bit because I realize one could argue about this one...) > ... > [11]>>> MCS.FindMCS(mols,ringMatchesRingOnly=False,completeRingsOnly=True) > [11]: MCSResult(numAtoms=-1, numBonds=-1, smarts=None, compl

[Rdkit-discuss] feedback for my MCS benefactor

2012-11-22 Thread Andrew Dalke
Hi all, As you may know, Roche funded me to implement the multiple-structure MCS algorithm which is now part of RDKit. They see it as a way to contribute back to free and open source cheminformatics software projects. I would like to show them that it has been a success, or at least that peop

Re: [Rdkit-discuss] pilfont error message

2012-11-20 Thread Andrew Dalke
On Nov 20, 2012, at 11:05 AM, paul.czodrow...@merckgroup.com wrote: > The situation is getting complicated, since your hack did not help. With your error message, I see that only 'sans' is allowed in RDKit. So says rdkit/Chem/Draw/spingCanvas.py: faceMap={'sans':'helvetica'} which means I gave y
