Re: [Rdkit-discuss] How to pass atomic charge as atom invariant to ECFP?
Dear Thomas,

You can get the SMILES of the substructures that the `GetMorganFingerprint` function extracts as follows. You can then append any label to the SMILES string, but not real numbers.

```python
from rdkit import Chem
from rdkit.Chem import AllChem  # GetMorganFingerprint lives in AllChem

mol = Chem.MolFromSmiles('Cc1n1')
info = {}
AllChem.GetMorganFingerprint(mol, radius=2, bitInfo=info)
# each bitInfo entry is (atom_id, radius); reverse it to get (radius, atom_id)
radius, atom_id = list(info.values())[0][0][::-1]
env = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_id)
sub_struct = Chem.PathToSubmol(mol, env)
type(sub_struct) #=> rdkit.Chem.rdchem.Mol
Chem.MolToSmiles(sub_struct) #=> 'ccc'
```

Best,

On Fri, 22 Nov 2019 at 23:40, Thomas Evangelidis wrote:
> Greetings,
>
> Could someone please clarify how I can pass atomic partial charges to the
> ECFP fingerprint generator along with the default atomic properties that it
> considers? Can I pass the real charge values, or do I have to group them
> into bins and pass the bin identifier? I found a function in the utilsFP.py
> file which generates invariants as follows:
>
> def generateAtomInvariant(mol):
>     """
>     >>> generateAtomInvariant(Chem.MolFromSmiles("Cc1n1"))
>     [341294046, 3184205312, 522345510, 1545984525, 1545984525, 1545984525, 1545984525]
>     """
>     num_atoms = mol.GetNumAtoms()
>     invariants = [0]*num_atoms
>     for i,a in enumerate(mol.GetAtoms()):
>         descriptors=[]
>         descriptors.append(a.GetAtomicNum())
>         descriptors.append(a.GetTotalDegree())
>         descriptors.append(a.GetTotalNumHs())
>         descriptors.append(a.IsInRing())
>         descriptors.append(a.GetIsAromatic())
>         invariants[i]=hash(tuple(descriptors))& 0x
>     return invariants
>
> And then generate the fingerprint like this:
>
> fp = AllChem.GetMorganFingerprint(mol, radius=3,
>                                   invariants=generateAtomInvariant(mol))
>
> Would it suffice to add just this extra line to the generateAtomInvariant() function?
>
> descriptors.append(a.GetFormalCharge())
>
> I thank you in advance.
> Thomas
>
> --
> Dr. Thomas Evangelidis
> Research Scientist
> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
> Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>,
> Prague, Czech Republic
> &
> CEITEC - Central European Institute of Technology <https://www.ceitec.eu/>,
> Brno, Czech Republic
>
> email: teva...@gmail.com, Twitter: tevangelidis
> <https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis
> <https://www.linkedin.com/in/thomas-evangelidis-495b45125/>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/

--
The University of Tokyo
2nd year Ph.D. candidate
Shojiro Shibayama

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
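
One way to pass partial charges as labels rather than real numbers is to bin them and hash the bin index together with the other atom descriptors. The sketch below is only an illustration under several assumptions: Gasteiger charges, a bin width of 0.1, and a CRC32 hash in place of Python's `hash`. The helpers `charge_bin`, `pack_invariant`, `charged_invariants`, and `demo` are hypothetical names, and the SMILES inside `demo` is an example input that does not come from the thread.

```python
import zlib


def charge_bin(charge, width=0.1):
    """Map a real-valued charge to an integer bin index (assumed width 0.1)."""
    return int(round(charge / width))


def pack_invariant(descriptors):
    """Hash a sequence of ints into an unsigned 32-bit invariant."""
    # CRC32 of the repr is stable across runs and always fits in 32 bits
    return zlib.crc32(repr(tuple(descriptors)).encode("ascii"))


def charged_invariants(mol, width=0.1):
    """Descriptors from the question plus a binned partial charge.

    Expects '_GasteigerCharge' to be set on every atom already (see demo).
    """
    invariants = []
    for a in mol.GetAtoms():
        q = a.GetDoubleProp("_GasteigerCharge")
        invariants.append(pack_invariant([
            a.GetAtomicNum(),
            a.GetTotalDegree(),
            a.GetTotalNumHs(),
            int(a.IsInRing()),
            int(a.GetIsAromatic()),
            charge_bin(q, width),
        ]))
    return invariants


def demo():
    # RDKit is imported here so the helpers above can be reused without it
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("CC(=O)[O-]")  # example input, not from the thread
    AllChem.ComputeGasteigerCharges(mol)    # stores '_GasteigerCharge' per atom
    return AllChem.GetMorganFingerprint(
        mol, radius=2, invariants=charged_invariants(mol))
```

Atoms whose charges fall into the same bin get the same invariant, so the bin width controls how sensitive the fingerprint is to charge differences. Gasteiger charges may come back as NaN for atoms the scheme has no parameters for; such atoms would need separate handling before binning.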
Re: [Rdkit-discuss] Dividing inputstream over threads
Hi,

The `multiprocessing` module in the Python standard library may help you parallelize your code. I wrote code that converts SMILES to hashed Morgan fingerprints in parallel, described in the following short post. It took 10 minutes for 1.5 million compounds when 6 processes were used.

https://loudspeaker.sakura.ne.jp/devblog/2019/01/20/python-multiprocessing-write-strings-single/

`multiprocessing.Pool.imap` can be used directly in a for loop, so the loop body in the parent process can safely write to a text file or even to your SQL database. SQLAlchemy might be a good choice in Python, but I'm not sure. I hope you find a good SQL library or OR mapper for Python.

Sincerely yours,
Shojiro

On Tue, 15 Jan 2019, 01:54 Andreas Luttens wrote:
> Hi!
>
> I have developed a small script that calculates molecular properties for
> molecules that are stored in a SMILES file. The properties should be stored
> in an SQL database, which works fine, but I would like to speed up the
> process a bit. I was thinking of implementing some parallelization for
> calculating the properties and storing them through separate connections to
> my SQL database. I have done this before in Python with OpenEye and it seems
> to do the trick. I would, however, like my code to be usable by people who do
> not hold a license for OpenEye, which is why I am trying RDKit. I would like
> my code to be in C++ as well.
>
> I was wondering how I would tackle this problem. Does the RDKit have
> functionality similar to an "oemolithread" to chunk up the incoming stream?
> I haven't found something like this when I first scrolled through the
> documentation. If it is not implemented, how would I divide the work on
> incoming molecules over N threads?
>
> All help is very appreciated. Thanks in advance.
>
> Best regards,
>
> Andreas Luttens
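
As a rough sketch of the `Pool.imap` pattern described in the reply: worker processes compute one row per SMILES string, while only the parent process writes to the database, so a single connection is enough. The example uses `sqlite3` as a stand-in for the SQL backend. `compute_row` is a hypothetical placeholder (it only records the string length) for the real property calculation, and the file names `compounds.smi` and `props.db` are assumptions as well.

```python
import sqlite3
from functools import partial
from multiprocessing import Pool


def compute_row(smiles):
    """Placeholder for the per-molecule property calculation."""
    smiles = smiles.strip()
    return smiles, len(smiles)


def store_properties(smiles_iter, conn, mapper=map):
    """Compute rows with `mapper` and insert them from this process only."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS props (smiles TEXT, n_chars INTEGER)")
    n = 0
    for row in mapper(compute_row, smiles_iter):
        conn.execute("INSERT INTO props VALUES (?, ?)", row)
        n += 1
    conn.commit()
    return n


def main(smiles_path="compounds.smi", db_path="props.db", processes=6):
    conn = sqlite3.connect(db_path)
    with open(smiles_path) as fh, Pool(processes=processes) as pool:
        lines = (line for line in fh if line.strip())  # skip blank lines
        # imap yields results lazily and in input order
        parallel_map = partial(pool.imap, chunksize=1000)
        n = store_properties(lines, conn, mapper=parallel_map)
    conn.close()
    return n
```

Call `main()` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Passing the mapper as an argument keeps the database logic testable with the built-in `map`.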
Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance
Dear Ali,

Please first run the following code, which may help you:

```python
import numpy as np
np.argsort(rfregress.feature_importances_)[::-1]
```

`argsort` returns the indices that sort the feature importances in ascending order, and `[::-1]` reverses them so that the most important feature comes first. The indices correspond to the order of the input variables (that is, the order of 'allDescp' in your code), so indexing that list with them gives you the descriptor names you want.

Sincerely yours,
Shojiro

On Tue, 21 Aug 2018 at 10:34, Ali Eftekhari wrote:
> Hello rdkit,
>
> This might be trivial, but I am a beginner and don't know how to do it.
>
> I am building a simple model to predict a target property. I have a pandas
> dataframe (df) whose columns are 'SMILES' and 'Target'.
>
> #calculating the descriptors as below:
> allDescp=[name[0] for name in Descriptors._descList]
> calc=MoleculeDescriptors.MolecularDescriptorCalculator(allDescp)
> df['fp']=df['SMILES'].apply(lambda x:
>     calc.CalcDescriptors(Chem.MolFromSmiles(x)))
>
> #converting the fingerprint to numpy array
> y=df['Target'].values
> X=np.array(list(df['fp']))
>
> #preprocessing
> X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.25,
>     random_state=42)
> st=StandardScaler()
> X=st.fit_transform(X)
>
> #random forest model
> model=RandomForestRegressor(n_estimators=10)
> model.fit(X_train, y_train)
>
> My problem is that I don't know how to get meaningful
> feature_importance values. The following returns the importance values of
> the descriptors, but there are no labels, so I don't know how to figure out
> which features are important.
>
> print (sorted (rfregress.feature_importances_))
>
> Thanks for your help!

--
The University of Tokyo
2nd year Ph.D. candidate
Shojiro Shibayama
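
To make the link between indices and names explicit, here is a small sketch. It pairs each importance with the descriptor name at the same position and sorts by importance, which gives the same ranking as the reversed `argsort` (up to ties). The names and values below are made-up illustration data; in the thread they would be `allDescp` and `rfregress.feature_importances_`. `rank_features` is a hypothetical helper.

```python
def rank_features(names, importances, top=None):
    """Return (name, importance) pairs, most important first."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
    return ranked if top is None else ranked[:top]


# Illustration data only; real values come from the fitted model.
names = ["MolWt", "MolLogP", "TPSA", "NumHDonors"]
importances = [0.10, 0.55, 0.30, 0.05]

for name, score in rank_features(names, importances, top=3):
    print(f"{name}: {score:.2f}")
```

The length check catches the common mistake of ranking against a descriptor list that no longer matches the columns the model was trained on.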
Re: [Rdkit-discuss] enumeration of smiles question
--
The University of Tokyo
2nd year Ph.D. candidate
Shojiro Shibayama
Re: [Rdkit-discuss] How can I count the substructures with RDKit?
Dear Takayuki,

Thank you for your reply. What I want to do is count how many times each fragment defined in the MACCS keys occurs in a molecule, not count how many fragment types appear in it.

My temporary measure is to simply count the substructures using `mol.GetSubstructMatches`. Sample code is here: https://gist.github.com/sshojiro/c156c351fbc4e05e478a6acc1b7d4949

Right now, however, keys 1 (isotope), 125 (aromatic ring), and 166 (fragments) are ignored because their corresponding SMARTS are simply '?', which seems incompatible with GetSubstructMatches. If you know an alternative way to implement this, it would be a great help if you let me know.

Thanks in advance!

Best regards,
Shojiro

On 4 August 2018 at 22:32, Taka Seri wrote:
> Dear Shojiro,
>
> To count the number of on bits, you can use GetNumOnBits.
> http://www.rdkit.org/Python_Docs/rdkit.DataStructs.cDataStructs.ExplicitBitVect-class.html#GetNumOnBits
>
> from rdkit import Chem
> from rdkit.Chem import AllChem
> mol = Chem.MolFromSmiles('O1ccnccc1')
> maccsfp = AllChem.GetMACCSKeysFingerprint(mol)
> print(maccsfp.GetNumOnBits())
> # output is 16
>
> Kind regards,
>
> Takayuki
>
> On Sat, 4 Aug 2018 at 17:14, Shojiro Shibayama wrote:
>
>> Hi, community members,
>>
>> I'm looking for a way to count all fragments that I give for some
>> quantitative analysis. I want the count data based on e.g. MACCS key's
>> fragments instead of MACCS key 0/1 descriptor itself. Could anyone please
>> help me with this? Thanks in advance.
>>
>> Sincerely,
>> Shojiro

--
The University of Tokyo
2nd year Ph.D. candidate
Shojiro Shibayama
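
The counting approach described in this message might be sketched as follows. It assumes the MACCS definitions are read from `rdkit.Chem.MACCSkeys.smartsPatts` (a dict mapping key number to a `(SMARTS, count)` pair), skips the keys whose SMARTS is '?', and counts matches with `GetSubstructMatches`. The names `usable_smarts`, `count_keys`, and `demo` are hypothetical, and the linked gist may well do this differently.

```python
def usable_smarts(smarts_table):
    """Keep only keys whose pattern is a real SMARTS (drop the '?' entries)."""
    return {key: entry[0] for key, entry in smarts_table.items()
            if entry[0] != "?"}


def count_keys(mol, queries):
    """Count substructure matches per key; `queries` maps key -> query mol."""
    return {key: len(mol.GetSubstructMatches(q)) for key, q in queries.items()}


def demo(smiles="O1ccnccc1"):
    # RDKit is imported here so the helpers above can be reused without it
    from rdkit import Chem
    from rdkit.Chem import MACCSkeys

    smarts = usable_smarts(MACCSkeys.smartsPatts)
    queries = {key: Chem.MolFromSmarts(s) for key, s in smarts.items()}
    mol = Chem.MolFromSmiles(smiles)  # SMILES taken from the quoted reply
    return count_keys(mol, queries)
```

With its default arguments, `GetSubstructMatches` returns unique atom sets and stops at 1000 matches, so the counts reflect those settings. The skipped keys (1, 125, 166) would need dedicated code, for example ring or fragment perception, rather than a SMARTS query.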
[Rdkit-discuss] How can I count the substructures with RDKit?
Hi, community members,

I'm looking for a way to count all fragments that I give for some quantitative analysis. I want the count data based on e.g. MACCS key's fragments instead of MACCS key 0/1 descriptor itself. Could anyone please help me with this? Thanks in advance.

Sincerely,
Shojiro
Re: [Rdkit-discuss] Naming files in a loop
Hi,

I don't know which version of Python you use, but the following code should work in Python 3.5 or later:

    for i in chemicals:
        Draw.MolToFile(Chem.MolFromSmiles(i), 'Desktop/{}.png'.format(i))

Note that Draw.MolToFile expects a molecule rather than a SMILES string, so each string is converted with Chem.MolFromSmiles first. Alternatively, use zip() in the for loop to pair each compound with its corresponding name.

Best,
Shojiro

On Wed, Jul 11, 2018 at 9:04 AM Phuong Chau wrote:
> Hello,
> I have a list of chemicals such as chemicals=["Cc1c1",
> "C=Cc1c1","CCCc1c1"] and I want to use Draw.MolToFile to draw a 2D
> structure image of each of them. However, I am not sure how to name the
> files differently in the for loop. For example:
> for i in chemicals:
>     Draw.MolToFile(i, 'Desktop/i.png')
>
> I want each image file to be named after the SMILES string of that
> chemical, such as Cc1c1.png. Is it possible for me to do that in a
> Python script, or do I have to do it one by one?
>
> Thank you so much for your help!
>
> --
> Phuong Chau
> Smith College '20
> Engineering Major
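
The `zip()` variant could look like the sketch below. It pairs each SMILES with a separate name and also guards against SMILES characters such as `/` and `\`, which cannot appear in file names. `safe_filename` and `draw_all` are hypothetical helpers, and percent-encoding is just one possible escaping scheme.

```python
import os
from urllib.parse import quote


def safe_filename(label, ext=".png"):
    """Percent-encode characters that are unsafe in file names."""
    return quote(label, safe="=()[]@+-") + ext


def draw_all(smiles_list, names, out_dir="Desktop"):
    """Write one PNG per molecule, named after the matching entry in names."""
    # RDKit is imported here so safe_filename can be used on its own
    from rdkit import Chem
    from rdkit.Chem import Draw

    paths = []
    for smi, name in zip(smiles_list, names):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # unparsable SMILES: skip instead of crashing
            continue
        path = os.path.join(out_dir, safe_filename(name))
        Draw.MolToFile(mol, path)
        paths.append(path)
    return paths
```

To name each file after its own SMILES string, as asked in the question, call `draw_all(chemicals, chemicals)`.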