Re: [Rdkit-discuss] How to pass atomic charge as atom invariant to ECFP?

2019-11-27 Thread Shojiro Shibayama
Dear Thomas,

You can get the SMILES of the substructures extracted by the
`GetMorganFingerprint` function as follows. You can then append any label
to the SMILES string, but not real numbers.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
mol = Chem.MolFromSmiles('Cc1n1')
info = {}
AllChem.GetMorganFingerprint(mol, radius=2, bitInfo=info)
radius, atom_id = list(info.values())[0][0][::-1]
env = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_id)
sub_struct = Chem.PathToSubmol(mol, env)
type(sub_struct) #=> rdkit.Chem.rdchem.Mol
Chem.MolToSmiles(sub_struct) #=>  'ccc'
```
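
On the original question of partial charges: atom invariants have to be integers, so one possible approach is to bin the real-valued charges before hashing them together with the other atom properties. The sketch below assumes Gasteiger charges, a bin width of 0.1, and a 32-bit mask; all three are illustrative choices rather than values from this thread, and `charge_aware_invariants` is a hypothetical helper name. Formal charges are already integers, so `GetFormalCharge()` can be added to the tuple directly.

```python
import math

from rdkit import Chem
from rdkit.Chem import AllChem


def charge_aware_invariants(mol, bin_width=0.1):
    """Return one 32-bit integer per atom; the partial charge enters as a bin index."""
    # Stores a '_GasteigerCharge' double property on every atom.
    AllChem.ComputeGasteigerCharges(mol)
    invariants = []
    for atom in mol.GetAtoms():
        charge = atom.GetDoubleProp('_GasteigerCharge')
        if math.isnan(charge) or math.isinf(charge):
            charge = 0.0  # assumption: treat a failed charge assignment as neutral
        charge_bin = int(round(charge / bin_width))
        descriptors = (
            atom.GetAtomicNum(),
            atom.GetTotalDegree(),
            atom.GetTotalNumHs(),
            atom.IsInRing(),
            atom.GetIsAromatic(),
            atom.GetFormalCharge(),
            charge_bin,
        )
        # Mask to an unsigned 32-bit value (assumed width for this sketch).
        invariants.append(hash(descriptors) & 0xFFFFFFFF)
    return invariants


mol = Chem.MolFromSmiles('CC(=O)[O-]')  # acetate, chosen only for illustration
invariants = charge_aware_invariants(mol)
fp = AllChem.GetMorganFingerprint(mol, 2, invariants=invariants)
print(len(invariants), len(fp.GetNonzeroElements()))
```

A narrower bin width separates atoms more finely but makes the fingerprint more sensitive to small changes in the computed charges, so the width is worth tuning for the data set at hand.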

Best,

On Fri, 22 Nov 2019 at 23:40, Thomas Evangelidis wrote:

> Greetings,
>
> Could someone please clarify how I can pass atomic partial charges to the
> ECFP fingerprint generator along with the default atomic properties that it
> considers? Can I pass the real charge values, or do I have to group them
> into bins and pass the bin identifier? I found a function in the utilsFP.py
> file which generates invariants as follows:
>
> def generateAtomInvariant(mol):
>     """
>     >>> generateAtomInvariant(Chem.MolFromSmiles("Cc1n1"))
>     [341294046, 3184205312, 522345510, 1545984525, 1545984525, 1545984525,
>     1545984525]
>     """
>     num_atoms = mol.GetNumAtoms()
>     invariants = [0]*num_atoms
>     for i,a in enumerate(mol.GetAtoms()):
>         descriptors=[]
>         descriptors.append(a.GetAtomicNum())
>         descriptors.append(a.GetTotalDegree())
>         descriptors.append(a.GetTotalNumHs())
>         descriptors.append(a.IsInRing())
>         descriptors.append(a.GetIsAromatic())
>         invariants[i]=hash(tuple(descriptors))& 0x
>     return invariants
>
>
> And then generate the fingerprint like this:
>
>
> fp = AllChem.GetMorganFingerprint(mol, radius=3, 
> invariants=generateAtomInvariant(mol))
>
>
> Would it suffice to add just this extra line to the generateAtomInvariant() function?
>
>
> descriptors.append(a.GetFormalCharge())
>
>
>
> I thank you in advance.
> Thomas
>
>
>
> --
>
> ==
>
> Dr. Thomas Evangelidis
>
> Research Scientist
>
> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
> Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>, 
> Prague,
> Czech Republic
>   &
> CEITEC - Central European Institute of Technology <https://www.ceitec.eu/>
> , Brno, Czech Republic
>
> email: teva...@gmail.com, Twitter: tevangelidis
> <https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis
> <https://www.linkedin.com/in/thomas-evangelidis-495b45125/>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>


-- 

The University of Tokyo
2nd year Ph.D. candidate
  Shojiro Shibayama



Re: [Rdkit-discuss] Dividing inputstream over threads

2019-01-20 Thread Shojiro Shibayama
Hi,

The Python standard library module `multiprocessing` may help you parallelize
your code.

I wrote some code that converts SMILES to hashed Morgan fingerprints using
parallel computation, described in the following short post. The code took
10 minutes for 1.5m compounds when 6 processes were used.
https://loudspeaker.sakura.ne.jp/devblog/2019/01/20/python-multiprocessing-write-strings-single/

`multiprocessing.Pool.imap` can be used in a for loop, so that the loop body
can safely write to a text file or even to your SQL database. I guess
SQLAlchemy might be a good choice in Python, but I'm not sure. I hope you
find a good SQL library or O/R mapper for Python.
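
The pattern described above can be sketched as follows: worker processes do the per-molecule computation, while only the parent process consumes the results and writes them out, so the output is never written concurrently. The worker here is a placeholder that just measures the string length; `smiles_to_record` and `convert` are hypothetical names, and in practice the worker would call RDKit and the loop body could insert into a database instead of writing to a file handle.

```python
import multiprocessing as mp
import sys


def smiles_to_record(line):
    """Per-molecule work done in a worker process (placeholder computation)."""
    smiles = line.strip()
    return "{}\t{}".format(smiles, len(smiles))


def convert(lines, out_handle, processes=2, chunksize=100):
    """Compute records in parallel; only this (parent) process writes output."""
    with mp.Pool(processes) as pool:
        # imap yields results lazily and in input order.
        for record in pool.imap(smiles_to_record, lines, chunksize=chunksize):
            out_handle.write(record + "\n")


if __name__ == "__main__":
    # Illustrative input; a real script would iterate over an open SMILES file.
    convert(["CCO\n", "c1ccccc1\n", "CC(=O)O\n"], sys.stdout)
```

Because `imap` accepts any iterable, an open file can be passed directly as `lines`, which avoids loading the whole SMILES file into memory.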

Sincerely yours,
Shojiro


On Tue, 15 Jan 2019, 01:54 Andreas Luttens wrote:

> Hi!
>
> I have developed a small script that calculates molecular properties for
> molecules that are stored in a SMILES file. The properties should be stored
> in an SQL database, which works fine, but I would like to speed up the
> process a bit. I was thinking of implementing some parallelization for
> calculating the properties and storing them via separate connections to my
> SQL database. I have done this before in Python with OpenEye, and it seems
> to do the trick. However, I want my code to be usable by people who do
> not hold a license for OpenEye, which is why I am trying RDKit. I would like
> my code to be in C++ as well.
>
> I was wondering how I would tackle this problem. Does the RDKit have a
> similar functionality as an "oemolithread" to chunk up the incoming stream?
> I haven't found something like this when I first scrolled through
> documentation. If it is not implemented, how would I divide the work on
> incoming molecules over N threads?
>
> All help is very appreciated. Thanks in advance.
>
> Best regards,
>
> Andreas Luttens
>


Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance

2018-08-20 Thread Shojiro Shibayama
Dear Ali,

Please first try the following code, which may help you:

```python
import numpy as np
np.argsort(rfregress.feature_importances_)[::-1]
```

`argsort` returns the indices that would sort the feature importances in
ascending order, and `[::-1]` reverses them so that the most important
feature comes first. These indices correspond to the order of the variables
(that is, the order in 'allDescp' in your code), so if you index that list
with them, you'll get the information that you want.
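
As a small illustration of mapping the sorted indices back to names, the sketch below uses a made-up list of four descriptor names and made-up importance values; in the real case these would be 'allDescp' and the fitted model's `feature_importances_`.

```python
import numpy as np

# Hypothetical stand-ins for the descriptor names and the fitted importances.
feature_names = ["MolWt", "MolLogP", "TPSA", "NumHDonors"]
importances = np.array([0.10, 0.55, 0.30, 0.05])

# Indices of the features, most important first.
order = np.argsort(importances)[::-1]

# Pair each index with its name and importance value.
ranked = [(feature_names[i], float(importances[i])) for i in order]
for name, score in ranked:
    print("{:12s} {:.2f}".format(name, score))
```

This only works if the columns of the training matrix are in the same order as the list of names, which is the case when both come from the same descriptor calculator.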

Sincerely yours,
Shojiro


On Tue, 21 Aug 2018 at 10:34, Ali Eftekhari wrote:

> Hello rdkit,
>
> This might be trivial but I am beginner and don't know how to do it.
>
> I am building a simple model to predict target property.  I have pandas
> dataframe (df) whose columns are 'SMILES' and 'Target'.
>
> #calculating the descriptors as below:
> allDescp=[name[0] for name in Descriptors._descList]
> calc=MoleculeDescriptors.MolecularDescriptorCalculator(allDescp)
> df ['fp']=df['SMILES'].apply(lambda x:
> calc.CalcDescriptors(Chem.MolFromSmiles(x)))
>
> #converting  the fingerprint to numpy array
> y=df['Target'].values
> X=np.array(list(df['fp']))
>
> #preprocessing
> X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.25,
> random_state=42)
> st=StandardScaler()
> X=st.fit_transform(X)
>
> #random forest model
> model=RandomForestRegressor(n_estimators=10)
> model.fit(X_train, y_train)
>
> My problem is that I don't know how to get the meaningful
> feature_importance.  The following will return the values of descriptors
> but there is no labels and so I don't know how to figure out which features
> are important.
>
> print (sorted (rfregress.feature_importances_))
>
> Thanks for your help!
>
>
>
>
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>


-- 

The University of Tokyo
2nd year Ph.D. candidate
  Shojiro Shibayama



Re: [Rdkit-discuss] enumeration of smiles question

2018-08-06 Thread Shojiro Shibayama


-- 

The University of Tokyo
2nd year Ph.D. candidate
  Shojiro Shibayama



Re: [Rdkit-discuss] How can I count the substructures with RDKit?

2018-08-04 Thread Shojiro Shibayama
Dear Takayuki,

Thank you for your reply. What I want to do is to count substructures based
on the fragments in the MACCS keys, not to count the number of fragment
types that appear in a molecule. My temporary measure is to simply count the
substructures using `mol.GetSubstructMatches`. Sample code is here:
https://gist.github.com/sshojiro/c156c351fbc4e05e478a6acc1b7d4949

But, right now, 1: isotope, 125: aromatic ring, and 166: fragments are
ignored because their corresponding SMARTS are simply '?', which seems
incompatible with GetSubstructMatches.
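
A minimal sketch of this counting approach is given below; it is not the code from the gist. It assumes that the key definitions are read from `MACCSkeys.smartsPatts` (a dictionary of key number to SMARTS and threshold), skips every key whose SMARTS is '?', and uses a hypothetical helper name `count_maccs_fragments` and an arbitrary test molecule.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys


def count_maccs_fragments(mol):
    """Map each MACCS key number to its number of substructure matches in mol."""
    counts = {}
    for key, (smarts, _threshold) in MACCSkeys.smartsPatts.items():
        if smarts == '?':
            continue  # no usable SMARTS for this key (e.g. 1, 125, 166)
        pattern = Chem.MolFromSmarts(smarts)
        if pattern is None:
            continue  # defensive: skip a pattern that fails to parse
        counts[key] = len(mol.GetSubstructMatches(pattern))
    return counts


mol = Chem.MolFromSmiles('CCOC(=O)c1ccccc1')  # ethyl benzoate, for illustration
counts = count_maccs_fragments(mol)
print({key: n for key, n in counts.items() if n > 0})
```

Note that `GetSubstructMatches` returns unique matches by default and caps the number of matches per pattern, so very large molecules may need the `maxMatches` argument raised.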

If you know of any alternative implementations, it would be a great help
if you could let me know. Thanks in advance!

Best regards,
Shojiro

On 4 August 2018 at 22:32, Taka Seri wrote:

> Dear Shojiro,
>
> To count the number of on bits, you can use GetNumOnBits.
> http://www.rdkit.org/Python_Docs/rdkit.DataStructs.cDataStructs.ExplicitBitVect-class.html#GetNumOnBits
>
> from rdkit import Chem
> from rdkit.Chem import AllChem
> mol = Chem.MolFromSmiles('O1ccnccc1')
> maccsfp = AllChem.GetMACCSKeysFingerprint(mol)
> print(maccsfp.GetNumOnBits())
> # output is 16
>
>
> Kind regards,
>
> Takayuki
>
> On Sat, 4 Aug 2018 at 17:14, Shojiro Shibayama wrote:
>
>> Hi, community members,
>>
>> I'm looking for a way to count all fragments that I give for some
>> quantitative analysis. I want the count data based on e.g. MACCS key's
>> fragments instead of MACCS key 0/1 descriptor itself. Could anyone please
>> help me with this? Thanks in advance.
>>
>> Sincerely,
>> Shojiro
>


-- 

The University of Tokyo
2nd year Ph.D. candidate
  Shojiro Shibayama



[Rdkit-discuss] How can I count the substructures with RDKit?

2018-08-04 Thread Shojiro Shibayama
Hi, community members,

I'm looking for a way to count all fragments that I give for some
quantitative analysis. I want the count data based on e.g. MACCS key's
fragments instead of MACCS key 0/1 descriptor itself. Could anyone please
help me with this? Thanks in advance.

Sincerely,
Shojiro


Re: [Rdkit-discuss] Naming files in a loop

2018-07-11 Thread Shojiro Shibayama
Hi,

I don't know which version of Python you use, but the following code should
work in Python 3.5 or so. Note that Draw.MolToFile expects a molecule
object, so the SMILES string is parsed first:

for i in chemicals:
    Draw.MolToFile(Chem.MolFromSmiles(i), 'Desktop/{}.png'.format(i))

Alternatively, use zip() in the for loop to pair each compound with its
corresponding name.
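
A minimal sketch of the zip() variant, under some assumptions: the SMILES strings and compound names below are illustrative rather than taken from this thread, and `image_path` is a hypothetical helper. Since SMILES can contain '/' and '\' (double-bond stereochemistry), which are path separators, the helper replaces them before building the file name.

```python
import os


def image_path(smiles, name=None, directory="Desktop"):
    """Build a PNG path from a name (or the SMILES), replacing path separators."""
    stem = name if name is not None else smiles
    for bad in ('/', '\\'):
        stem = stem.replace(bad, '_')
    return os.path.join(directory, stem + ".png")


smiles_list = ["CCO", "C/C=C/C"]        # illustrative SMILES
names = ["ethanol", "trans-2-butene"]   # hypothetical compound names

for smi, name in zip(smiles_list, names):
    print(image_path(smi), image_path(smi, name))
    # With RDKit imported, the drawing call would then be:
    # Draw.MolToFile(Chem.MolFromSmiles(smi), image_path(smi, name))
```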

Best,
Shojiro

On Wed, Jul 11, 2018 at 9:04 AM Phuong Chau wrote:

> Hello,
> I have a list of chemicals such as chemicals=["Cc1c1",
> "C=Cc1c1","CCCc1c1"] and I want to use Draw.MolToFile to draw 2D
> structure image of each of them. However, I am not sure how to name it
> differently in the for loop. Like for example:
> for i in chemicals:
>     Draw.MolToFile(i, 'Desktop/i.png')
>
> I want the image file name to be the SMILES string of that
> chemical, such as Cc1c1.png. Is it possible for me to do that in the
> Python script, or do I have to do it one by one?
>
> Thank you so much for your help!
>
> --
> Phuong Chau
> Smith College '20
> Engineering Major
>
>