Re: [Rdkit-discuss] RDKit descriptors batch

2010-06-02 Thread Greg Landrum
Dear DK,

Cedric's answer is a good start that looks like it would work. I'd
refine it a little bit and use a somewhat different mechanism for
calculating the descriptors. Here's a piece of Python that, given a
descriptor calculator file (see below), will do what I think you want:

#---
from rdkit import Chem
from rdkit.RDLogger import logger
logger=logger()
import cPickle,sys

# load the pickled descriptor calculator
calc = cPickle.load(file('moe_like.dsc','rb'))
nms = list(calc.GetDescriptorNames())
suppl = Chem.SmilesMolSupplier(sys.argv[1],titleLine=False)
w = Chem.SmilesWriter(sys.argv[2])
w.SetProps(nms)
nDone=0
for mol in suppl:
    nDone += 1
    if not nDone%1000: logger.info("Done %d"%nDone)
    # skip input lines that could not be parsed
    if mol is None: continue
    descrs = calc.CalcDescriptors(mol)
    # store each descriptor value as a property so the writer outputs it
    for nm,v in zip(nms,descrs):
        mol.SetProp(nm,str(v))
    w.write(mol)
#---

The script uses the first argument as the input file name and the
second argument as the output file name. It uses a descriptor
calculator that it loads from a file named "moe_like.dsc" (there's a
file with this name that will work in the directory
$RDBASE/Projects/DbCLI/).

To make this script easier to run, I'd suggest wrapping it in a shell
script (Linux/Mac) or .bat file (Windows) that sets the RDBASE
environment variable, the PATH, and the LD_LIBRARY_PATH (Linux) or
DYLD_LIBRARY_PATH (Mac). On Windows you can call pythonw.exe instead of
python.exe to avoid opening a new window.

So what's a descriptor calculator? This is a mechanism provided by the
RDKit that allows you to package a set of descriptors together for
easy reuse. It's useful if you don't want to generate everything (as
Cedric's script does) or want to be sure you always generate the same
descriptors in the same order (the version from Cedric will generate
new descriptors as they become available; these new descriptors could
change the ordering of the old ones).
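
The ordering point can be illustrated without the RDKit at all. The
following is only a minimal sketch of the idea, not the RDKit
implementation: REGISTRY, FixedCalculator and the toy "descriptors"
(simple character counts on a SMILES string) are all hypothetical names
made up for the illustration.

```python
# Hypothetical stand-in for a table of available descriptors. These toy
# functions work on SMILES strings and are NOT RDKit descriptors.
REGISTRY = {
    "NumChars": len,
    "NumCarbons": lambda smi: smi.count("C") + smi.count("c"),
    "NumOxygens": lambda smi: smi.count("O") + smi.count("o"),
}


class FixedCalculator:
    """Computes a fixed set of named descriptors, always in the same order."""

    def __init__(self, names):
        self.names = tuple(names)

    def get_descriptor_names(self):
        return self.names

    def calc_descriptors(self, smi):
        return tuple(REGISTRY[nm](smi) for nm in self.names)


calc = FixedCalculator(["NumOxygens", "NumCarbons"])
before = calc.calc_descriptors("CC(=O)O")

# A new descriptor becomes available later on ...
REGISTRY["AAA_New"] = lambda smi: 0

# ... the calculator's output is unchanged ...
after = calc.calc_descriptors("CC(=O)O")

# ... but a "compute everything" loop over the sorted table now starts
# with a different column.
everything = sorted(REGISTRY)
print(before, after, everything)
```

Code that walks the whole table picks up the new entry and shifts its
columns; code that goes through the calculator does not.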

Here's an example of how to create a new descriptor calculator (from
Python) and then save it to a .dsc file you could use in the sample
script above. In case you aren't familiar with python at all, this is
showing what I typed at the python prompt and how python responded:
In [1]: from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator
In [2]: calc = MolecularDescriptorCalculator(['MolLogP','NOCount','NHOHCount','MolWt','NumRotatableBonds','TPSA'])
In [3]: import cPickle
In [4]: cPickle.dump(calc,file('simple_2d.dsc','wb'))

And, to show how the calculator is used inside Python:

In [5]: from rdkit import Chem
In [6]: m = Chem.MolFromSmiles('c1n1CC(=O)O')
In [7]: calc.CalcDescriptors(m)
Out[7]: (0.708699989, 3, 1, 137.138001, 2, 50.188)

Best Regards,
-greg


Re: [Rdkit-discuss] RDKit descriptors batch

2010-06-02 Thread Cedric MORETTI
Not tested


# script RD_descript.py
print "Hello from RD_descript "

from cinfony import rdk
from rdkit import Chem
from rdkit.Chem import AvailDescriptors

# list the names of all available descriptors
for d in AvailDescriptors.descDict:
   print d


suppl = open("Name file","r")   # input file, one SMILES per line
w = Chem.SDWriter("SDF File")   # output SD file
numRead = 0
numStructures = 0
for line in suppl:
   numRead += 1
   # MolFromSmiles returns None when the SMILES cannot be parsed
   m = Chem.MolFromSmiles(line.strip())
   if m is not None:
      numStructures += 1
      smi = Chem.MolToSmiles(m)
      m.SetProp("SMILES",smi)
      print smi
      # compute every available descriptor and store it as a property
      for d in AvailDescriptors.descDict:
#        print d
         pr = AvailDescriptors.descDict[d](m)
#        print str(pr)
         m.SetProp(d,str(pr))
      w.write(m)

print "initial count = " + str(numRead )
print "final count = " + str(numStructures)

From: Damjan Krstajic [mailto:dkrsta...@hotmail.com]
Sent: Wednesday, 2 June 2010 11:21
To: rdkit-discuss@lists.sourceforge.net
Subject: [Rdkit-discuss] RDKit descriptors batch

Hello,

I would like to use RDKit to calculate descriptors. I am interested in a batch 
program which would calculate the RDKit descriptors from a smiles file (.smi). 
I don't have any experience with Python. Do you have any advice on how to 
create the batch program? I am prepared to code it and give it to you so that 
others can use it.

Thanks
DK

--

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKit descriptors

2009-01-30 Thread Greg Landrum
Hi Noel,


On Fri, Jan 30, 2009 at 10:31 AM, Noel O'Boyle wrote:
>
> Am I right in saying that there are no 3D descriptors in RDKit? That
> they're all either 1D or 2D?

Not completely. The feature map and feature-map vector code is a 3D
descriptor. One could argue that the shape-similarity stuff is also
something of a descriptor.

-greg



Re: [Rdkit-discuss] RDKit Descriptors

2008-09-18 Thread Greg Landrum
On Thu, Sep 18, 2008 at 11:07 PM, Robert DeLisle wrote:
> Greg,
>
> Thank you for the response.
>
> I was able to get PEOE_VSA1 through PEOE_VSA14, SMR_VSA1 through SMR_VSA10,
> and EState_VSA1 through EState_VSA11 working.  Are these the correct limits
> on the vector components?

Yes. Just in case you used a more painful approach, here's the
simplest way to check (without looking at the source in
$RDBASE/Python/Chem/MolSurf.py):
[17] >>> [x for x in AvailDescriptors.descDict.keys() if x.find('PEOE_VSA')!=-1]
Out[17]:
['PEOE_VSA14',
 'PEOE_VSA13',
 'PEOE_VSA12',
 'PEOE_VSA11',
 'PEOE_VSA10',
 'PEOE_VSA8',
 'PEOE_VSA7',
 'PEOE_VSA6',
 'PEOE_VSA5',
 'PEOE_VSA4',
 'PEOE_VSA3',
 'PEOE_VSA2',
 'PEOE_VSA1',
 'PEOE_VSA9']
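
Dictionary keys come back in arbitrary order, so the limits are easier
to read once the names are sorted by their integer suffix. A small
sketch of one way to do that in plain Python follows; the helper name
vector_components and the short list of names are made up for the
example, with the list standing in for AvailDescriptors.descDict.keys().

```python
import re


def vector_components(names, prefix):
    """Return the names of the form <prefix><integer>, sorted by the integer."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    hits = []
    for name in names:
        match = pattern.match(name)
        if match:
            hits.append((int(match.group(1)), name))
    return [name for _, name in sorted(hits)]


# Hypothetical subset of descriptor names, standing in for descDict.keys()
names = ["PEOE_VSA10", "SMR_VSA2", "PEOE_VSA2", "MolWt", "PEOE_VSA1", "VSA_EState3"]
components = vector_components(names, "PEOE_VSA")
print(components)
```

The first and last entries of the result then give the lower and upper
limits of the vector directly.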

> I was unable, however, to get Slogp_VSA or VSA_EState working with any
> integer suffix between 1 and 10.

That's strange. What errors were you getting?

> I've also done a correlation analysis on all the descriptors that I've
> gotten working.  After computing descriptors for some 24,000 compounds I
> removed those with less than 10% variance and limited correlations between
> variables to a maximum of 0.85 (using KNIME).  I'm happy to send a list of
> the resulting descriptors or a correlation matrix if you or anyone else is
> interested.

Sounds interesting. If you are willing, I would be happy to put this
on the wiki, linked from the descriptors page. It would be best if you
could also describe the source of the 24K compounds (or provide SMILES
for them).

-greg



Re: [Rdkit-discuss] RDKit Descriptors

2008-09-18 Thread Robert DeLisle
Greg,

Thank you for the response.

I was able to get PEOE_VSA1 through PEOE_VSA14, SMR_VSA1 through SMR_VSA10,
and EState_VSA1 through EState_VSA11 working.  Are these the correct limits
on the vector components?

I was unable, however, to get Slogp_VSA or VSA_EState working with any
integer suffix between 1 and 10.

I've also done a correlation analysis on all the descriptors that I've
gotten working.  After computing descriptors for some 24,000 compounds I
removed those with less than 10% variance and limited correlations between
variables to a maximum of 0.85 (using KNIME).  I'm happy to send a list of
the resulting descriptors or a correlation matrix if you or anyone else is
interested.
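
For readers without KNIME, a filter of this kind can be sketched in
plain Python. This is not the KNIME workflow described above, only an
assumed equivalent: the message does not say exactly how "10% variance"
was measured, so the sketch uses a placeholder absolute variance
threshold, and it drops correlated columns greedily in input order. The
function names and the toy data are invented for the example.

```python
import math


def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def filter_descriptors(columns, min_variance=1e-12, max_corr=0.85):
    """columns maps descriptor name -> list of values (one per compound).

    Drops (near-)constant columns, then keeps a column only if its
    absolute correlation with every column kept so far is <= max_corr.
    """
    kept = []
    for name, values in columns.items():
        if variance(values) <= min_variance:
            continue  # no information; also avoids dividing by zero below
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # redundant with a descriptor we already kept
        kept.append(name)
    return kept


# Toy data: B is almost 2*A, D is constant.
columns = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.0, 9.8],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
    "D": [7.0, 7.0, 7.0, 7.0, 7.0],
}
kept = filter_descriptors(columns)
print(kept)
```

Because the selection is greedy, the surviving set depends on the order
of the columns; a different ordering can keep B instead of A.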





Re: [Rdkit-discuss] RDKit Descriptors

2008-09-17 Thread Greg Landrum
Dear Kirk,

On Thu, Sep 18, 2008 at 12:58 AM, Robert DeLisle wrote:
> I've finally found time to start using RDKit and started with descriptor
> calculation.  Following the examples on the wiki
> (http://code.google.com/p/rdkit/wiki/DescriptorsInTheRDKit), I get a
> KeyError any time I attempt to obtain HeavyAtomCount, RingCount,

HeavyAtomCount and RingCount were introduced after the May release, so
they're in the subversion version of the code. They will be in the Q3
release (which will happen sometime in the next couple of weeks,
hopefully).

> PEOP_VSA,
> SMR_VSA, Slogp_VSA, EState_VSA, and VSA_Estate.

The various X_VSA descriptors are vector-valued and you access them by
element, so you could ask for PEOE_VSA4 or Slogp_VSA10.

> (BTW, what is the
> difference between the two last VSA descriptors?)

The "standard" VSA descriptors sum atomic VSA values into bins
determined by the other descriptor. So, for example, SMR_VSA uses
atomic contributions to the VSA and uses bins determined by atomic
contributions to the SMR. EState_VSA is the same, it just uses atomic
EState values. VSA_EState is reversed: atomic EState values are put
into bins determined by the VSA contributions.
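
The difference is easiest to see with numbers. The sketch below is a toy
illustration in plain Python, not the code in MolSurf.py: the per-atom
values and the bin boundaries are invented, and binned_sum is a
hypothetical helper.

```python
from bisect import bisect_right


def binned_sum(values, keys, edges):
    """Sum values[i] into the bin selected by keys[i].

    edges are the sorted bin boundaries; there are len(edges) + 1 bins,
    the last one catching every key at or above the final edge.
    """
    totals = [0.0] * (len(edges) + 1)
    for value, key in zip(values, keys):
        totals[bisect_right(edges, key)] += value
    return totals


# Invented per-atom contributions for a four-atom molecule
vsa = [4.0, 6.0, 5.0, 3.0]
estate = [0.5, 1.5, 2.5, 1.2]

# EState_VSA style: sum VSA, with bins chosen by each atom's EState value
estate_vsa = binned_sum(vsa, estate, edges=[1.0, 2.0])

# VSA_EState style: sum EState, with bins chosen by each atom's VSA value
vsa_estate = binned_sum(estate, vsa, edges=[3.5, 5.5])

print(estate_vsa, vsa_estate)
```

Both calls use the same two per-atom lists; only the roles of "what is
summed" and "what picks the bin" are swapped.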

Best Regards,
-greg