Re: [Rdkit-discuss] want advice for good teaching data set

2018-08-30 Thread Andrew Dalke
Thanks for the responses. I'll merge them into one reply:


On Aug 29, 2018, at 16:56, Eloy Félix  wrote:
> If you want to build a model, I guess what you want is experimental 
> logP values.
> 
> This should give you something to start with:
> 
> select ACTIVITY_ID, MOLREGNO, STANDARD_VALUE, STANDARD_TYPE from ACTIVITIES 
> where STANDARD_TYPE = 'LogP' and STANDARD_VALUE is not null and 
> data_validity_comment is null and POTENTIAL_DUPLICATE = 0;

Yes, that's what I was looking for, including the pointers for filtering on 
validity and on potential duplicates. Thanks!
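
For anyone following along, here is a minimal sketch of running that query from Python with the standard-library sqlite3 module. The helper name fetch_experimental_logp is hypothetical (it is not part of ChEMBL or RDKit), and the caller has to supply a connection to wherever the ChEMBL 24 SQLite file was unpacked.

```python
import sqlite3

# Eloy's query, unchanged apart from layout. SQLite treats table and
# column names case-insensitively, so lower case works as well.
LOGP_QUERY = """
    SELECT activity_id, molregno, standard_value, standard_type
      FROM activities
     WHERE standard_type = 'LogP'
       AND standard_value IS NOT NULL
       AND data_validity_comment IS NULL
       AND potential_duplicate = 0
"""


def fetch_experimental_logp(conn):
    """Return (activity_id, molregno, standard_value, standard_type) rows.

    `conn` is an open sqlite3 connection to a ChEMBL SQLite database.
    (Hypothetical helper, for illustration only.)
    """
    return conn.execute(LOGP_QUERY).fetchall()
```

Typical use would be something like fetch_experimental_logp(sqlite3.connect(path)), where path points at the downloaded database file. The rows still need to be joined to structures before any descriptors can be computed.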


On Aug 29, 2018, at 15:51, TJ O'Donnell  wrote:
> ChEMBL 24 has compound properties in the table compound_properties.  I think 
> the alogp is computed using (Crippen) atom types and the acd_logp uses 
> ACD Labs methods.

I can see I wasn't clear. I was looking for experimental data.

The ChEMBL blog post at 
https://chembl.blogspot.com/2018/05/chembl-24-released.html says that they 
switched to using RDKit for alogp; acd_logp is still from ACD.
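
The two kinds of data can still be put side by side. The following is a rough sketch, assuming the alogp and acd_logp columns of compound_properties that TJ mentions and a join on molregno; compare_logp and mean_abs_error are hypothetical helper names, and the mean absolute error is just one simple way to summarize the differences.

```python
import sqlite3

# Experimental logP rows (Eloy's filters) joined to the computed values.
# Assumes compound_properties has molregno, alogp and acd_logp columns.
COMPARE_QUERY = """
    SELECT a.molregno, a.standard_value, p.alogp, p.acd_logp
      FROM activities AS a
      JOIN compound_properties AS p ON p.molregno = a.molregno
     WHERE a.standard_type = 'LogP'
       AND a.standard_value IS NOT NULL
       AND a.data_validity_comment IS NULL
       AND a.potential_duplicate = 0
"""


def mean_abs_error(pairs):
    """Mean absolute difference over (measured, computed) pairs.

    Pairs with a missing (NULL/None) value are skipped; returns None
    if nothing is left to compare.
    """
    diffs = [abs(m - c) for m, c in pairs if m is not None and c is not None]
    return sum(diffs) / len(diffs) if diffs else None


def compare_logp(conn):
    """Summarize how far alogp and acd_logp are from the measured values."""
    rows = conn.execute(COMPARE_QUERY).fetchall()
    return {
        "n": len(rows),
        "alogp_mae": mean_abs_error((exp, alogp) for _, exp, alogp, _ in rows),
        "acd_logp_mae": mean_abs_error((exp, acd) for _, exp, _, acd in rows),
    }
```

A compound measured more than once appears once per measurement here, so the summary is weighted toward frequently measured compounds.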


On Aug 29, 2018, at 18:07, JW Feng via Rdkit-discuss wrote:
> What about building QSAR models to predict activity for a particular ChEMBL 
> assay?  This would allow you to discuss the strengths and limitations of QSAR 
> models.


I am, primarily, a software developer working in computational chemistry. Do 
you want fast similarity search? I can do that. Do you want a maximum common 
structure algorithm, or matched molecular pair algorithm? I can do that. Do you 
want to tell me which parameters and learning algorithm you want to use? I can 
make the pieces go together.

What I don't have is the expertise to build a chemically relevant model on my 
own, and discuss its strengths and weaknesses.

When I build a model, I do it to predict molecular weight. :)

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] want advice for good teaching data set

2018-08-29 Thread Eloy Félix
Hi Andrew,

If you want to build a model, I guess what you want is
experimental logP values.

This should give you something to start with:

select ACTIVITY_ID, MOLREGNO, STANDARD_VALUE, STANDARD_TYPE from ACTIVITIES
where STANDARD_TYPE = 'LogP' and STANDARD_VALUE is not null and
data_validity_comment is null and POTENTIAL_DUPLICATE = 0;

Eloy.


2018-08-29 14:51 GMT+01:00 TJ O'Donnell :

> Hi Andrew
> ChEMBL 24 has compound properties in the table compound_properties.  I
> think the alogp is computed using (Crippen) atom types and the acd_logp
> uses ACD Labs methods.
> TJ
>
> On Wed, Aug 29, 2018 at 5:52 AM Andrew Dalke wrote:
>
>> Hi all,
>>
>>   I am starting to put together materials for the Python/RDKit training
>> course I'm giving just before the RDKit UGM next month.
>>
>> I would like to structure part of it around the SQLite release of the
>> ChEMBL data set. More specifically, I plan to include examples of machine
>> learning with scikit-learn, using RDKit descriptors and values from ChEMBL
>> 24 (and making sure to use the new schema).
>>
>> Two problems. First, I'm not a computational chemist and I don't know
>> what would constitute a good example to use. "Good" in this case means one
>> whose outlines are well-known to likely students. Second, I don't have much
>> experience with the ChEMBL data.
>>
>> My thought is to make a logP model. The easiest would be to base it on
>> atom types. For this option, can anyone suggest where I can find logP data
>> from ChEMBL?
>>
>> Another possibility is to use a pre-existing model, like the notebook
>> George Papadatos did for Ligand-based Target Prediction at
>> http://nbviewer.jupyter.org/gist/madgpap/10457778 .
>>
>> Perhaps someone here could point me to other existing resources along
>> similar lines?
>>
>> Best regards,
>>
>> Andrew
>> da...@dalkescientific.com


Re: [Rdkit-discuss] want advice for good teaching data set

2018-08-29 Thread TJ O'Donnell
Hi Andrew
ChEMBL 24 has compound properties in the table compound_properties.  I
think the alogp is computed using (Crippen) atom types and the acd_logp
uses ACD Labs methods.
TJ

On Wed, Aug 29, 2018 at 5:52 AM Andrew Dalke wrote:

> Hi all,
>
>   I am starting to put together materials for the Python/RDKit training
> course I'm giving just before the RDKit UGM next month.
>
> I would like to structure part of it around the SQLite release of the
> ChEMBL data set. More specifically, I plan to include examples of machine
> learning with scikit-learn, using RDKit descriptors and values from ChEMBL
> 24 (and making sure to use the new schema).
>
> Two problems. First, I'm not a computational chemist and I don't know what
> would constitute a good example to use. "Good" in this case means one whose
> outlines are well-known to likely students. Second, I don't have much
> experience with the ChEMBL data.
>
> My thought is to make a logP model. The easiest would be to base it on
> atom types. For this option, can anyone suggest where I can find logP data
> from ChEMBL?
>
> Another possibility is to use a pre-existing model, like the notebook
> George Papadatos did for Ligand-based Target Prediction at
> http://nbviewer.jupyter.org/gist/madgpap/10457778 .
>
> Perhaps someone here could point me to other existing resources along
> similar lines?
>
> Best regards,
>
> Andrew
> da...@dalkescientific.com