Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance

2018-08-22 Thread Ali Eftekhari
Hi Shojiro,

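This might not be the most elegant or efficient way, but it worked for what
I wanted to do.  I changed my apply function as below:

import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
from sklearn.ensemble import RandomForestRegressor

allDescp = [name[0] for name in Descriptors._descList]
for name in allDescp:
    # one calculator per descriptor, so each descriptor gets its own named column
    temp = MoleculeDescriptors.MolecularDescriptorCalculator([name])
    df[name] = df['SMILES'].apply(
        lambda x: temp.CalcDescriptors(Chem.MolFromSmiles(x))[0])

y = df['Target'].values
# keep X as a DataFrame (not a numpy array) so the column names survive
X = df.drop(columns=['SMILES', 'Target'])

features_list = X.columns.values

model = RandomForestRegressor(n_estimators=1000, random_state=42)
model.fit(X, y)
feature_importance = model.feature_importances_

# importances sum to 1, so the cut-off has to be a fraction (here 5%)
threshold = 0.05
important_index = np.where(feature_importance > threshold)[0]
important_features = features_list[important_index]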
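
The per-descriptor loop above parses every SMILES once for each descriptor. A possible alternative, sketched here under assumptions, is to compute the whole descriptor tuple once per molecule and expand it into named columns. The helper `descriptor_frame`, its `compute` argument, and the column names `n_chars` and `n_O` are hypothetical; in the setting of this thread `compute` would be something like `lambda s: calc.CalcDescriptors(Chem.MolFromSmiles(s))`, with `calc` built from `allDescp` and `names` set to `allDescp`. The stand-in function below exists only so that the sketch runs without RDKit.

```python
import pandas as pd

def descriptor_frame(smiles, names, compute):
    """Build one named column per descriptor in a single pass over the SMILES.

    `smiles` is a pandas Series; `compute` maps a SMILES string to a tuple
    of values ordered like `names`.
    """
    rows = [compute(s) for s in smiles]
    return pd.DataFrame(rows, columns=list(names), index=smiles.index)

# Stand-in for the RDKit calculator (NOT real descriptors); it only makes
# the sketch runnable without RDKit installed.
def fake_compute(s):
    return (len(s), s.count("O"))

df = pd.DataFrame({"SMILES": ["CCO", "CC(=O)O"], "Target": [1.0, 2.0]})
X = descriptor_frame(df["SMILES"], ["n_chars", "n_O"], fake_compute)
print(X.columns.tolist())
```

Because `X` keeps its column names, the order of `X.columns` is the order in which a model fitted on `X` reports its feature importances.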

Ali

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance

2018-08-21 Thread Ali Eftekhari
Hi Shojiro,

Thanks for your response, but
print(np.argsort(model.feature_importances_)[::-1]) returns integer indices,
whereas what I want is the column names, so that I can tell which features
are important.



Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance

2018-08-20 Thread Shojiro Shibayama
Dear Ali,

Please run the following code first; it may help you:

```python
import numpy as np
np.argsort(model.feature_importances_)[::-1]
```

`argsort` returns the indices that would sort the importances in
ascending order, and `[::-1]` reverses that order so that the most important
feature comes first.
The indices correspond to the order of the variables (that is, the order in
'allDescp' in your code), so if you use them to index those variable names,
you'll get the information that you want.
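
As a minimal sketch of that last step, the sorted indices can be used to index an array of names directly. The four names and the importance values below are made-up stand-ins chosen for illustration; in the thread they would be `allDescp` and the fitted forest's `feature_importances_`, which must have the same length and order.

```python
import numpy as np

# Hypothetical stand-ins for the descriptor names and their importances.
names = np.array(["MolWt", "MolLogP", "TPSA", "NumHDonors"])
importances = np.array([0.10, 0.55, 0.30, 0.05])

# argsort gives ascending order; [::-1] puts the most important first.
order = np.argsort(importances)[::-1]

# Index both arrays with the same order to pair each name with its score.
ranked = list(zip(names[order], importances[order]))
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Converting the name list to a numpy array is what allows indexing with `order`; with a plain Python list, `[names[i] for i in order]` does the same job.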

Sincerely yours,
Shojiro




-- 

The University of Tokyo
2nd year Ph.D. candidate
  Shojiro Shibayama



[Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance

2018-08-20 Thread Ali Eftekhari
Hello rdkit,

This might be trivial, but I am a beginner and don't know how to do it.

I am building a simple model to predict a target property.  I have a pandas
DataFrame (df) whose columns are 'SMILES' and 'Target'.

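import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# calculating the descriptors as below:
allDescp = [name[0] for name in Descriptors._descList]
calc = MoleculeDescriptors.MolecularDescriptorCalculator(allDescp)
df['fp'] = df['SMILES'].apply(
    lambda x: calc.CalcDescriptors(Chem.MolFromSmiles(x)))

# converting the descriptor tuples (the 'fp' column) to a numpy array
y = df['Target'].values
X = np.array(list(df['fp']))

# preprocessing: split first, then fit the scaler on the training split only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
st = StandardScaler()
X_train = st.fit_transform(X_train)
X_test = st.transform(X_test)

# random forest model
model = RandomForestRegressor(n_estimators=10)
model.fit(X_train, y_train)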

My problem is that I don't know how to get meaningful feature importances.
The following returns the importance values, but there are no labels, so I
can't tell which features are important.

print(sorted(model.feature_importances_))

Thanks for your help!