Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance
Hi Shojiro,

This might not be the most elegant or efficient way, but it worked for what I wanted to do. I changed my apply function as below:

    allDescp = [name[0] for name in Descriptors._descList]
    # compute one descriptor per DataFrame column so the column names are kept
    for name in allDescp:
        temp = MoleculeDescriptors.MolecularDescriptorCalculator([name])
        df[name] = df['SMILES'].apply(lambda x: temp.CalcDescriptors(Chem.MolFromSmiles(x))[0])

    y = df['Target'].values
    # keep X as a DataFrame holding only the descriptor columns,
    # so that X.columns is still available
    X = df.drop(columns=['SMILES', 'Target'])
    features_list = X.columns.values

    model = RandomForestRegressor(n_estimators=1000, random_state=42)
    model.fit(X, y)

    feature_importance = model.feature_importances_
    # importances are fractions that sum to 1, so the threshold must be below 1
    threshold = 0.05
    important_index = np.where(feature_importance > threshold)[0]
    important_features = features_list[important_index]

Ali

On Tue, Aug 21, 2018 at 6:51 AM Ali Eftekhari wrote:
> hi Shojiro,
>
> Thanks for your response but print
> (np.argsort(rfregress.feature_importances_)[::-1]) returns the row indices
> but what I want is the column names so it can give me information which
> features are important.

--
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
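The thresholding step in the message above can be sketched without RDKit or a fitted model. This is only an illustration: `select_important` is a hypothetical helper, and the names and importance values are made up to stand in for `features_list` and `model.feature_importances_`. It relies on the fact that scikit-learn normalizes `feature_importances_` so that they sum to 1, which means the threshold is a fraction.

```python
import numpy as np


def select_important(names, importances, threshold):
    """Return (name, importance) pairs above threshold, largest first.

    Hypothetical helper for illustration; not part of RDKit or scikit-learn.
    """
    names = np.asarray(names)
    importances = np.asarray(importances, dtype=float)
    mask = importances > threshold                 # boolean mask over the columns
    kept = np.argsort(importances[mask])[::-1]     # order survivors, descending
    return list(zip(names[mask][kept].tolist(),
                    importances[mask][kept].tolist()))


# Made-up stand-ins for features_list and model.feature_importances_.
features_list = ["MolWt", "MolLogP", "TPSA", "NumHDonors"]
feature_importance = [0.10, 0.55, 0.30, 0.05]

selected = select_important(features_list, feature_importance, threshold=0.08)
for name, score in selected:
    print(f"{name}: {score:.2f}")

# A threshold above 1 can never be exceeded, so nothing is selected.
print(select_important(features_list, feature_importance, threshold=5))
```

Returning the names together with their scores keeps the two from drifting apart when the result is sorted or filtered again later.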
Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance
Hi Shojiro,

Thanks for your response, but

    print(np.argsort(rfregress.feature_importances_)[::-1])

returns the row indices, whereas what I want is the column names, so that I can tell which features are important.

On Mon, Aug 20, 2018 at 9:31 PM Shojiro Shibayama wrote:
> Dear Ali,
>
> Please run first the following code, which may help you:
>
> ```python
> import numpy as np
> np.argsort(rfregress.feature_importances_)[::-1]
> ```
>
> The `argsort` will return the indexes of the important features in
> ascending order and [::-1] reverses the order.
> The indexes for feature importance must correspond to the order of
> variables (or the order in 'allDescp' of your code), so use these
> variables, you'll get the information that you want.
>
> Sincerely yours,
> Shojiro
Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance
Dear Ali,

Please first run the following code, which may help you:

```python
import numpy as np
np.argsort(rfregress.feature_importances_)[::-1]
```

`argsort` returns the indices that sort the importances in ascending order, and `[::-1]` reverses them, so the index of the most important feature comes first. These indices correspond to the order of the variables (that is, the order of `allDescp` in your code), so if you use them to index that list, you'll get the information that you want.

Sincerely yours,
Shojiro

On Tue, 21 Aug 2018 at 10:34, Ali Eftekhari wrote:
> Hello rdkit,
>
> [...]
>
> My problem is that I don't know how to get the meaningful
> feature_importance. The following will return the values of descriptors
> but there is no labels and so I don't know how to figure out which features
> are important.
>
> print (sorted (rfregress.feature_importances_))
>
> Thanks for your help!

--
The University of Tokyo
2nd year Ph.D. candidate
Shojiro Shibayama
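As a rough sketch of the mapping described in this reply, the indices from `np.argsort` can be used to index an array of names held in the same order as the columns of `X`. The names and importance values below are made up for illustration; in the thread they would be `allDescp` and `rfregress.feature_importances_`.

```python
import numpy as np

# Made-up descriptor names, in the same order as the columns of X
# (stands in for allDescp).
descriptor_names = np.array(["MolWt", "MolLogP", "TPSA", "NumHDonors"])

# Made-up values standing in for rfregress.feature_importances_.
importances = np.array([0.10, 0.55, 0.30, 0.05])

# Column indices ordered from most to least important.
order = np.argsort(importances)[::-1]

# Indexing both arrays with the same indices keeps names and scores aligned.
ranked_names = descriptor_names[order]
ranked_scores = importances[order]

for name, score in zip(ranked_names, ranked_scores):
    print(f"{name}: {score:.2f}")
```

The indices refer to columns (features), not rows (molecules): `feature_importances_` has one entry per feature, so its length matches the number of descriptor names.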
[Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance
Hello rdkit,

This might be trivial, but I am a beginner and don't know how to do it.

I am building a simple model to predict a target property. I have a pandas dataframe (df) whose columns are 'SMILES' and 'Target'.

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import Descriptors
    from rdkit.ML.Descriptors import MoleculeDescriptors
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # calculating the descriptors as below:
    allDescp = [name[0] for name in Descriptors._descList]
    calc = MoleculeDescriptors.MolecularDescriptorCalculator(allDescp)
    df['fp'] = df['SMILES'].apply(lambda x: calc.CalcDescriptors(Chem.MolFromSmiles(x)))

    # converting the fingerprint to numpy array
    y = df['Target'].values
    X = np.array(list(df['fp']))

    # preprocessing: fit the scaler on the training split only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    st = StandardScaler()
    X_train = st.fit_transform(X_train)
    X_test = st.transform(X_test)

    # random forest model
    rfregress = RandomForestRegressor(n_estimators=10)
    rfregress.fit(X_train, y_train)

My problem is that I don't know how to get a meaningful feature_importance. The following returns the importance values of the descriptors, but there are no labels, so I don't know how to figure out which features are important.

    print(sorted(rfregress.feature_importances_))

Thanks for your help!