Re: [scikit-learn] Request / Proposal: integrating IEEE paper in scikit-learn as "feature_selection.EFS / EFSCV" and cancer_benchmark datasets
Starting with Efroymson's stepwise regression, the selection of relevant regressors has a long history. Of course, Efroymson's case is an old and simple one within a very wide set of more general problems, where the number of variables and the missingness pattern make things very hard to tackle. I had a look at the paper, which seems to me to be based on a wide review of the literature and an in-depth focus on the main extant algorithms. I do not consider myself an expert on the matter. However, the subject is so important that, in view of the thorough analysis the authors performed, I think this enterprise is worthwhile.

My best regards.
Ulderico Santarelli.

On Sun, 24 Sep 2023 at 11:12, Dalibor Hrg wrote:

> Dear scikit-learn mailing list,
>
> similarly to the existing feature_selection.RFE and RFECV
> <https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE>,
> this is a request to openly discuss the *PROPOSAL* and requirements of
> *feature_selection.EFS and/or EFSCV*, which would stand for "Evolutionary
> Feature Selection", starting with 8 algorithms or methods to be used with
> scikit-learn estimators, just as published in IEEE
> https://arxiv.org/abs/2303.10182 by the authors of the paper. They agreed to
> help integrate it (in cc).
>
> *PROPOSAL*
> Implement/integrate the https://arxiv.org/abs/2303.10182 paper into
> scikit-learn:
>
> *1) CODE*
>
> - implementing *feature_selection.EFS and/or EFSCV* (a space for the
>   evolutionary computing community interested in feature selection)
>
> RFE is:
>
> feature_selection.RFE(estimator, *[, ...])
> <https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE>
>
>     Feature ranking with recursive feature elimination.
>
> feature_selection.RFECV(estimator, *[, ...])
> <https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV>
>
>     Recursive feature elimination with cross-validation to select features.
>
> The "EFS" could be:
>
> feature_selection.EFS(estimator, *[, ...])
>
>     Feature ranking and feature elimination with *8 different algorithms,
>     SFE, SFE-PSO* etc. <- new algorithms (evolutionary computing, swarm,
>     genetic, etc.) could be added and benchmarked
>
> feature_selection.EFSCV(estimator, *[, ...])
>
>     Feature elimination with cross-validation to select features.
>
> *2) DATASETS & CANCER BENCHMARK*
>
> - curating and integrating a fetch of the 40 *cancer_benchmark* datasets,
>   directly in scikit-learn or externally pullable somehow and maintained
>   (a space for contributing and expanding high-dimensional datasets on
>   cancer topics).
>
> fetch_cancer-benchmark(*[, ...])
>
>     Loads 40 individual cancer-related high-dimensional datasets for
>     benchmarking feature selection methods (classification).
>
> *3) TUTORIAL / WEBSITE*
>
> - writing a tutorial to replicate the IEEE paper results with
>   *feature_selection.EFS and/or EFSCV* on *cancer_benchmark (40 datasets)*
>
> I find the IEEE work https://arxiv.org/abs/2303.10182 notably novel for
> working with high-dimensional datasets, as it reports small subsets of
> predictive features selected with SVM and KNN across 40 datasets.
> Replicability under BSD-3 and the quality standards of scikit-learn could
> make benchmarking novel feature selection algorithms easier, in my first
> opinion. Since this is my very first contact with both the IEEE paper
> authors and the scikit-learn list, we would welcome some help or guidance
> on how integration could work, and on whether there is any interest in
> this direction at all.
>
> Kind regards
> Dalibor Hrg
> https://www.linkedin.com/in/daliborhrg/
>
> On Sat, Sep 23, 2023 at 9:08 AM Alexandre Gramfort <
> alexandre.gramf...@inria.fr> wrote:
>
>> Dear Dalibor
>>
>> you should discuss this on the main scikit-learn mailing list.
>>
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> Alex
>>
>> On Fri, Sep 22, 2023 at 12:19 PM Dalibor Hrg wrote:
>>
>>> Dear sklearn feature_selection.RFE Team and IEEE Authors (in-cc),
>>>
>>> This
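For readers comparing the proposal with what exists today, the RFE and RFECV selectors that the proposed EFS/EFSCV would sit beside can be used roughly as sketched below. The synthetic data from make_classification, the linear SVC and all parameter values are assumptions chosen for illustration, not taken from the paper or from this thread. EFS and EFSCV do not exist in scikit-learn and are therefore not shown.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVC

# Synthetic stand-in for a high-dimensional dataset (assumed for illustration).
X, y = make_classification(
    n_samples=120, n_features=30, n_informative=5, random_state=0
)

# RFE needs an estimator that exposes coef_ or feature_importances_.
estimator = SVC(kernel="linear")

# Keep a fixed number of features, dropping the weakest one per iteration.
rfe = RFE(estimator, n_features_to_select=5, step=1).fit(X, y)
print("RFE kept columns:", rfe.get_support(indices=True))

# RFECV chooses the number of features by cross-validation instead.
rfecv = RFECV(estimator, step=1, cv=5).fit(X, y)
print("RFECV kept", rfecv.n_features_, "features")
```

Both selectors follow the usual fit/transform convention, which is presumably the interface a new selector in feature_selection would be expected to follow as well.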
Re: [scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE
Of course. Here it is.

On Mon, 18 Sep 2023 at 18:10, Jaime Lopez wrote:

> Hi,
>
> Same error. Maybe it is related to the dataset I got from GitHub
> (iris.xlsx); could you share yours?
>
> [image: image.png]
>
> JL

iris.xlsx
Description: MS-Excel 2007 spreadsheet
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
Re: [scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE
In addition, *the distance I'm using is not a dogma*. It is meant to avoid the "black hole syndrome" that would emerge with the plain Newtonian distance when, by chance, two points are too near. When the distance is 0, exp(-|w-x|) would be 1, so it is set to 0. I also tried exp(-|w-x|^2), but the changes are not significant.

Ulderico.

On Sun, 17 Sep 2023 at 18:14, Jaime Lopez wrote:

> Hi there,
>
> I got interested in your project, but I ran into this error right at the
> beginning (see attached image).
> The work array cannot be reshaped to (1, 4) because it has shape (2, 1).
> Any suggestions?
>
> JL
>
> [image: image.png]
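The force law described in this message can be sketched in a few lines of NumPy. The function names signed_force and signed_force_squared are hypothetical, introduced only for illustration, and the second one is just one reading of the exp(-|w-x|^2) variant mentioned above. The final comparison assumes that "Newtonian" refers to an inverse-square law, which the message does not spell out.

```python
import numpy as np

def signed_force(delta):
    """Signed force sign(d) * exp(-|d|); exactly 0 where d == 0."""
    delta = np.asarray(delta, dtype=float)
    # np.sign(0) is 0, which cancels the exp(0) == 1 term at zero distance.
    return np.sign(delta) * np.exp(-np.abs(delta))

def signed_force_squared(delta):
    """Variant with a squared exponent: sign(d) * exp(-d**2)."""
    delta = np.asarray(delta, dtype=float)
    return np.sign(delta) * np.exp(-delta**2)

d = np.array([-2.0, -0.01, 0.0, 0.01, 2.0])
print(signed_force(d))
print(signed_force_squared(d))
# An inverse-square law, by contrast, grows without bound as d -> 0.
print(1.0 / d[d != 0] ** 2)
```

Both exponential variants stay bounded by 1 in absolute value, so two nearly coincident points cannot dominate the sum of forces.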
Re: [scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE
I think it is better to send you the script in its entirety. I ran it just now and it works. As for work, it is:

work
array([[ 5.63011247],
       [-2.31453939],
       [22.23122848],
       [15.37678101]])
np.shape(work)
(4, 1)

My best regards.
Ulderico.

_
import numpy as np
import pandas as pd

# raw strings keep the backslashes in the Windows paths literal
dataraw = pd.read_excel(r"C:\Pyth\iris.xlsx")
# standardize data --- dataraw is a DataFrame
# locate data in the DataFrame; keep just quantitative variables
datar = dataraw.iloc[:, 1:5]
means = datar.mean(axis=0)
stdev = datar.std(axis=0)
data = (datar - means) / stdev

# CENTRALITY INDEX
# cross join: one row per ordered pair of points (150 * 150 rows)
scalar = pd.merge(data, data, how='cross')
point1 = scalar.loc[:, 'sepal length _x':'petal width _x']
point2 = scalar.loc[:, 'sepal length _y':'petal width _y']
apoint1 = point1.to_numpy(dtype=float)
apoint2 = point2.to_numpy(dtype=float)
delta = apoint1 - apoint2
force = 0
if delta.any() != 0:
    force = np.exp(-abs(delta))
# sign(0) is 0, so coincident coordinates contribute no force
sig = np.sign(delta)
sforce = sig * force
dsforce = pd.DataFrame(sforce)
# dsforce.to_excel(r'C:\Pyth\dsforce.xlsx')
arr = np.ones((150, 1))
sforcet = sforce.T
sum_force = np.zeros((1, 4))  # do not use empty arrays
start = 0
end = 150
for i in range(150):
    # columns start:end hold the 150 pairs whose first point is i
    s_forcet = sforcet[:, start:end]
    work = np.matmul(s_forcet, arr)  # net force on point i, shape (4, 1)
    sum_force = np.concatenate((sum_force, work.reshape(1, 4)), axis=0)
    start = end
    end += 150
sumforce = sum_force[1:, :]  # drop the initial row of zeros
dsumforce = pd.DataFrame(sumforce)
dsumforce.to_excel(r'C:\Pyth\sumforce_sqc.xlsx')
sum_force_square = sumforce**2
ssT = np.ones((4, 1))
# norm of the net force on each point
T_w_ = np.sqrt(np.matmul(sum_force_square, ssT))
dT_w_ = pd.DataFrame(T_w_)
dT_w_.to_excel(r'C:\Pyth\T_w_.xlsx')

On Sun, 17 Sep 2023 at 18:14, Jaime Lopez wrote:

> Hi there,
>
> I got interested in your project, but I ran into this error right at the
> beginning (see attached image).
> The work array cannot be reshaped to (1, 4) because it has shape (2, 1).
> Any suggestions?
>
> JL
>
> [image: image.png]
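The same computation can be written without the cross join and the explicit loop by using NumPy broadcasting. The sketch below is an alternative formulation, not the author's code: the function name centrality_index and its standardize parameter are hypothetical, and the three-point input is made up for the example. It does not depend on the hard-coded sizes 150 and 4, so it accepts any (n_samples, n_features) array; whether those hard-coded sizes are the cause of the reshape error reported in the thread is not established here.

```python
import numpy as np

def centrality_index(X, standardize=True):
    """Norm of the net signed force acting on each sample point."""
    X = np.asarray(X, dtype=float)
    if standardize:
        # ddof=1 matches the pandas default used by DataFrame.std().
        X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # delta[i, j, k] = X[i, k] - X[j, k] for every ordered pair (i, j).
    delta = X[:, None, :] - X[None, :, :]
    force = np.sign(delta) * np.exp(-np.abs(delta))
    net = force.sum(axis=1)                # shape (n_samples, n_features)
    return np.sqrt((net**2).sum(axis=1))   # shape (n_samples,)

# Three collinear points: the forces on the middle one cancel.
X = np.array([[-1.0], [0.0], [1.0]])
T = centrality_index(X)
print(T)
```

Like the original script, this builds all n * n pairs in memory, so it is meant for samples of modest size such as the 150-row iris data.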
Re: [scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE
I'm going to have a look at this. Thank you for your comment.

On Sun, 17 Sep 2023 at 18:14, Jaime Lopez wrote:

> Hi there,
>
> I got interested in your project, but I ran into this error right at the
> beginning (see attached image).
> The work array cannot be reshaped to (1, 4) because it has shape (2, 1).
> Any suggestions?
>
> JL
>
> [image: image.png]
[scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE
I am an old guy who started programming around the seventies of the last century with ASSEMBLER 360, then FORTRAN, PL1, APL, IBM APPLICATION SYSTEM and, last, the marvelous SAS. Having heard about the powerful, flexible, functionally complete "PYTHON UNIVERSE", encompassing an advanced object-oriented language and a very wide family of packages, I decided to run an exercise on a problem I have been tackling since my youth (have a look at the Bibliography). I succeeded in completing it in a few days, and I'm attaching my solution to the problem of finding the points in a sample that are "central" in a surrounding topological neighborhood. They are eligible as centroids for a Cluster Analysis after the aggregation of "too near" points. The solution is based on the search for potential wells in a suitable potential field, similar to the one all of us studied in high school. Therefore, points that are too near may be in the same potential well.

No more words; have a look at the attachment.

My coding is that of a beginner. I'm sure everybody would find more efficient coding. As a comment: I started studying Python around May 15th, 2023.

My best regards.
Ulderico Santarelli.

SAMPLE POINTS CENTRALITY INDEX.docx
Description: MS-Word 2007 document
[scikit-learn] (no subject)