Re: [scikit-learn] Request / Proposal: integrating IEEE paper in scikit-learn as "feature_selection.EFS / EFSCV" and cancer_benchmark datasets

2023-09-24 Thread Ulderico Santarelli
Starting with Efroymson's stepwise regression, the selection of relevant
regressors has a long history. Of course, Efroymson's case is an old and
simple one within a very wide set of more general problems, where the number
of variables and the missingness pattern make things very hard to tackle.
I had a look at the paper, which seems to me to be based on a wide review of
the literature and an in-depth focus on the main extant algorithms. I do
not consider myself an expert on the matter. However, the subject is so
important that, in view of the thorough analysis the authors performed, I
think this enterprise is worthwhile.
My best regards. Ulderico Santarelli.

On Sun, Sep 24, 2023 at 11:12 AM Dalibor Hrg wrote:

> Dear scikit-learn mailing list,
>
> similarly to the existing *feature_selection.RFE*
> <https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE>
> and *RFECV*, this is a request to openly discuss the *PROPOSAL* and
> requirements of *feature_selection.EFS and/or EFSCV*, which would stand
> for "Evolutionary Feature Selection", starting with 8 algorithms or methods
> to be used with scikit-learn estimators, just as published in the IEEE paper
> https://arxiv.org/abs/2303.10182. The authors of the paper agreed to
> help integrate it (in cc).
>
> *PROPOSAL*
> Implement/integrate the paper https://arxiv.org/abs/2303.10182 into
> scikit-learn:
>
> *1) CODE*
>
> - implementing *feature_selection.EFS and/or EFSCV* (a space for the
>   evolutionary computing community interested in feature selection)
>
> RFE is:
>
> feature_selection.*RFE*(estimator, *[, ...])
> <https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE>
>
> Feature ranking with recursive feature elimination.
>
> feature_selection.*RFECV*(estimator, *[, ...])
> <https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV>
>
> Recursive feature elimination with cross-validation to select features.
>
> The "EFS" could be:
>
> feature_selection.*EFS*(estimator, *[, ...])
>
> Feature ranking and feature elimination with *8 different algorithms,
> SFE, SFE-PSO*, etc. *<- new algorithms could be added and benchmarked with
> evolutionary computing, swarm, genetic, etc.*
>
> feature_selection.*EFSCV*(estimator, *[, ...])
>
> Feature elimination with cross-validation to select features.
>
> *2) DATASETS & CANCER BENCHMARK*
>
> - curating and integrating a fetch of the 40 *cancer_benchmark* datasets,
>   directly in scikit-learn or externally pullable somehow and maintained
>   (a space for contributing and expanding high-dimensional datasets on
>   cancer topics).
>
> fetch_cancer_benchmark(*[, ...])
>
> Loads 40 individual cancer-related high-dimensional datasets for
> benchmarking feature selection methods (classification).
>
> *3) TUTORIAL / WEBSITE*
>
> - writing a tutorial to replicate the IEEE paper results with
>   *feature_selection.EFS and/or EFSCV* on *cancer_benchmark (40 datasets)*
>
>
> I have identified the IEEE work https://arxiv.org/abs/2303.10182 as a very
> interesting and novel approach to working with high-dimensional datasets, as
> it reports small subsets of predictive features selected with SVM and KNN
> across 40 datasets. Replicability under BSD-3 and the high quality standards
> of scikit-learn could make benchmarking novel feature selection algorithms
> easier, in my first opinion. Since this is my very first contact with the
> IEEE paper authors and with the scikit-learn list, we would welcome some
> help or guidance on how the integration could work, and on whether there is
> any interest along these lines at all.
>
> Kind regards
> Dalibor Hrg
> https://www.linkedin.com/in/daliborhrg/
>
>
> On Sat, Sep 23, 2023 at 9:08 AM Alexandre Gramfort <
> alexandre.gramf...@inria.fr> wrote:
>
>> Dear Dalibor
>>
>> you should discuss this on the main scikit-learn mailing list.
>>
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> Alex
>>
>> On Fri, Sep 22, 2023 at 12:19 PM Dalibor Hrg 
>> wrote:
>>
>>> Dear sklearn feature_selection.RFE Team and IEEE Authors (in-cc),
>>>
>>> This
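
For context on the API that the proposed EFS / EFSCV would sit beside, here is a minimal sketch of how the existing RFE and RFECV selectors are used. The synthetic dataset and the choice of LogisticRegression are assumptions made for illustration only; nothing here implements the algorithms of the paper.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 200 samples, 10 features, 3 informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

estimator = LogisticRegression(max_iter=1000)

# RFE: repeatedly drop the weakest feature until 3 remain.
rfe = RFE(estimator, n_features_to_select=3, step=1).fit(X, y)
print("RFE mask:   ", rfe.support_)
print("RFE ranking:", rfe.ranking_)

# RFECV: let 5-fold cross-validation choose how many features to keep.
rfecv = RFECV(estimator, step=1, cv=5).fit(X, y)
print("RFECV kept", rfecv.n_features_, "features")

# Reduce the data to the columns selected by RFE.
X_small = rfe.transform(X)
print(X_small.shape)
```

A selector that follows the same fit / support_ / transform pattern can be dropped into a Pipeline, which is presumably what an EFS estimator would need to offer as well.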

Re: [scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE

2023-09-18 Thread Ulderico Santarelli
Of course. Here it is.

On Mon, Sep 18, 2023 at 6:10 PM Jaime Lopez wrote:

> Hi,
>
> Same error; maybe it is related to the database I got from GitHub
> (iris.xlsx). Could you share yours?
>
> [image: image.png]
>
> JL
>
>
>
> --
>
> *Jaime Lopez Carvajal*


iris.xlsx
Description: MS-Excel 2007 spreadsheet
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE

2023-09-18 Thread Ulderico Santarelli
In addition, *the distance I'm using is not a dogma*. It is meant to avoid
the "black holes syndrome" that would emerge using the sheer Newtonian
distance when by chance two points are too near. When the distance is 0,
exp(-|w-x|) would be 1, so it is set to 0. I also tried exp(-|w-x|^2), but
the changes are not significant.
Ulderico.
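
Read off the script posted in this thread, the index can be written out as below. This is a reconstruction from the code, not a formula taken from the attachment: the symbols F_k and T are labels introduced here, w and x are standardized sample points, and k runs over the p variables. Note that the script applies the exponential coordinate by coordinate rather than to a Euclidean distance.

```latex
F_k(w) = \sum_{x} \operatorname{sgn}(w_k - x_k)\, e^{-\lvert w_k - x_k \rvert},
\qquad
T(w) = \sqrt{\sum_{k=1}^{p} F_k(w)^2}
```

Because sgn(0) = 0, a coordinate at distance 0 contributes nothing, which is the "set to 0" rule described above. The alternative kernel mentioned replaces |w_k - x_k| with (w_k - x_k)^2 in the exponent.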

On Sun, Sep 17, 2023 at 6:14 PM Jaime Lopez wrote:

> Hi there,
>
> I got interested in your project, but I found this error from the
> beginning (see attached image).
> The work array cannot be reshaped to (1,4) because it has shape (2,1). Any
> suggestions?
>
> JL
>
> [image: image.png]
>
>
>
> --
>
> *Jaime Lopez Carvajal*


Re: [scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE

2023-09-18 Thread Ulderico Santarelli
*I think it better to send you the script in its entirety. I ran it just now
and it works.*
*As for work, it is:*
work
array([[ 5.63011247],
       [-2.31453939],
       [22.23122848],
       [15.37678101]])
np.shape(work)
(4, 1)

*My best regards.*
*Ulderico.*
_
import numpy as np
import pandas as pd

# raw strings keep the backslashes of the Windows paths literal
dataraw = pd.read_excel(r"C:\Pyth\iris.xlsx")
# standardize data --- dataraw is a DataFrame
# locate data in the DataFrame (keep just the quantitative variables)
datar = dataraw.iloc[:, 1:5]
means = datar.mean(axis=0)
stdev = datar.std(axis=0)
data = (datar - means) / stdev
# CENTRALITY INDEX
# cross join: one row per ordered pair of points (150 * 150 rows)
scalar = pd.merge(data, data, how='cross')
point1 = scalar.loc[:, 'sepal length _x':'petal width _x']
point2 = scalar.loc[:, 'sepal length _y':'petal width _y']
apoint1 = point1.to_numpy(dtype=float)
apoint2 = point2.to_numpy(dtype=float)
delta = apoint1 - apoint2
force = 0
if delta.any():
    force = np.exp(-abs(delta))
# sign is 0 where delta is 0, so coincident coordinates exert no force
sig = np.sign(delta)
sforce = sig * force
dsforce = pd.DataFrame(sforce)
# dsforce.to_excel(r'C:\Pyth\dsforce.xlsx')
arr = np.ones((150, 1))
sforcet = sforce.T
sum_force = np.zeros((1, 4))   # do not use empty arrays
start = 0
end = 150
for i in range(150):
    # columns start:end hold the forces on point i from every point
    s_forcet = sforcet[:, start:end]
    work = np.matmul(s_forcet, arr)   # shape (4, 1): one total per variable
    sum_force = np.concatenate((sum_force, work.reshape(1, 4)), axis=0)
    start = end
    end += 150
sumforce = sum_force[1:, :]   # drop the initial row of zeros
dsumforce = pd.DataFrame(sumforce)
dsumforce.to_excel(r'C:\Pyth\sumforce_sqc.xlsx')
sum_force_square = sumforce**2
ssT = np.ones((4, 1))
# norm of the total force acting on each point
T_w_ = np.sqrt(np.matmul(sum_force_square, ssT))
dT_w_ = pd.DataFrame(T_w_)
dT_w_.to_excel(r'C:\Pyth\T_w_.xlsx')
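
As a possible answer to the request for more efficient coding, the same quantities can be computed with NumPy broadcasting, without the cross merge or the loop. This is only a sketch under stated assumptions: the function name centrality_index is hypothetical, the input is taken to be a plain (n, p) numeric array rather than the Excel file, and random data stands in for the iris measurements. Working on an array also means the result does not depend on column labels.

```python
import numpy as np

def centrality_index(X):
    """Return (sumforce, T) for an (n, p) array of raw observations.

    Hypothetical helper: mirrors the posted script, but vectorized.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {X.shape}")
    # Standardize each column; ddof=1 matches the pandas default for .std()
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # delta[i, j, k] = Z[i, k] - Z[j, k] for every ordered pair of points
    delta = Z[:, None, :] - Z[None, :, :]
    # sign(0) == 0, so coincident coordinates contribute nothing
    sforce = np.sign(delta) * np.exp(-np.abs(delta))
    sumforce = sforce.sum(axis=1)              # shape (n, p)
    T = np.sqrt((sumforce ** 2).sum(axis=1))   # shape (n,)
    return sumforce, T

# Synthetic data for illustration only (assumption: 150 points, 4 variables)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
sumforce, T = centrality_index(X)
print(sumforce.shape, T.shape)
```

Here T is a flat array of length n, whereas T_w_ in the script has shape (n, 1). The intermediate array holds n * n * p values, which is fine for 150 points but grows quadratically with the sample size. A constant column (zero standard deviation) is not handled.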

On Sun, Sep 17, 2023 at 6:14 PM Jaime Lopez wrote:

> Hi there,
>
> I got interested in your project, but I found this error from the
> beginning (see attached image).
> The work array cannot be reshaped to (1,4) because it has shape (2,1). Any
> suggestions?
>
> JL
>
> [image: image.png]
>
>
>
> --
>
> *Jaime Lopez Carvajal*


Re: [scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE

2023-09-17 Thread Ulderico Santarelli
I'm going to have a look at this. Thank you for your comment.


On Sun, Sep 17, 2023 at 6:14 PM Jaime Lopez wrote:

> Hi there,
>
> I got interested in your project, but I found this error from the
> beginning (see attached image).
> The work array cannot be reshaped to (1,4) because it has shape (2,1). Any
> suggestions?
>
> JL
>
> [image: image.png]
>
>
>
> --
>
> *Jaime Lopez Carvajal*


[scikit-learn] CLUSTER ANALYSIS AND THE SEARCH OF A SAMPLE MODE

2023-09-14 Thread Ulderico Santarelli
*I am an old guy who started programming around the seventies of the
last century* with ASSEMBLER 360, then FORTRAN, PL1, APL, IBM APPLICATION
SYSTEM and, last, the marvelous SAS. Having heard about the
powerful, flexible, functionally complete "PYTHON UNIVERSE", encompassing an
advanced Object-Oriented Language and a very wide family of packages, I
decided to run an exercise on a problem I've been tackling since my
youth (have a look at the Bibliography). I succeeded in completing it in a
few days and I'm attaching my solution to the problem of finding the points
in a sample that are "central" in a surrounding topological neighborhood.
They are eligible as centroids for a Cluster Analysis after the aggregation
of "too near" points. The solution is based on the search for
potential wells in a suitable potential field, similar to the one all of us
studied in high school. Therefore, points that are too near one another may
be in the same potential well.
No more words, have a look at the attachment.
My coding is that of a beginner. I'm sure everybody would find more
efficient coding. As a comment: I started studying Python around May 15th
2023.
My best regards.
Ulderico Santarelli.


SAMPLE POINTS CENTRALITY INDEX.docx
Description: MS-Word 2007 document


[scikit-learn] (no subject)

2023-09-14 Thread Ulderico Santarelli
