Re: [scikit-learn] why the modification in the df-idf formula?

2024-05-29 Thread Sole Galli via scikit-learn
hka.com/) > > Staff Research Engineer at Lightning AI, https://lightning.ai > > On May 28, 2024 at 9:43 AM -0500, Sole Galli via scikit-learn > , wrote: > >> Hi guys, >> >> I'd like to understand why sklearn's implementation of tf-idf is different >> fro

[scikit-learn] why the modification in the df-idf formula?

2024-05-28 Thread Sole Galli via scikit-learn
Hi guys, I'd like to understand why sklearn's implementation of tf-idf is different from the standard textbook notation as described in the docs: https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting Do you have any reference that I could take a look at? I didn't
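
A minimal sketch of the difference being asked about, using the formulas given on the linked scikit-learn docs page: the textbook idf is ln(n / (1 + df)), while scikit-learn uses ln((1 + n) / (1 + df)) + 1 when smooth_idf=True and ln(n / df) + 1 when smooth_idf=False. The function names below are hypothetical, chosen for illustration only, and are not part of scikit-learn.

```python
import math


def textbook_idf(n_docs, doc_freq):
    # Textbook notation as quoted in the scikit-learn docs: ln(n / (1 + df))
    return math.log(n_docs / (1 + doc_freq))


def sklearn_style_idf(n_docs, doc_freq, smooth=True):
    # Variant described in the scikit-learn docs (hypothetical helper name).
    if smooth:
        # smooth_idf=True: behaves as if one extra document contained every term
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    # smooth_idf=False: plain ratio, still shifted by +1
    return math.log(n_docs / doc_freq) + 1


n_docs = 4
for doc_freq in (1, 2, 4):
    print(doc_freq, textbook_idf(n_docs, doc_freq), sklearn_style_idf(n_docs, doc_freq))
```

With the "+ 1" shift, a term that appears in every document keeps a weight of 1 instead of being zeroed out or pushed negative, which is the practical effect of the modification.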

[scikit-learn] target encoder: fit_transform vs fit.transform

2024-03-18 Thread Sole Galli via scikit-learn
Hey team, I am going over the TargetEncoder documentation and I want to make sure I understand this correctly. Is the intention of fit_transform's cross fitting just to understand, analyse, or somehow determine how this transformer would perform? Because if I got this right, the attribute values

[scikit-learn] obtaining intervals from the decision tree struture

2023-03-07 Thread Sole Galli via scikit-learn
Hello, I would like to obtain final intervals from the decision tree structure. I am not interested in every node, just the limits that take a sample to a final decision /leaf. For example, if the tree structure is this one: |--- feature_0 <= 0.08 | |--- class: 0 |--- feature_0 > 0.08 |

Re: [scikit-learn] mutual information for continuous variables with scikit-learn

2023-02-01 Thread Sole Galli via scikit-learn
Hey, My understanding is that with sklearn you can compare 2 continuous variables like this: mutual_info_regression(data["var1"].to_frame(), data["var"], discrete_features=[False]) Where var1 and var are continuous. You can also compare multiple continuous variables against one continuous
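
A small runnable sketch of the call described above, on synthetic data. The column names var1 and var follow the message; the noise column, the sample size, and the random seeds are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: "var" depends strongly on "var1" and not at all on "noise".
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = pd.DataFrame(
    {
        "var1": x,
        "noise": rng.normal(size=500),
        "var": x + 0.1 * rng.normal(size=500),
    }
)

# One continuous feature against a continuous target.
mi_single = mutual_info_regression(
    data["var1"].to_frame(), data["var"], discrete_features=[False], random_state=0
)

# Several continuous features against the same continuous target.
mi_multi = mutual_info_regression(
    data[["var1", "noise"]], data["var"], discrete_features=[False, False], random_state=0
)

print(mi_single)
print(mi_multi)
```

The estimate is based on nearest neighbours and involves some randomness, so fixing random_state keeps the numbers reproducible between runs.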

Re: [scikit-learn] methods available from last estimator in pipeline

2022-09-24 Thread Sole Galli via scikit-learn
Did you try: pipeline.named_steps["the_string_name_for_knn"].kneighbors ? Here, pipeline should be replaced by the name you gave to your pipeline, and the string in named_steps is the name you gave to the knn when setting up the pipe. Sole --- Original Message
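
A minimal sketch of that suggestion, assuming a two-step pipeline whose steps are named "scaler" and "knn" (both names, and the toy data, are made up for the example). Note that the scikit-learn method is spelled kneighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]])
y = np.array([0, 0, 1, 1])

# Step names are arbitrary; they are the keys used in named_steps.
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=2)),
    ]
)
pipe.fit(X, y)

# The knn step was fitted on scaled data, so the query must be scaled too
# before calling a method directly on the last estimator.
query = pipe.named_steps["scaler"].transform(X[:1])
distances, indices = pipe.named_steps["knn"].kneighbors(query)

print(indices)
```

Calling a method on the final step directly bypasses the earlier steps, which is why the query is passed through the scaler by hand here.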

Re: [scikit-learn] View full sized k_means.labels_

2022-05-29 Thread Sole Galli via scikit-learn
Maybe with numpy.set_printoptions? See thread here: https://stackoverflow.com/questions/1987694/how-to-print-the-full-numpy-array-without-truncation Soledad Galli https://www.trainindata.com/ --- Original Message --- On Friday, May 13th, 2022 at
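
A short sketch of the print-options approach. The labels array is a stand-in for k_means.labels_ (an assumption for the example); np.printoptions is the context-manager form, and numpy.set_printoptions(threshold=sys.maxsize) applies the same setting globally.

```python
import sys

import numpy as np

# Stand-in for k_means.labels_: a long 1-D integer array.
labels = np.arange(2000) % 3

# With default print options, arrays above the size threshold are summarised with "...".
truncated = np.array2string(labels)

# Raising the threshold makes NumPy render every element.
with np.printoptions(threshold=sys.maxsize):
    full = np.array2string(labels)

print(len(truncated), len(full))
```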

[scikit-learn] intermediate data state in a Pipeline

2022-04-11 Thread Sole Galli via scikit-learn
Hello community, Say I have a pipeline with 3 data transformations, i.e., SimpleImputer, OrdinalEncoder and StandardScaler, and a Lasso at the end. And I want to obtain a copy of the transformed data that would be input to the Lasso. Is there a way other than selecting all the steps of the

Re: [scikit-learn] random forests and multil-class probability

2021-07-27 Thread Sole Galli via scikit-learn
Nicolas > > On 27/07/2021 10:22, Guillaume Lemaître wrote: > >>> On 27 Jul 2021, at 11:08, Sole Galli via scikit-learn >>> [](mailto:scikit-learn@python.org) >>> wrote: >>> >>> Hello community, >>> >>> Do I understand correctl

Re: [scikit-learn] random forests and multil-class probability

2021-07-27 Thread Sole Galli via scikit-learn
: > > On 27 Jul 2021, at 11:08, Sole Galli via scikit-learn > > scikit-learn@python.org wrote: > > > > Hello community, > > > > Do I understand correctly that Random Forests are trained as a 1 vs rest > > when the target has more than 2 classes? S

[scikit-learn] random forests and multil-class probability

2021-07-27 Thread Sole Galli via scikit-learn
Hello community, Do I understand correctly that Random Forests are trained as one-vs-rest when the target has more than 2 classes? Say the target takes values 0, 1 and 2; would the model then train 3 estimators, 1 per class, under the hood? The predict_proba output is an array with 3 columns,

Re: [scikit-learn] function transformer

2021-06-21 Thread Sole Galli via scikit-learn
The FunctionTransformer will apply the transformation coded in your function to the entire dataset passed to the transform() method. I find it hard to see how this could work to add additional columns to the dataset, but I guess it might depend on how you designed your function. Did you try

Re: [scikit-learn] check_estimator _NotAnArray

2021-05-12 Thread Sole Galli via scikit-learn
of why each test is important, and the consequences of failing this or that test. At least it would be useful for me :p Thank you! Sole ‐‐‐ Original Message ‐‐‐ On Monday, May 10, 2021 3:28 PM, Sole Galli via scikit-learn wrote: > Hello everyone, > > I am trying to get Featu

[scikit-learn] check_estimator _NotAnArray

2021-05-10 Thread Sole Galli via scikit-learn
Hello everyone, I am trying to get Feature-engine transformers to pass the check_estimator tests, and there is one test whose purpose I am not too sure about. The transformers fail check_transformer_data_not_an_array because the input is a _NotAnArray class, and Feature-engine

[scikit-learn] IterativeImputer

2021-01-04 Thread Sole Galli via scikit-learn
Hello team, I am reading in some of the original MICE articles that, supposedly, each variable should be modelled upon the other ones in the data, with a suitable model. So for example, if the variable with NA is binary, it should be modelled with classification, or if continuous with a

Re: [scikit-learn] sample_weight vs class_weight

2020-12-05 Thread Sole Galli via scikit-learn
gnostic. Arguably, allowing a dict with actual class values violates >> the above argument (of not having data-related stuff in init), so I guess >> that's where the logic ends ;) >> >> As to why one would use both, I'm not so sure honestly. >> >> Nico

Re: [scikit-learn] sample_weight vs class_weight

2020-12-04 Thread Sole Galli via scikit-learn
https://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work/30982811#30982811 Soledad Galli https://www.trainindata.com/ ‐‐‐ Original Message ‐‐‐ On Thursday, December 3, 2020 11:55 AM, Sole Galli via scikit-learn wrote: > Hello team, >

[scikit-learn] sample_weight vs class_weight

2020-12-03 Thread Sole Galli via scikit-learn
Hello team, What is the difference in the implementation of class_weight and sample_weight in those algorithms that support both, like random forest or logistic regression? Are both modifying the loss function in a similar way? Thank you! Sole

Re: [scikit-learn] imbalanced datasets return uncalibrated predictions - why?

2020-11-18 Thread Sole Galli via scikit-learn
Thank you guys, that was actually very helpful. Best regards Sole Soledad Galli https://www.trainindata.com/ ‐‐‐ Original Message ‐‐‐ On Tuesday, November 17th, 2020 at 10:54 AM, Roman Yurchak wrote: > On 17/11/2020 09:57, Sole Galli via scikit-learn wrote: > > > And

[scikit-learn] imbalanced datasets return uncalibrated predictions - why?

2020-11-17 Thread Sole Galli via scikit-learn
Hello team, I am trying to understand why logistic regression returns uncalibrated probabilities, with values tending towards low probabilities for the positive (rare) cases, when trained on an imbalanced dataset. I've read a number of articles, all seem to agree that this is the case, many

Re: [scikit-learn] Imputers and DataFrame objects

2020-08-19 Thread Sole Galli via scikit-learn
Did you have a look at the package feature-engine? It has its own imputers and encoders that allow you to select the columns to transform and return a dataframe. It also has a sklearn wrapper that wraps sklearn transformers so that they return a dataframe instead of a numpy array. Cheers.

Re: [scikit-learn] climate friendly software licence

2020-06-30 Thread Sole Galli via scikit-learn
Hi Olivier, Gabriel, and further team, Thank you so much for your views. I understand enforcement is an issue. And I don't yet have an answer on whether and how the license could be enforced. I also think that this is a second step. First would be making the use of the software illegal. This would

[scikit-learn] climate friendly software licence

2020-06-29 Thread Sole Galli via scikit-learn
Hello Scikit-learn team, I've come across this: https://twitter.com/tristanharris/status/1277136696568508418?s=12 Basically, it is an initiative to include in software licenses a prohibition of use by fossil fuel extractivist companies. I would like to know your views on this. Is this something