Re: [scikit-learn] What is the ECCN (Export Control Classification Number) number & COO (Country of Origin) of scikit-learn

2022-01-10 Thread Roman Yurchak
Hi Anup, as far as I know, scikit-learn is not export controlled. It has no components that belong to that classification. You can check for yourself using the lists linked, for instance, from this blog post https://www.magicsplat.com/blog/ear/index.html to determine it. Though I'm not

Re: [scikit-learn] New core dev: Julien Jerphanion

2021-10-30 Thread Roman Yurchak
Congratulations, Julien, and thanks for all your work! Roman On 30/10/2021 11:18, Guillaume Lemaître wrote: The scikit-learn core development team has welcomed a new member, Julien Jerphanion, who has contributed code, reviews, and documentation since this March (aside from occasional

Re: [scikit-learn] sklearn-porter support

2021-05-04 Thread Roman Yurchak
Hi Joe, sklearn-porter is a nice project, however people on this mailing list are not really involved with its development. You would likely get more relevant answers to your questions by asking the author directly, for instance in a Github issue. I'm sure they would appreciate an offer to

Re: [scikit-learn] [ANN] scikit-learn 0.24.2 is online!

2021-04-28 Thread Roman Yurchak
Thanks for making this bug fix release happen! -- Roman On 28/04/2021 19:23, Guillaume Lemaître wrote: scikit-learn 0.24.2 is out on pypi.org and conda-forge! This is a small maintenance release that fixes a couple of regressions:

Re: [scikit-learn] imbalanced datasets return uncalibrated predictions - why?

2020-11-17 Thread Roman Yurchak
On 17/11/2020 09:57, Sole Galli via scikit-learn wrote: And I understand that it has to do with the cost function, because if we re-balance the dataset with, say, class_weight='balanced', then the probabilities seem to be calibrated as a result. As far as I know, logistic regression will have

Re: [scikit-learn] Voting software

2020-04-28 Thread Roman Yurchak
*From:* scikit-learn on behalf of Roman Yurchak *Sent:* Monday, April 27, 2020 10:30:49 AM *To:* Scikit-learn user and developer mailing list *Subject:* Re: [scikit-learn] Voting software BTW, could we use some online voting so

Re: [scikit-learn] Voting software

2020-04-27 Thread Roman Yurchak
BTW, could we use some online voting software for votes? Just to avoid filling public email threads with +1s. For instance, CPython uses https://www.python.org/dev/peps/pep-8001/ but it is anonymous. Does anyone know a simple non-anonymous one, preferably linked to GitHub authentication? On

Re: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee

2020-04-27 Thread Roman Yurchak
+1 On 27/04/2020 15:20, Jeremie du Boisberranger wrote: +1 On 27/04/2020 15:18, Nicolas Hug wrote: +1 On 4/27/20 9:16 AM, Gael Varoquaux wrote: +1 And thank you very much Adrin! On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: Hi All. Given all his recent contributions, I

Re: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team

2020-03-27 Thread Roman Yurchak
Very interesting! A few comments, > From GH17, we managed to extract only 10.5k pipelines. The relatively low frequency (with respect to the number of notebooks using SCIKIT-LEARN [..]) indicates a non-wide adoption of this specification. However, the number of pipelines in the GH19 corpus

Re: [scikit-learn] transfer learning doubt

2020-03-19 Thread Roman Yurchak
On 19/03/2020 14:19, Farzana Anowar wrote: > Another option is to use deep learning and store the weights for the first model and initialize the second model with that weight and keep doing it for the rest of the models. This can also be done in scikit-learn with models that support

Re: [scikit-learn] Update of 'Upcoming events' on the scikit-learn wiki

2019-12-06 Thread Roman Yurchak
Thank you, Chiara! I think announcing some of the main planned sprints on the mailing list and Twitter would be helpful. At the last sprint (in London), contributors wanted to know how they could find out when the next sprints would happen, and we didn't have a clear answer then (short of

Re: [scikit-learn] Vote on SLEP010: n_features_in_ attribute

2019-12-06 Thread Roman Yurchak
On 04/12/2019 20:44, Joel Nothman wrote: I am +1 for this, but I think we should look at how to make these new validation methods usable by external developers +1 for the SLEP and for finding a way to make this method usable by external developers maybe as part of the developer API.

Re: [scikit-learn] Monthly meetings

2019-11-13 Thread Roman Yurchak
Thanks for the reminder! Is there a way to put these periodic meetings in a calendar (either in some shared calendar or as calendar invitations for people who are likely to participate/were there last time) ? Cheers, Roman On 13/11/2019 23:14, Nicolas Hug wrote: Hey everyone, The next

Re: [scikit-learn] scikit-learn twitter account

2019-11-05 Thread Roman Yurchak
Maybe re-purposing? I'm not sure people find the current approach of one tweet per PR useful. It would make things less confusing to have one account. Looking at how other OSS projects do this would also be interesting. On 05/11/2019 06:14, Andreas Mueller wrote: Should we re-purpose the existing

Re: [scikit-learn] logistic regression results are not stable between solvers

2019-10-09 Thread Roman Yurchak
Ben, I can confirm your results with penalty='none' and C=1e9. In both cases, you are running a mostly unpenalized logistic regression. Usually that's less numerically stable than with a small regularization, depending on the data collinearity. Running that same code with - larger penalty
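The point about weak regularization can be illustrated with a small sketch. This is not the code from the thread: the synthetic dataset, the choice of the lbfgs and newton-cg solvers, and the tolerance settings are all assumptions made for illustration. It measures how far apart the coefficients found by two solvers are for a moderate penalty (C=1.0) and for an almost unpenalized fit (C=1e9).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: any small, reasonably conditioned dataset would do).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)


def fit_coef(solver, C):
    """Fit a logistic regression with a tight tolerance and return its coefficients."""
    clf = LogisticRegression(solver=solver, C=C, tol=1e-8, max_iter=10000)
    return clf.fit(X, y).coef_.ravel()


# Largest coefficient difference between the two solvers, per regularization level.
gap_regularized = np.abs(fit_coef("lbfgs", 1.0) - fit_coef("newton-cg", 1.0)).max()
gap_weak = np.abs(fit_coef("lbfgs", 1e9) - fit_coef("newton-cg", 1e9)).max()

print(f"C=1.0: max coefficient gap between solvers = {gap_regularized:.2e}")
print(f"C=1e9: max coefficient gap between solvers = {gap_weak:.2e}")
```

How large the second gap is depends on the data; on strongly collinear features it tends to grow, which is the situation the message describes.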

Re: [scikit-learn] scikit-learn website and documentation

2019-09-02 Thread Roman Yurchak
Hello Chiara, as far as I understood, scikit-learn#14849 started as an incremental improvement of the scikit-learn website and ended up as a more in-depth rewrite of the sphinx theme. If you have any comments or suggestions don't hesitate to comment on that issue. For instance, that PR went

Re: [scikit-learn] Normalization in ridge regression when there is no intercept

2019-06-07 Thread Roman Yurchak via scikit-learn
On 06/06/2019 14:56, ahmetcik wrote: > I have just recognized that when using ridge regression without an > intercept no normalization is performed even if the argument "normalize" > is set to True. It's a known longstanding issue https://github.com/scikit-learn/scikit-learn/issues/3020 It would

Re: [scikit-learn] Starting to contribute

2019-04-07 Thread Roman Yurchak via scikit-learn
Hello Heitor, yes, you can choose an issue, comment there that you plan to work on it (to avoid redundant work by other contributors) and, if no one objects, make a PR. If you have any questions you can ask them by commenting on that issue (for specific questions) or on the scikit-learn Gitter

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Roman Yurchak via scikit-learn
+1 for option 1 and +0.5 for 3. Do we anticipate that many plotting functions will be added? If it's just a dozen or fewer, putting them all into a single namespace sklearn.plot might be easier. This also would avoid discussion about where to put some generic plotting functions (e.g.

Re: [scikit-learn] Sprint discussion points?

2019-02-19 Thread Roman Yurchak via scikit-learn
Thanks for putting the draft schedule together! Personally I will be there 3 days out of 5 and wouldn't want to miss the discussion on euclidean distance issues. Maybe we could adjust the schedule during the sprint (say on Tuesday) based on people's interest and availability? That might be

Re: [scikit-learn] Next Sprint

2018-12-22 Thread Roman Yurchak via scikit-learn
That works for me as well. On 21/12/2018 16:00, Olivier Grisel wrote: > Ok for me. The last 3 weeks of February are fine for me. > > On Thu, Dec 20, 2018 at 21:21, Alexandre Gramfort > <alexandre.gramf...@inria.fr> wrote: > > ok for me > > Alex > > On Thu, Dec 20, 2018

Re: [scikit-learn] Recurrent questions about speed for TfidfVectorizer

2018-11-26 Thread Roman Yurchak via scikit-learn
epends on where the bottleneck is. > I'm really surprised they are not used more, but maybe that's just > because implementations are missing? > > On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote: >> Hi Matthieu, >> >> if you are interested in general questions

Re: [scikit-learn] Recurrent questions about speed for TfidfVectorizer

2018-11-26 Thread Roman Yurchak via scikit-learn
Hi Matthieu, if you are interested in general questions regarding improving scikit-learn performance, you might want to have a look at the draft roadmap https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- there are a lot of topics where suggestions / PRs on improving

Re: [scikit-learn] Update or downgrade PCA

2018-07-03 Thread Roman Yurchak
Hi Pamphile, On 03/07/18 10:41, Pamphile Roy wrote: I have some code that allows to upgrade (or downgrade) a PCA with a new sample. The update part is handy when you are doing live observations for instance and you want a quick way to update your PCA without having to recompute the whole

Re: [scikit-learn] Error

2018-05-21 Thread Roman Yurchak
Try opening an issue at their Github issue tracker https://github.com/scikit-multilearn/scikit-multilearn/issues ; providing a detailed description of the issue takes some time but would also make it more likely to get an answer there (see https://stackoverflow.com/help/mcve). -- Roman On

Re: [scikit-learn] DBScan freezes my computer !!!

2018-05-13 Thread Roman Yurchak
Could you please check memory usage while running DBSCAN to make sure freezing is due to running out of memory and not to something else? Which parameters do you run DBSCAN with? Changing algorithm, leaf_size parameters and ensuring n_jobs=1 could help. Assuming eps is reasonable, I think it
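As an illustration of the parameters mentioned above, here is a minimal sketch of a DBSCAN call that sets algorithm, leaf_size and n_jobs explicitly. The dataset and every parameter value are assumptions chosen for the example, not recommendations from the thread.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Small synthetic dataset (assumption: stands in for the user's data).
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.5, random_state=0)

db = DBSCAN(
    eps=0.5,                # neighborhood radius; must suit the scale of the data
    min_samples=5,
    algorithm="ball_tree",  # tree-based neighbor search instead of "auto"
    leaf_size=30,           # affects tree construction speed and memory
    n_jobs=1,               # single process, to rule out parallelism overhead
)
labels = db.fit_predict(X)

# Label -1 marks noise points and is not a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```

Monitoring memory while running this on progressively larger subsets of the real data is one way to check whether the freeze is memory related.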

Re: [scikit-learn] Multi learn error.

2018-05-05 Thread Roman Yurchak
Hi Aijaz, On 05/05/18 07:31, aijaz qazi wrote: > Dear developers of Scikit , Scikit is short for SciPy Toolkits (https://www.scipy.org/scikits.html); there are a number of those. Scikit-learn started as one (and this is the scikit-learn mailing list). The package you are referring to is based on

Re: [scikit-learn] Text classification of large dataet

2017-12-20 Thread Roman Yurchak
Ranjana, have a look at this example: http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html Since you have a lot of RAM, you may not need to make the whole classification pipeline out-of-core; a start with your current code could be to write a generator
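A minimal sketch of the generator idea, in the spirit of the linked out-of-core example. The toy corpus, the helper name iter_minibatches, and the choice of HashingVectorizer with SGDClassifier are assumptions for illustration; in practice the generator would read documents from disk.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_minibatches(docs, labels, batch_size):
    """Yield (documents, labels) chunks; a stand-in for streaming from disk."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size], labels[start:start + batch_size]


# Toy corpus (hypothetical data, only for illustration).
docs = [
    "a great and enjoyable film", "a dull and boring film",
    "great acting, enjoyable plot", "boring plot, dull acting",
    "truly great", "truly dull",
    "enjoyable from start to finish", "boring from start to finish",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# HashingVectorizer is stateless, so it needs no fit and works batch by batch.
vectorizer = HashingVectorizer(n_features=2 ** 12)
clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit needs the full class list up front

for batch_docs, batch_labels in iter_minibatches(docs, labels, batch_size=4):
    X_batch = vectorizer.transform(batch_docs)
    clf.partial_fit(X_batch, batch_labels, classes=all_classes)

pred = clf.predict(vectorizer.transform(["what a great film"]))
print(pred)
```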

Re: [scikit-learn] unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Roman Yurchak
On 16/10/17 17:16, Ismael Lemhadri wrote: My concern is actually not about not mentioning the scaling but about not mentioning the centering. That is, the sklearn PCA removes the mean but it does not mention it in the help file. I think it's currently assumed given the definition of the PCA,

Re: [scikit-learn] unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Roman Yurchak
Ismael, as far as I can see, the sklearn.decomposition.PCA documentation doesn't mention scaling at all (except for the whiten parameter, which is post-transformation scaling). So since it doesn't mention it, it makes sense that it doesn't do any scaling of the input. Same as np.linalg.svd. You can verify
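One way to run such a check is sketched below. The random data and its per-feature scales are assumptions for illustration. The sketch compares PCA with a plain np.linalg.svd of the centered, unscaled data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Features with very different scales and non-zero means (illustrative data).
X = rng.normal(size=(100, 3)) * [1.0, 10.0, 100.0] + [5.0, -3.0, 2.0]

pca = PCA(n_components=3, svd_solver="full").fit(X)

# Manual route: subtract the mean only (no scaling), then take the SVD.
X_centered = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The stored mean shows the centering step.
print(np.allclose(pca.mean_, X.mean(axis=0)))
# Matching singular values show that no rescaling of the input took place.
print(np.allclose(pca.singular_values_, singular_values))
# Components agree up to the sign of each vector.
print(np.allclose(np.abs(pca.components_), np.abs(Vt)))
```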

Re: [scikit-learn] Accessing Clustering Feature Tree in Birch

2017-10-02 Thread Roman Yurchak
nline is that it loses any reference to the individual samples that contributed to each node, but stores some statistics on their basis. Roman Yurchak has, however, offered a PR where, for the non-online case, storage of the indices contributing to each node can be optionall

Re: [scikit-learn] Accessing Clustering Feature Tree in Birch

2017-08-23 Thread Roman Yurchak
> what are the data samples in this cluster Mehmet's response below works for exploring the hierarchical tree. However, Birch currently doesn't store the data samples that belong to a given subcluster. If you need that, as far as I know, a reasonable approximation can be obtained by computing
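The rest of this message is cut off in the listing, so the following is only one possible approximation, not necessarily the one the author had in mind. It assumes Birch is fitted with n_clusters=None, so that predict assigns each sample to its nearest subcluster centroid, and then groups the sample indices by that assignment. The dataset and threshold are illustrative.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# n_clusters=None skips the final global clustering step, so labels
# refer directly to the subclusters (leaf entries of the CF tree).
brc = Birch(n_clusters=None, threshold=1.0).fit(X)

# Nearest-centroid assignment of every sample to a subcluster.
sub_labels = brc.predict(X)

# Map each subcluster id to the indices of the samples assigned to it.
members = {int(k): np.flatnonzero(sub_labels == k) for k in np.unique(sub_labels)}

for k, idx in list(members.items())[:3]:
    print(f"subcluster {k}: {len(idx)} samples")
```

Because the assignment is by nearest centroid, it can differ slightly from the membership built up while the tree was constructed.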

Re: [scikit-learn] How can i write the birch prediction results to the file

2017-08-22 Thread Roman Yurchak
Hello Sema, On 22/08/17 11:24, Sema Atasever wrote: > "joblib.dump" produces a file format with npy extension so I can not open the file with the notepad editor. I can not see the predictions results inside the file. Is there another way to save the prediction results in text format?
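The answer is cut off in this listing. One common way to get predictions into a plain-text file is np.savetxt, sketched below. The dataset, the Birch settings and the output file name are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = Birch(n_clusters=3).fit_predict(X)

# Write one integer label per line; the result opens in any text editor.
out_path = os.path.join(tempfile.mkdtemp(), "birch_predictions.txt")
np.savetxt(out_path, labels, fmt="%d")

# Reading the file back gives the same labels.
reloaded = np.loadtxt(out_path, dtype=int)
print(out_path)
```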

Re: [scikit-learn] Construct the microclusters using a CF-Tree

2017-07-06 Thread Roman Yurchak
Hello Sema, On 05/07/17 13:27, Sema Atasever wrote: How can i know which cluster member represents best each cluster? You could try to pick the one that's closest to the cluster centroid. In the birch code i use this code line: *centroids = brc.subcluster_centers_* How do I interpret this
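Picking the member closest to each centroid can be sketched as follows. The data and Birch settings are assumptions for illustration; pairwise_distances_argmin returns, for each centroid, the index of the nearest sample.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

brc = Birch(n_clusters=None, threshold=1.0).fit(X)
centroids = brc.subcluster_centers_

# For every centroid, the index in X of the sample closest to it.
closest_idx = pairwise_distances_argmin(centroids, X)
representatives = X[closest_idx]

print(representatives.shape)
```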

Re: [scikit-learn] Construct the microclusters using a CF-Tree

2017-07-03 Thread Roman Yurchak
e.com/file/d/0B4rY6f4kvHeCYlpZOURKNnR0Q1k/view?usp=drive_web> On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak <rth.yurc...@gmail.com> wrote: Hello Sema, On 30/06/17 17:14, Sema Atasever wrote: I want to cluster them using Birch cluster

Re: [scikit-learn] Construct the microclusters using a CF-Tree

2017-06-30 Thread Roman Yurchak
Hello Sema, On 30/06/17 17:14, Sema Atasever wrote: I want to cluster them using Birch clustering algorithm. Does this method have a 'precomputed' option? No, it doesn't (see http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html), so you would need to provide it with the

Re: [scikit-learn] best way to scale on the random forest for text w bag of words ...

2017-03-16 Thread Roman Yurchak
If you run out of memory at the prediction step, splitting the test dataset in batches, then concatenating the results should work fine. Why would it "skew" the results? 70GB RAM seems huge: for comparison here are some categorization benchmarks on a 700k text dataset, that use more in the
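A minimal sketch of batched prediction, showing that the concatenated result is the same as predicting on the whole test set at once. The dataset, model size and batch size are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X[:400], y[:400])
X_test = X[400:]

# Predict in chunks so that only one batch is processed at a time.
batch_size = 64
y_pred_batched = np.concatenate([
    clf.predict(X_test[start:start + batch_size])
    for start in range(0, X_test.shape[0], batch_size)
])

# Reference: predict everything in one call.
y_pred_full = clf.predict(X_test)
print(np.array_equal(y_pred_batched, y_pred_full))
```

Each sample is predicted independently of the others, which is why batching does not change the output.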

Re: [scikit-learn] Roc curve from multilabel classification has slope

2017-01-08 Thread Roman Yurchak
José, I might be misunderstanding something, but wouldn't it make more sense to plot one ROC curve for every class in your result (using all samples at once), as opposed to plotting it for every training sample as you are doing now? Cf the example below,
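The example from the original message is not preserved in this listing. A minimal sketch of the per-class approach, which is not the original code, could look like this; the synthetic multilabel data and the one-vs-rest logistic regression are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(
    n_samples=500, n_features=20, n_classes=3, random_state=0
)
X_train, X_test, Y_train, Y_test = X[:300], X[300:], Y[:300], Y[300:]

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, Y_train)
scores = clf.predict_proba(X_test)  # shape (n_samples, n_classes)

# One ROC curve per class, each computed over all test samples at once.
roc_per_class = {}
for k in range(Y.shape[1]):
    fpr, tpr, _ = roc_curve(Y_test[:, k], scores[:, k])
    roc_per_class[k] = (fpr, tpr, auc(fpr, tpr))

for k, (_, _, area) in roc_per_class.items():
    print(f"class {k}: AUC = {area:.3f}")
```

Each (fpr, tpr) pair can then be passed to a plotting library to draw one curve per class.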

Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-27 Thread Roman Yurchak
Hi Debu, On 27/12/16 08:18, Andrew Howe wrote: > 5. I got a prediction result with True Positive Rate (TPR) as 10-12 > % on probability thresholds above 0.5 Getting a high True Positive Rate (recall) is not a sufficient condition for a well behaved model. Though 0.1 recall is still

Re: [scikit-learn] Specifying exceptions to ParameterGrid

2016-11-25 Thread Roman Yurchak
On 24/11/16 09:00, Jaidev Deshpande wrote: > > well, `param_grid` in GridSearchCV can also be a list of dictionaries, > so you could directly specify the cases you are interested in (instead > of the full grid - exceptions), which might be simpler? > > > Actually now that I think of
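The list-of-dictionaries form of `param_grid` mentioned in the quoted text can be sketched as follows. The estimator, dataset and parameter values are assumptions for illustration. Each dictionary is expanded into its own sub-grid, so combinations that make no sense (here, gamma with a linear kernel) are never generated.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

# 2 linear candidates + 2 * 2 rbf candidates.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))

# The same structure can be passed directly to GridSearchCV.
X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print(search.best_params_)
```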

Re: [scikit-learn] Confidence Estimation for Regressor Predictions

2016-09-01 Thread Roman Yurchak
I'm also interested to know if there are any projects similar to scikit-learn-contrib/forest-confidence-interval for linear_model or SVM regressors. In the general case, I think you could get a quick first order approximation of the confidence interval for your regressor, if you take the standard

Re: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD

2016-08-29 Thread Roman Yurchak
Thank you for all your responses! In LSA, I think the following are equivalent: - apply an L2 normalization (not the StandardScaler) after the LSA and then compute the cosine similarity between document vectors simply as a dot product. - not apply the L2 normalization and call the
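The equivalence described here can be checked numerically. The second option is cut off in the listing; the sketch below assumes it refers to a function such as sklearn.metrics.pairwise.cosine_similarity applied to the unnormalized LSA vectors. The toy corpus and the number of components are also assumptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Toy corpus (hypothetical data, only for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stock markets fell on monday",
    "markets and stocks rose on friday",
]

X_tfidf = TfidfVectorizer().fit_transform(docs)
X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_tfidf)

# Option 1: L2-normalize the LSA vectors, then use a plain dot product.
X_norm = normalize(X_lsa)  # norm="l2" is the default
sim_dot = X_norm @ X_norm.T

# Option 2: no explicit normalization, cosine similarity computed directly.
sim_cos = cosine_similarity(X_lsa)

print(np.allclose(sim_dot, sim_cos))
```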

[scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD

2016-08-26 Thread Roman Yurchak
Hi all, I have a question about using the TruncatedSVD method for performing Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply applying TruncatedSVD to a tf-idf matrix is sufficient (cf.

Re: [scikit-learn] memory efficient feature extraction

2016-06-06 Thread Roman Yurchak
Hi Joel, thanks for your response. On 06/06/16 14:29, Joel Nothman wrote: > - concatenation of these arrays into a single CSR array appears to be > non-trivial given the memory constraints (e.g. scipy.sparse.vstack > transforms all arrays to COO sparse representation internally). >