Re: [scikit-learn] What is the ECCN (Export Control Classification Number) number & COO (Country of Origin) of scikit-learn

2022-01-10 Thread Roman Yurchak
Hi Anup, as far as I know scikit-learn is not export controlled. It has no components that belong to that classification. You can check for yourself through the lists provided, for instance, in the links of this blog post https://www.magicsplat.com/blog/ear/index.html to determine it. Though I'm not a

Re: [scikit-learn] New core dev: Julien Jerphanion

2021-10-30 Thread Roman Yurchak
Congratulations, Julien, and thanks for all your work! Roman On 30/10/2021 11:18, Guillaume Lemaître wrote: The scikit-learn core development team has welcomed a new member, Julien Jerphanion, who has contributed code, reviews, and documentation since this March (aside from occasional contribut

Re: [scikit-learn] sklearn-porter support

2021-05-04 Thread Roman Yurchak
Hi Joe, sklearn-porter is a nice project, however people on this mailing list are not really involved with its development. You would likely get more relevant answers to your questions by asking the author directly, for instance in a Github issue. I'm sure they would appreciate an offer to h

Re: [scikit-learn] [ANN] scikit-learn 0.24.2 is online!

2021-04-28 Thread Roman Yurchak
Thanks for making this bug fix release happen! -- Roman On 28/04/2021 19:23, Guillaume Lemaître wrote: scikit-learn 0.24.2 is out on pypi.org and conda-forge! This is a small maintenance release that fixes a couple of regressions: https://scikit-learn.org/stable/whats_new/v0

Re: [scikit-learn] Issue in BIRCH clustering algo

2021-02-11 Thread Roman Yurchak
It's a known issue, see https://github.com/scikit-learn/scikit-learn/issues/17966 Someone would need to investigate more to find a fix though. If you have a minimal reproducible example that's different from the one in that issue, and could post it there it would help. Roman On 11/02/20

Re: [scikit-learn] imbalanced datasets return uncalibrated predictions - why?

2020-11-17 Thread Roman Yurchak
On 17/11/2020 09:57, Sole Galli via scikit-learn wrote: And I understand that it has to do with the cost function, because if we re-balance the dataset with say class_weight = 'balanced', then the probabilities seem to be calibrated as a result. As far as I know, logistic regression will have well

Re: [scikit-learn] Voting software

2020-04-28 Thread Roman Yurchak
*From:* scikit-learn on behalf of Roman Yurchak *Sent:* Monday, April 27, 2020 10:30:49 AM *To:* Scikit-learn user and developer mailing list *Subject:* Re: [scikit-learn] Voting software BTW, could we use some online voting softwa

Re: [scikit-learn] Voting software

2020-04-27 Thread Roman Yurchak
BTW, could we use some online voting software for votes? Just to avoid filling public email threads with +1s. For instance CPython uses https://www.python.org/dev/peps/pep-8001/ but it is anonymous. Does anyone know a simple non anonymous one preferably linked to Github authentication? On 27/

Re: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee

2020-04-27 Thread Roman Yurchak
+1 On 27/04/2020 15:20, Jeremie du Boisberranger wrote: +1 On 27/04/2020 15:18, Nicolas Hug wrote: +1 On 4/27/20 9:16 AM, Gael Varoquaux wrote: +1 And thank you very much Adrin! On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: Hi All. Given all his recent contributions, I

Re: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team

2020-03-27 Thread Roman Yurchak
Very interesting! A few comments, > From GH17, we managed to extract only 10.5k pipelines. The relatively low frequency (with respect to the number of notebooks using SCIKIT-LEARN [..]) indicates a non-wide adoption of this specification. However, the number of pipelines in the GH19 corpus is

Re: [scikit-learn] transfer learning doubt

2020-03-19 Thread Roman Yurchak
On 19/03/2020 14:19, Farzana Anowar wrote: > Another option is to use deep learning and store the weights for the first model and initialize the second model with that weight and keep doing it for the rest of the models. This can also be done in scikit-learn with models that support warm_start
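
A minimal sketch of the warm_start idea mentioned above. The choice of SGDClassifier and the two synthetic datasets are assumptions made for illustration, not something taken from the thread; any estimator that documents a warm_start parameter should behave similarly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Two synthetic "tasks" sharing the same feature space (illustrative data only).
X1, y1 = make_classification(n_samples=200, n_features=10, random_state=0)
X2, y2 = make_classification(n_samples=200, n_features=10, random_state=1)

clf = SGDClassifier(warm_start=True, random_state=0)

# First fit: coefficients start from the default initialization.
clf.fit(X1, y1)
coef_first = clf.coef_.copy()

# Second fit: with warm_start=True the optimizer starts from coef_first
# instead of re-initializing, so the first model seeds the second one.
clf.fit(X2, y2)
coef_second = clf.coef_.copy()

print(coef_first.shape, coef_second.shape)
```

Both fits must see the same number of features and the same set of classes for the reused coefficients to make sense.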

Re: [scikit-learn] Update of 'Upcoming events' on the scikit-learn wiki

2019-12-06 Thread Roman Yurchak
Thank you, Chiara! I think announcing some of the main planned sprints on the mailing list and twitter would be helpful. Last sprint (in London) contributors were interested in knowing how they could follow when next sprints would happen, and we didn't have a clear answer then (short of follow

Re: [scikit-learn] Vote on SLEP010: n_features_in_ attribute

2019-12-06 Thread Roman Yurchak
On 04/12/2019 20:44, Joel Nothman wrote: I am +1 for this, but I think we should look at how to make these new validation methods usable by external developers +1 for the SLEP and for finding a way to make this method usable by external developers maybe as part of the developer API. _

Re: [scikit-learn] Monthly meetings

2019-11-13 Thread Roman Yurchak
Thanks for the reminder! Is there a way to put these periodic meetings in a calendar (either in some shared calendar or as calendar invitations for people who are likely to participate/were there last time) ? Cheers, Roman On 13/11/2019 23:14, Nicolas Hug wrote: Hey everyone, The next mont

Re: [scikit-learn] scikit-learn twitter account

2019-11-05 Thread Roman Yurchak
Maybe re-purposing? I'm not sure if people find useful the current approach of a tweet per PR. It would make things less confusing to have 1 account. Looking how other OSS projects do this would also be interesting. On 05/11/2019 06:14, Andreas Mueller wrote: Should we re-purpose the existing

Re: [scikit-learn] logistic regression results are not stable between solvers

2019-10-09 Thread Roman Yurchak
Ben, I can confirm your results with penalty='none' and C=1e9. In both cases, you are running a mostly unpenalized logistic regression. Usually that's less numerically stable than with a small regularization, depending on the data collinearity. Running that same code with - larger penalty
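
To illustrate the point about regularization, here is a hedged sketch comparing two solvers on the same L2-penalized problem. The synthetic data, the value C=1.0 and the tight tolerance are assumptions chosen for the example, not the settings from the original report.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data without redundant (collinear) columns.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)

coefs = {}
for solver in ("lbfgs", "newton-cg"):
    # A moderate penalty (C=1.0) makes the optimum unique and well conditioned.
    clf = LogisticRegression(C=1.0, solver=solver, tol=1e-8, max_iter=5000)
    coefs[solver] = clf.fit(X, y).coef_.ravel()

# Largest coefficient difference between the two solvers.
max_diff = np.abs(coefs["lbfgs"] - coefs["newton-cg"]).max()
print(max_diff)
```

With a very large C (almost no penalty) and collinear features, the same comparison is expected to show larger discrepancies between solvers.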

Re: [scikit-learn] Vote on SLEP009: keyword only arguments

2019-09-16 Thread Roman Yurchak
+1 assuming we are careful about continuing to allow some frequently used positional arguments, even in __init__. For instance, n_components = 10 pca = PCA(n_components) is still more readable, I think, than, pca = PCA(n_components=n_components) -- Roman On 15/09/2019 00:21, Thomas J Fan w

Re: [scikit-learn] scikit-learn website and documentation

2019-09-02 Thread Roman Yurchak
Hello Chiara, as far as I understood scikit-learn#14849 started as an incremental improvement of the scikit-learn website and ended up as a more in depth rewrite of the sphinx theme. If you have any comments or suggestions don't hesitate to comment on that issue. For instance, that PR went w

Re: [scikit-learn] titanic dataset, use for book

2019-06-27 Thread Roman Yurchak via scikit-learn
Meanwhile, loading the CSV from OpenML (https://www.openml.org/d/40945) would also work, pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl') -- Roman On 25/06/2019 17:04, Andreas Mueller wrote: > By the time your book comes out, it's likely to be merged, but might not > be r

Re: [scikit-learn] Normalization in ridge regression when there is no intercept

2019-06-07 Thread Roman Yurchak via scikit-learn
On 06/06/2019 14:56, ahmetcik wrote: > I have just recognized that when using ridge regression without an > intercept no normalization is performed even if the argument "normalize" > is set to True. It's a known longstanding issue https://github.com/scikit-learn/scikit-learn/issues/3020 It would

Re: [scikit-learn] Starting to contribute

2019-04-07 Thread Roman Yurchak via scikit-learn
Hello Heitor, yes, you can choose an issue, comment there that you plan to work on it (to avoid redundant work by other contributors) and if no one objects make a PR. If you have any questions you can ask them by commenting on that issue (for specific questions) or on the scikit-learn Gitter ht

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Roman Yurchak via scikit-learn
+1 for option 1 and +0.5 for 3. Do we anticipate that many plotting functions will be added? If it's just a dozen or less, putting them all into a single namespace sklearn.plot might be easier. This also would avoid discussion about where to put some generic plotting functions (e.g. https://g

Re: [scikit-learn] Sprint discussion points?

2019-02-19 Thread Roman Yurchak via scikit-learn
Thanks for putting the draft schedule together! Personally I will be there 3 days out of 5 and wouldn't want to miss the discussion on euclidean distance issues. Maybe we could adjust the schedule during the sprint (say on Tuesday) based on people's interest and availability? That might be easi

Re: [scikit-learn] VOTE: scikit-learn governance document

2019-02-11 Thread Roman Yurchak via scikit-learn
+1 as well Roman On 11/02/2019 09:47, Gael Varoquaux wrote: > +1 on my side too. > > Thanks a lot Andy for moving this forward. > > Gaël > > On Mon, Feb 11, 2019 at 07:53:51AM +, Vlad Niculae wrote: >> +1 > >> Thank you for the effort to formalize this! > >> Best, >> Vlad > >> On Mon, F

Re: [scikit-learn] Next Sprint

2018-12-22 Thread Roman Yurchak via scikit-learn
That works for me as well. On 21/12/2018 16:00, Olivier Grisel wrote: > Ok for me. The last 3 weeks of February are fine for me. > > On Thu, 20 Dec 2018 at 21:21, Alexandre Gramfort <alexandre.gramf...@inria.fr> wrote: > > ok for me > > Alex > > On Thu, Dec 20, 2018 at

Re: [scikit-learn] Recurrent questions about speed for TfidfVectorizer

2018-11-26 Thread Roman Yurchak via scikit-learn
> depends on where the bottleneck is. > I'm really surprised they are not used more, but maybe that's just > because implementations are missing? > > On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote: >> Hi Matthieu, >> >> if you are interested in ge

Re: [scikit-learn] Recurrent questions about speed for TfidfVectorizer

2018-11-26 Thread Roman Yurchak via scikit-learn
Hi Matthieu, if you are interested in general questions regarding improving scikit-learn performance, you might want to have a look at the draft roadmap https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- there are a lot of topics where suggestions / PRs on improving performa

Re: [scikit-learn] Update or downgrade PCA

2018-07-03 Thread Roman Yurchak
Hi Pamphile, On 03/07/18 10:41, Pamphile Roy wrote: I have some code that allows to upgrade (or downgrade) a PCA with a new sample. The update part is handy when you are doing live observations for instance and you want a quick way to update your PCA without having to recompute the whole thing

Re: [scikit-learn] Error

2018-05-21 Thread Roman Yurchak
Try opening an issue at their Github issue tracker https://github.com/scikit-multilearn/scikit-multilearn/issues ; providing a detailed description of the issue takes some time but would also make it more likely to get an answer there (see https://stackoverflow.com/help/mcve). -- Roman On 21

Re: [scikit-learn] DBScan freezes my computer !!!

2018-05-13 Thread Roman Yurchak
Could you please check memory usage while running DBSCAN to make sure freezing is due to running out of memory and not to something else? Which parameters do you run DBSCAN with? Changing algorithm, leaf_size parameters and ensuring n_jobs=1 could help. Assuming eps is reasonable, I think it sh
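
As a sketch of the parameters mentioned above, the following sets algorithm, leaf_size and n_jobs explicitly. The synthetic data and the particular values of eps, min_samples and leaf_size are assumptions for illustration only; suitable values depend on the dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Small synthetic dataset (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.3, random_state=0)

# A tree-based neighbor search avoids materializing a dense distance matrix,
# and n_jobs=1 keeps a single process so memory use is easier to monitor.
db = DBSCAN(eps=0.5, min_samples=5, algorithm="ball_tree", leaf_size=30, n_jobs=1)
labels = db.fit_predict(X)

# Label -1 marks noise points and is not counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

A too-large eps makes every neighborhood contain a large share of the dataset, which is a common cause of high memory use regardless of the algorithm chosen.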

Re: [scikit-learn] Multi learn error.

2018-05-04 Thread Roman Yurchak
Hi Aijaz, On 05/05/18 07:31, aijaz qazi wrote: > Dear developers of Scikit , Scikit is short for SciPy Toolkits (https://www.scipy.org/scikits.html); there are a number of those. Scikit-learn started as one (and this is the scikit-learn mailing list). The package you are referring to is based on

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-30 Thread Roman Yurchak
Hi Yacine, On 29/01/18 16:39, Yacine MAZARI wrote: >> I wouldn't hate if length normalisation was added to if it was shown that normalising before IDF multiplication was more effective than (or complementary >> to) norming afterwards. I think this is one of the most important points here. T

Re: [scikit-learn] Text classification of large dataet

2017-12-20 Thread Roman Yurchak
Ranjana, have a look at this example http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html Since you have a lot of RAM, you may not need to make all the classification pipeline out-of-core, a start with your current code could be to write a generator
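
A minimal sketch of the generator-based approach, in the spirit of the linked out-of-core example. The helper iter_batches, the toy documents and the choice of SGDClassifier are hypothetical stand-ins introduced here, not code from the thread.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_batches(docs, labels, batch_size):
    """Yield (texts, labels) batches; a stand-in for streaming from disk."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size], labels[start:start + batch_size]


# Toy corpus (illustrative only).
docs = ["cheap pills online", "meeting at noon",
        "win money now", "project status report"] * 5
labels = [1, 0, 1, 0] * 5

# HashingVectorizer is stateless, so it needs no fit pass over the full corpus.
vectorizer = HashingVectorizer(n_features=2**12)
clf = SGDClassifier(random_state=0)

for texts, y in iter_batches(docs, labels, batch_size=4):
    X_batch = vectorizer.transform(texts)
    # partial_fit updates the model one batch at a time; the full set of
    # classes is passed so the first call knows about every label.
    clf.partial_fit(X_batch, y, classes=np.array([0, 1]))

pred = clf.predict(vectorizer.transform(["win cheap money"]))
print(pred)
```

Only one batch of features is held in memory at a time, which is the property that matters for a dataset that does not fit in RAM.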

Re: [scikit-learn] 1. Re: unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Roman Yurchak
@python.org> <scikit-learn-ow...@python.org> > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > >

Re: [scikit-learn] unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Roman Yurchak
On 16/10/17 17:16, Ismael Lemhadri wrote: My concern is actually not about not mentioning the scaling but about not mentioning the centering. That is, the sklearn PCA removes the mean but it does not mention it in the help file. I think it's currently assumed given the definition of the PCA, bu

Re: [scikit-learn] unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Roman Yurchak
Ismael, as far as I saw the sklearn.decomposition.PCA doesn't mention scaling at all (except for the whiten parameter which is post-transformation scaling). So since it doesn't mention it, it makes sense that it doesn't do any scaling of the input. Same as np.linalg.svd. You can verify that
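
One way to check this is sketched below: compare PCA with a plain SVD of the centered, unscaled data. The random data with deliberately different column scales is an assumption made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Columns with very different scales and a non-zero mean (illustrative data).
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0]) + 3.0

pca = PCA(n_components=3, svd_solver="full").fit(X)

# Center only (subtract the column means); no division by the standard deviation.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Singular values match those of the centered, unscaled data.
same_singular_values = np.allclose(pca.singular_values_, s[:3])
# Components match the right singular vectors up to a sign flip.
same_components = np.allclose(np.abs(pca.components_), np.abs(Vt[:3]))
print(same_singular_values, same_components)
```

If PCA standardized its input, the singular values would instead match those of the data divided by its column standard deviations.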

Re: [scikit-learn] Accessing Clustering Feature Tree in Birch

2017-10-02 Thread Roman Yurchak
reference to the individual samples that contributed to each node, but stores some statistics on their basis. Roman Yurchak has, however, offered a PR where, for the non-online case, storage of the indices contributing to each node can be optionally turned on: https://github.com/sci

Re: [scikit-learn] TF-IDF

2017-10-02 Thread Roman Yurchak
Hi Apurva, if you consider the operations done by the augmented frequency and the cosine normalization independently from everything else, they are somewhat similar. The normalization by max is a p-norm with p→+∞. So apart from the 0.5 offset, both can be seen as document length normalizati

Re: [scikit-learn] Accessing Clustering Feature Tree in Birch

2017-08-23 Thread Roman Yurchak
> what are the data samples in this cluster Mehmet's response below works for exploring the hierarchical tree. However, Birch currently doesn't store the data samples that belong to a given subcluster. If you need that, as far as I know, a reasonable approximation can be obtained by computing
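
The message above is cut off before it says what to compute. The sketch below assumes that assigning each sample to its nearest subcluster center is the kind of approximation meant; the data and the threshold value are also illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# n_clusters=None skips the global clustering step and keeps the subclusters.
brc = Birch(threshold=1.0, n_clusters=None).fit(X)
centers = brc.subcluster_centers_

# Assumption: approximate subcluster membership by the nearest subcluster center.
nearest = pairwise_distances_argmin(X, centers)
members = {i: np.flatnonzero(nearest == i) for i in range(len(centers))}
print({i: len(idx) for i, idx in members.items()})
```

This is only an approximation: the assignment made while the tree was built may differ from a nearest-center assignment computed afterwards.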

Re: [scikit-learn] How can i write the birch prediction results to the file

2017-08-22 Thread Roman Yurchak
Hello Sema, On 22/08/17 11:24, Sema Atasever wrote: > "joblib.dump" produces a file format with npy extension so I can not open the file with the notepad editor. I can not see the predictions results inside the file. Is there another way to save the prediction results in text format? Predic
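
The reply above is truncated, so the following is only one possible approach and not necessarily the one suggested in the thread: write the predicted labels to a plain-text file with NumPy. The data and the output file name are hypothetical.

```python
import os
import tempfile

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
labels = Birch(n_clusters=3).fit_predict(X)

# One integer label per line, readable in any text editor.
out_path = os.path.join(tempfile.mkdtemp(), "predictions.txt")
np.savetxt(out_path, labels, fmt="%d")

# Read the file back to confirm the round trip.
reloaded = np.loadtxt(out_path, dtype=int)
print(reloaded[:10])
```

joblib.dump remains the better choice for persisting the fitted model itself; a text file is only convenient for inspecting the predictions.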

Re: [scikit-learn] Construct the microclusters using a CF-Tree

2017-07-06 Thread Roman Yurchak
Hello Sema, On 05/07/17 13:27, Sema Atasever wrote: How can i know which cluster member represents best each cluster? You could try to pick the one that's closest to the cluster centroid. In the birch code i use this code line: *centroids = brc.subcluster_centers_* How do I interpret this l
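
A short sketch of picking the member closest to each centroid, assuming Euclidean distance and synthetic data; the variable names other than brc and centroids are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin_min

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

brc = Birch(n_clusters=None).fit(X)
centroids = brc.subcluster_centers_

# For each centroid, find the index of the closest sample in X
# and the corresponding distance.
closest_idx, closest_dist = pairwise_distances_argmin_min(centroids, X)
representatives = X[closest_idx]
print(representatives.shape)
```

Each row of representatives is an actual data sample, unlike the centroid, which is an average and generally does not coincide with any sample.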

Re: [scikit-learn] Construct the microclusters using a CF-Tree

2017-07-03 Thread Roman Yurchak
rive.google.com/file/d/0B4rY6f4kvHeCYlpZOURKNnR0Q1k/view?usp=drive_web> On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak <rth.yurc...@gmail.com> wrote: Hello Sema, On 30/06/17 17:14, Sema Atasever wrote: I want to cluster them using Birch clustering algorithm

Re: [scikit-learn] Construct the microclusters using a CF-Tree

2017-06-30 Thread Roman Yurchak
Hello Sema, On 30/06/17 17:14, Sema Atasever wrote: I want to cluster them using Birch clustering algorithm. Does this method have 'precomputed' option. No it doesn't, see http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html so you would need to provide it with the ori

Re: [scikit-learn] Machine learning for PU data

2017-06-30 Thread Roman Yurchak
Hello Ruchika, I don't think that scikit-learn currently has algorithms that can train with positive and unlabeled class labels only. However, you could try one of the following compatible wrappers, - http://nktmemo.github.io/jekyll/update/2015/11/07/pu_classification.html - https://githu

Re: [scikit-learn] How to dump a model to txt file?

2017-04-14 Thread Roman Yurchak
Also, there is an effort on converting trained scikit-learn models to other languages (e.g. C) in https://github.com/nok/sklearn-porter but it does not support GradientBoostingRegressor (yet). On 13/04/17 23:27, federico vaggi wrote: If you want to use the model from C++ code, the easiest way i

Re: [scikit-learn] best way to scale on the random forest for text w bag of words ...

2017-03-16 Thread Roman Yurchak
If you run out of memory at the prediction step, splitting the test dataset in batches, then concatenating the results should work fine. Why would it "skew" the results? 70GB RAM seems huge: for comparison here are some categorization benchmarks on a 700k text dataset, that use more in the orde
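
The batching suggestion can be sketched as follows. The synthetic data, the forest size and the batch size are assumptions for the example; the point is that predictions are computed per sample, so batching does not change them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X[:700], y[:700])

X_test = X[700:]
batch_size = 64

# Predict batch by batch, then stitch the results back together in order.
pred_batched = np.concatenate(
    [clf.predict(X_test[i:i + batch_size]) for i in range(0, len(X_test), batch_size)]
)

# Reference: predict on the whole test set at once.
pred_full = clf.predict(X_test)
print(np.array_equal(pred_batched, pred_full))
```

Only the peak memory differs between the two approaches, since each batch is transformed and scored independently of the others.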

Re: [scikit-learn] Error while using GridSearchCV.

2017-03-07 Thread Roman Yurchak
Shubham, the definition of ShuffleSplit.__init__ is ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None) you are passing the n_splits parameter twice (once named and once as the first parameter), as the exception that you are getting says, -- Roman On 07/03/17 14:24, Shub
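
A small sketch of the error and its fix. The specific values passed are assumptions, since the original call is not shown in full above.

```python
from sklearn.model_selection import ShuffleSplit

# Correct: every argument is passed exactly once, by name.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Incorrect: the first positional argument is already n_splits,
# so naming it again passes the same parameter twice.
try:
    ShuffleSplit(5, n_splits=5)
    raised = False
except TypeError as exc:
    raised = True
    print(exc)
```

This kind of mistake often appears when porting code from the older sklearn.cross_validation module, where the first positional argument had a different meaning.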

Re: [scikit-learn] Roc curve from multilabel classification has slope

2017-01-08 Thread Roman Yurchak
José, I might be misunderstanding something, but wouldn't it make more sense to plot one ROC curve for every class in your result (using all samples at once), as opposed to plotting it for every training sample as you are doing now? Cf the example below, http://scikit-learn.org/stable/auto_examples
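
A hedged sketch of computing one ROC curve per class over all test samples. It uses the iris dataset and a logistic regression as stand-ins, which is a multiclass rather than a multilabel setting; the per-column logic is the same once the targets are in indicator form.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)

# One indicator column per class, aligned with the columns of y_score.
y_bin = label_binarize(y_test, classes=clf.classes_)

roc_auc = {}
for k, cls in enumerate(clf.classes_):
    # Each curve uses all test samples for a single class column.
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_score[:, k])
    roc_auc[int(cls)] = auc(fpr, tpr)

print(roc_auc)
```

Plotting fpr against tpr inside the loop gives one curve per class, each built from many thresholds rather than from a single sample.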

Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-27 Thread Roman Yurchak
Hi Debu, On 27/12/16 08:18, Andrew Howe wrote: > 5. I got a prediction result with True Positive Rate (TPR) as 10-12 > % on probability thresholds above 0.5 Getting a high True Positive Rate (recall) is not a sufficient condition for a well behaved model. Though 0.1 recall is still p

Re: [scikit-learn] Specifying exceptions to ParameterGrid

2016-11-25 Thread Roman Yurchak
On 24/11/16 09:00, Jaidev Deshpande wrote: > > well, `param_grid` in GridSearchCV can also be a list of dictionaries, > so you could directly specify the cases you are interested in (instead > of the full grid - exceptions), which might be simpler? > > > Actually now that I think of

Re: [scikit-learn] Specifying exceptions to ParameterGrid

2016-11-23 Thread Roman Yurchak
Hi Jaidev, well, `param_grid` in GridSearchCV can also be a list of dictionaries, so you could directly specify the cases you are interested in (instead of the full grid - exceptions), which might be simpler? On 23/11/16 11:15, Jaidev Deshpande wrote: > Hi, > > Sometimes when using GridSearchCV,
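
To make the list-of-dictionaries suggestion concrete, here is a small sketch using ParameterGrid, which GridSearchCV uses to expand param_grid. The SVC-style parameter names are an assumed example, not the grid from the thread.

```python
from sklearn.model_selection import ParameterGrid

# Each dictionary is expanded separately, so gamma is only
# combined with the rbf kernel and never with the linear one.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

combos = list(ParameterGrid(param_grid))
for params in combos:
    print(params)
```

The same list can be passed directly as param_grid to GridSearchCV, which then evaluates 2 + 4 = 6 candidates instead of the full 8-point product grid.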

Re: [scikit-learn] hierarchical clustering

2016-11-04 Thread Roman Yurchak
Hi Jaime, Alternatively, in scikit learn I think, you could use hac = AgglomerativeClustering(n_clusters, linkage="ward") hac.fit(data) clusters = hac.labels_ there is an example on how to plot a dendrogram from this in https://github.com/scikit-learn/scikit-learn/pull/3464 Agglomerat

Re: [scikit-learn] Why does sci-kit learn's hashingvectorizer give negative values?

2016-10-01 Thread Roman Yurchak
On 01/10/16 15:34, Moyi Dang wrote: > However, I don't understand why the negatives are there in the first > place, or what they mean. I'm not sure if the absolute values are > corresponding to the token counts. > > Can someone please help explain what the HashingVectorizer is doing? How > do I ge
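
The reply to this question is not shown above, so the sketch below is not a reconstruction of it. It only illustrates, under the assumption of a recent scikit-learn version where HashingVectorizer has an alternate_sign parameter, how the signed and unsigned outputs differ on a toy corpus.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

# Default behavior: a sign derived from the hash is applied to each token,
# so some entries of the output are negative.
signed = HashingVectorizer(n_features=2**10, norm=None).transform(docs)

# With alternate_sign=False every token adds +1 to its hashed column,
# so entries are plain (possibly collided) token counts.
unsigned = HashingVectorizer(
    n_features=2**10, norm=None, alternate_sign=False
).transform(docs)

print(signed.min(), unsigned.min(), unsigned.sum())
```

The alternating sign is meant to make hash collisions cancel out on average; taking absolute values after the fact does not recover exact counts where collisions occurred.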

Re: [scikit-learn] Issue with sklearn.neural_network

2016-09-09 Thread Roman Yurchak
Ibrahim, I believe the sklearn.neural_network.MLPClassifier was added in the not yet released v0.18 (current dev version), http://scikit-learn.org/dev/modules/neural_networks_supervised.html -- Roman On 09/09/16 10:19, Ibrahim Dalal via scikit-learn wrote: > Dear Developers, > > I am using sklear

Re: [scikit-learn] Confidence Estimation for Regressor Predictions

2016-09-01 Thread Roman Yurchak
oad, Johns Creek, GA 30097 | dale.t.sm...@macys.com > > > -Original Message- > From: scikit-learn > [mailto:scikit-learn-bounces+dale.t.smith=macys@python.org] On Behalf Of > Roman Yurchak > Sent: Thursday, September 1, 2016 3:45 PM > To: Scikit-learn user and d

Re: [scikit-learn] Confidence Estimation for Regressor Predictions

2016-09-01 Thread Roman Yurchak
I'm also interested to know if there are any projects similar to scikit-learn-contrib/forest-confidence-interval for linear_model or SVM regressors. In the general case, I think you could get a quick first order approximation of the confidence interval for your regressor, if you take the standard

Re: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD

2016-08-29 Thread Roman Yurchak
Thank you for all your responses! In the LSA what is equivalent, I think, is - to apply a L2 normalization (not the StandardScaler) after the LSA and then compute the cosine similarity between document vectors simply as a dot product. - not apply the L2 normalization and call the `cosine_sim
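
The equivalence described above can be checked with a short sketch. The toy corpus and n_components=2 are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import Normalizer

docs = [
    "the cat sat on the mat",
    "a cat and a dog played on the mat",
    "stock markets fell sharply today",
    "investors worry as markets fall",
    "the dog chased the cat",
]

X_tfidf = TfidfVectorizer().fit_transform(docs)
X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_tfidf)

# Option 1: L2-normalize each document vector, then use a plain dot product.
X_unit = Normalizer(norm="l2").fit_transform(X_lsa)
sim_dot = X_unit @ X_unit.T

# Option 2: skip the normalization and call cosine_similarity directly.
sim_cos = cosine_similarity(X_lsa)

print(np.allclose(sim_dot, sim_cos))
```

StandardScaler would instead rescale each LSA component (column), which changes the angles between document vectors and is not equivalent to either option.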

[scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD

2016-08-26 Thread Roman Yurchak
Hi all, I have a question about using the TruncatedSVD method for performing Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply applying TruncatedSVD to a tf-idf matrix is sufficient (cf. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

Re: [scikit-learn] memory efficient feature extraction

2016-06-06 Thread Roman Yurchak
Hi Joel, thanks for your response. On 06/06/16 14:29, Joel Nothman wrote: > - concatenation of these arrays into a single CSR array appears to be > non-trivial given the memory constraints (e.g. scipy.sparse.vstack > transforms all arrays to COO sparse representation internally). >

[scikit-learn] memory efficient feature extraction

2016-06-06 Thread Roman Yurchak
Dear all, I was wondering if somebody could advise on the best way for generating/storing large sparse feature sets that do not fit in memory? In particular, I have the following workflow, Large text dataset -> HashingVectorizer -> Feature set in a sparse CSR array on disk -> Training a classifie