Thanks for making it happen!
--
Olivier
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
I think it could be implemented as a preprocessing step: this is the
approach followed by:
https://github.com/ryankiros/skip-thoughts/blob/master/eval_classification.py
Note that in that case LogisticRegression is used as the final
classifier instead of a squared hinge loss SVM but that should not
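A minimal sketch of that interchangeability, using iris features as a stand-in for the extracted sentence vectors (dataset and solver settings are my own assumptions, not from the thread):

```python
# Once features are extracted as a preprocessing step, the final
# classifier is interchangeable: LogisticRegression vs. a squared
# hinge loss linear SVM. Iris is a stand-in for extracted features.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

clf_lr = LogisticRegression(max_iter=1000).fit(X, y)
clf_svm = LinearSVC(loss="squared_hinge", max_iter=10000).fit(X, y)
```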
> However, at present, IsolationForest only fits data in batch even while it
> may be well suited to incremental on-line learning since one could subsample
> recent history and older estimators can be dropped progressively.
What you describe is quite different from what sklearn models
typically
It means that in your script you should print the score on the
validation set instead of the test set.
Then you are allowed to tweak the values in your params dict to see if
you can find values that improve that score.
Once you are confident that you can no longer improve the validation
score via
GridSearchCV will automatically generate the validation sets
internally (this is where the "CV" comes from). So you don't have to
generate a validation set if you decide to use GridSearchCV to select
the best model.
More details here:
http://scikit-learn.org/stable/model_selection.html
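A minimal sketch of the workflow described above (dataset, estimator, and parameter grid are illustrative assumptions):

```python
# GridSearchCV builds its own validation splits internally, so only a
# train/test split is needed up front.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold CV: each candidate is scored on validation folds carved out of
# the training set only.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# The held-out test set is touched only once, at the very end.
test_score = search.score(X_test, y_test)
```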
--
Olivier
> I believe this `arch -i386` only works as a prefix for Python.org Python,
> but I'm happy to be corrected.
Then the following should work:
arch -i386 python -c "import nose; nose.main()" sklearn
Sorry for the late reply,
Before working on this release I would like to automate the wheel
generation process (for the release wheels) in a single repo that will
generate wheels for linux, osx and windows based on
https://github.com/matthew-brett/multibuild
I plan to put that repo under
https://
Thanks Matthew, I had not realized. I will add an appveyor config there
with a dedicated `sklearn-wheels` account so that we don't wait for
the `sklearn-ci` jobs when we are building the release wheels.
--
Olivier
Ok I fixed all the 32 bit Linux & OSX build issues in scikit-learn
master and all wheels builds are green with the multibuild setup:
https://travis-ci.org/MacPython/scikit-learn-wheels
Matthew: would you be interested in having the multibuild repo
extended to also include appveyor configuration fi
The error message mentions gcc. Have you installed some mingw version?
As of now our windows build is only properly tested with the Visual
Studio C++ compiler from appveyor:
https://ci.appveyor.com/project/sklearn-ci/scikit-learn
I have not tested the build with mingwpy in a while (I am not a
wi
Ok for pushing back. Let's try to work on the beta on the week after
euroscipy if we can.
At least all the annoying binary packaging issues are fixed (test
failures for the linux and OSX 32 bit platforms) so the release
process itself should hopefully be painless.
--
Olivier
I am not sure this is exactly the same because we do not center the
data in the TruncatedSVD case (as opposed to the real PCA case where
whitening is the same as calling StandardScaler).
Having an option to normalize the transformed data by sigma seems like
a good idea but we should probably not c
BTW Roman, the examples in your gist would make a great non-regression
test for this new feature. Please feel free to submit a PR.
--
Olivier
I would be +1 to add the dependencies to numpy and scipy on the binary
wheels only.
We don't have the tools yet but this could be implemented in the
auditwheel tool that is already used to generate the manylinux1
compatible wheels for Linux.
--
Olivier
I have not noticed it myself. Let me try to time this email to check:
sent at 3:01pm CEST.
It's already in the archive:
https://mail.python.org/pipermail/scikit-learn/2016-September/000495.html
--
Olivier
If this is your first contribution to the project, I would strongly
suggest to start by contributing a small bug fix or improvement to get
accustomed to the kind of things the core devs expect when reviewing a
PR.
Also please read the contributors guide :
http://scikit-learn.org/dev/developers/co
I don't think anybody is working on this, but you'd better check the
open GitHub pull requests to be sure.
Best,
--
Olivier
I cannot reproduce such a degradation on my machine:
(sklearn-0.17)ogrisel@is146148:~/code/scikit-learn$ python
~/tmp/bench_vectorizer.py
scikit-learn 0.17.1. Numpy 1.11.2. Python 3.5.0 x86_64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 4.033604383468628 seconds, resulting
That's really weird. I don't have a windows machine handy at the
moment. It would be nice if someone else could confirm.
Could you please run the Python profiler on this to see where the time
is spent on the slow setup?
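For reference, a hedged sketch of how one could profile the vectorization step with the stdlib profiler (the tiny corpus here is a stand-in for the 20newsgroups documents):

```python
# Profile TfidfVectorizer.fit_transform and report the functions where
# most of the cumulative time is spent.
import cProfile
import io
import pstats

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox", "jumps over the lazy dog"] * 100

profiler = cProfile.Profile()
profiler.enable()
X = TfidfVectorizer().fit_transform(docs)
profiler.disable()

# Print the 5 most time-consuming functions (by cumulative time).
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```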
--
Olivier
If it works, +1 on my side. I think I have never used `git merge
--rebase` in the past.
--
Olivier
You can indeed derive from BaseEstimator and implement fit, predict
and optionally score.
Here is the documentation for the expected estimator API:
http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects
As this is a linear regression model, you may also want to
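A minimal sketch of such an estimator following the API conventions linked above (the mean-predicting model itself is a deliberately trivial stand-in):

```python
# Derive from BaseEstimator, implement fit/predict; RegressorMixin
# provides score() (R^2) for free.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MeanRegressor(BaseEstimator, RegressorMixin):
    def fit(self, X, y):
        # Learned attributes get a trailing underscore by convention.
        self.mean_ = np.mean(y)
        return self  # fit must return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

est = MeanRegressor().fit([[1], [2], [3]], [1.0, 2.0, 3.0])
pred = est.predict([[10], [20]])
```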
Hi all,
I think we should release 0.18.2 to get some important fixes and make
it easy to release Python 3.6 wheel packages for all the operating
systems using the automated procedure.
I identified a couple of PR to backport to 0.18.X to prepare the
0.18.2 release. Are there any other important rec
In retrospect, making a small 0.19 release is probably a good idea.
I would like to get
https://github.com/scikit-learn/scikit-learn/pull/8002 in before
cutting the 0.19.X branch.
--
Olivier Grisel
Ideally I would like to get it out before April. Instead of setting up
a roadmap, I would rather just identify the bugs that are blockers,
fix only those, and not wait for any feature before cutting 0.19.X.
--
Olivier
This is a non-parametric (aka brute-force) way to check that a model has a
predictive performance significantly higher than chance. For models with
90% accuracy this is useless as we already know for sure that the model is
better than predicting at random. This method is only useful if you have
very
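The check described above is available as `permutation_test_score`; a small sketch (dataset and number of permutations are illustrative assumptions):

```python
# Permute the labels many times and compare the real cross-validated
# score against the null distribution of permuted scores.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = load_iris(return_X_y=True)
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    n_permutations=30, random_state=0)

# A small p-value means the model scores significantly above chance.
```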
It's ok to work on a bug if the original contributor has not replied
to the reviewers comments in a while (e.g. a couple of weeks).
--
Olivier
I don't think we have any model dedicated to this, but it's possible
that expressive non-parametric models such as RF and GBRT or richly
parameterized models such as MLP with a regression loss can do a good
enough job at giving you a point estimate.
--
Olivier
Personally I don't feel like mentoring this year. I would really like
to focus my scikit-learn time on finishing the joblib process
refactoring with Thomas Moreau and the binning / thread-based
parallelization of boosted trees with Guillaume and Raghav.
--
Olivier
Note that SGD is not very good at optimizing finely with a non-smooth
penalty (e.g. l1 or elasticnet). The future SAGA solver is going to be
much better at finding the optimal sparsity support (although this
support is not guaranteed to be stable across re-sampling of the
training set if the traini
From a generalization point of view (test accuracy), the optimal
sparsity support should not matter much though, but it can be helpful
to find the sparsest optimal solution, either for computational
constraints (smaller models with a lower prediction latency) or for
interpretation of the weights (
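A small illustration of the sparsity support point, using LogisticRegression with the saga solver and an l1 penalty (dataset and regularization strength are my own assumptions):

```python
# With an l1 penalty, the saga solver typically zeroes out coefficients
# of uninformative features, yielding an exact sparsity support.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

clf = LogisticRegression(penalty="l1", solver="saga", C=0.1,
                         max_iter=5000).fit(X, y)

# Count exact zeros in the weight vector: the sparsity support.
n_zeros = np.sum(clf.coef_ == 0)
```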
For large enough models (e.g. random forests or gradient boosted trees
ensembles) I would definitely recommend arbitrary integer coding for
the categorical variables.
Try both, use cross-validation and see for yourself.
--
Olivier
Integer coding will indeed make the DT assume an arbitrary ordering
while one-hot encoding does not force the tree model to make that
assumption.
However in practice when the depth of the trees is not too limited (or
if you use a large enough ensemble of trees), the model will have
enough flexibil
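A toy comparison of the two encodings discussed above (the values and shapes are illustrative):

```python
# Arbitrary integer codes vs. one-hot on a single categorical column.
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = [["red"], ["green"], ["blue"], ["green"]]

# One column of integer codes: implies an arbitrary ordering.
X_ord = OrdinalEncoder().fit_transform(X)

# One column per category: no ordering implied.
X_ohe = OneHotEncoder().fit_transform(X).toarray()
```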
Please provide the full traceback. Without it it's impossible to tell
whether the problem is in scikit-learn or xgboost.
Also, please provide a minimal reproduction script as explained in:
http://scikit-learn.org/stable/faq.html#what-s-the-best-way-to-get-help-on-scikit-learn-usage
--
Olivier
Thanks Matthew,
I have uploaded your Python 3.6 wheel for MacOSX to PyPI.
--
Olivier
+1 for recommending to use `pip install --editable .`.
--
Olivier
+1 for changing this example to have error bars represent 5 & 95
percentiles or 25 and 75 percentiles (quartiles).
Or even bootstrapped confidence intervals or the mean feature
importance for each variable. This might be a bit too verbose for an
example though.
> Perhaps more importantly - is a
Hi all,
FYI I have just submitted a 90 min tutorial on scikit-learn to the
EuroScipy CFP. If anybody is interested in co-teaching / TA-ing this
workshop please let me know.
I also plan to stay for the one-day sprint to help people make their
first contribution to the project. Last year we had gre
Hi Tim,
Thanks for the help.
I was planning to do a quick sklearn intro based on slides such as the
first part of:
https://speakerdeck.com/ogrisel/intro-to-scikit-learn-and-whats-new-in-0-dot-17
(but I would like to re-do them in HTML with remark.js as I do here:
https://github.com/ogrisel/decks
You can have a look at the test named "test_agglomerative_clustering" in:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/tests/test_hierarchical.py
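For quick experimentation alongside that test, a minimal usage sketch (the toy blobs are an assumption):

```python
# Agglomerative clustering on two well-separated blobs.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 8])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
```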
--
Olivier
Thanks for this report!
--
Olivier
I am pretty sure this is exactly the kind of presentation that the
EuroScipy audience would enjoy. Please submit!
--
Olivier
2017-07-06 15:10 GMT+02:00 Olivier Grisel :
> (and just make sure that the "components" is a synonym for "dictionary
> atoms" in the literature).
Actually I meant: and just make sure that our documentation states explicitly
that the "components" is a s
I think the documentation is correct. U, a.k.a. "the code" or "the
activations" has shape (n_samples, n_components) and V a.k.a. "the
dictionary" or "the components" has shape (n_components, n_features) in
both cases.
We could use n_components uniformly instead of n_atoms for consistency's
sake (an
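A quick shape check of that convention, using NMF as one such decomposition (the random matrix is illustrative):

```python
# U, the "code" / activations, is (n_samples, n_components);
# V, the "dictionary" / components, is (n_components, n_features).
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.RandomState(0).randn(6, 4))
model = NMF(n_components=2, random_state=0, max_iter=1000)

U = model.fit_transform(X)   # the "code" / activations
V = model.components_        # the "dictionary" / atoms
```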
The name of the algorithm / model would be "L2-penalized linear model
with modified Huber loss trained with Stochastic Gradient Descent".
SVM is traditionally used to describe models that use the hinge loss
only (or sometimes the squared hinge loss too).
Only the log loss can lead to a probabi
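A sketch of the model named above; note that with `loss="modified_huber"`, `predict_proba` is also available (the dataset is an illustrative assumption):

```python
# L2-penalized linear model with modified Huber loss trained with SGD.
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
clf = SGDClassifier(loss="modified_huber", penalty="l2",
                    random_state=0).fit(X, y)

# modified_huber (like log loss) supports probability estimates.
proba = clf.predict_proba(X[:3])
```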
Please use this mailing list if you have questions specifically about
scikit-learn. Otherwise you should ask a specific question on
an NLP and data science community platform such as:
https://datascience.stackexchange.com/questions/tagged/nlp
or if you have a programming related question
If this is the first time you contribute, please make sure to
carefully read the contributors guide till the end:
http://scikit-learn.org/stable/developers/contributing.html
In particular, make sure to follow the estimators API conventions for
your PR to get a chance to be reviewed. In particular
The new release is coming and we are seeking feedback from beta testers!
pip install scikit-learn==0.19b2
conda-forge packages should follow in the coming hours / days.
Note that many models have changed behaviors and some things have been
deprecated, see the full changelog at:
http://scikit-
I believe so even though it's always better to check in the code to see how
this parameter is actually used.
--
Olivier
I have no idea whether the randomized SVD method is supposed to work for
complex data or not (from a mathematical point of view). I think that all
scikit-learn estimators assume real data (or integer data for class labels)
and our input validation utilities will cast numeric values to float64 by
default.
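A hedged illustration of that validation behavior with `check_array` (real float input passes through; complex input is rejected):

```python
# check_array is the shared input validation helper used by estimators.
import numpy as np
from sklearn.utils import check_array

X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_checked = check_array(X)  # real float64 data passes through

# Complex data is rejected with a ValueError.
try:
    check_array(np.array([[1 + 1j]]))
    complex_ok = True
except ValueError:
    complex_ok = False
```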
Grab it with pip or conda !
Quoting the release highlights from the website:
We are excited to release a number of great new features including
neighbors.LocalOutlierFactor for anomaly detection,
preprocessing.QuantileTransformer for robust feature transformation, and
the multioutput.ClassifierCh
+1
+1 for python.org if they accept this kind of mailing list.
Congrats to all three of you! Thank you very much for your contributions
and in particular in reviewing contributions by others.
--
Olivier
> Do I need to write object oriented or are functions also ok?
If you want to contribute an implementation as a new project on scikit-learn
contrib, you should be careful to follow the scikit-learn estimators API:
http://scikit-learn.org/dev/developers/contributing.html#apis-of-scikit-learn-object
Maybe update your version of Cython?
--
Olivier
Interesting project!
BTW, do you know about dask-ml [1]?
It might be interesting to think about generalizing the input validation of
fit and predict / transform as a private method of the BaseEstimator class
instead of directly calling into sklearn.utils.validation functions, so as
to make it eas
Have you had a look at BIRCH?
http://scikit-learn.org/stable/modules/clustering.html#birch
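A minimal Birch usage sketch (the toy blobs and `n_clusters` value are assumptions):

```python
# Birch builds a CF-tree, so it scales to large datasets and can also
# absorb data incrementally via partial_fit.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])

brc = Birch(n_clusters=2).fit(X)
labels = brc.predict(X)
```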
--
Olivier
It looks nice, thanks for sharing.
Do you plan to couple the active learner with a UX-optimized labeling
interface (for instance with a react.js or similar frontend and a flask or
similar backend)?
--
Olivier
Hi everyone!
Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a
scikit-learn core developer!
Joris is one of the maintainers of the pandas project and recently
contributed many new great PRs to scikit-learn (notably the
ColumnTransformer and a refactoring of the categorical var
That's a cool trick but I am worried it would render our API too
"frameworky" for my taste.
I think the FunctionTransformer is enough:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
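A minimal FunctionTransformer sketch (`np.log1p` is just an example function):

```python
# Wrap a plain function as a stateless transformer usable in a Pipeline.
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_tf = FunctionTransformer(np.log1p)
X = np.array([[0.0, 1.0], [3.0, 7.0]])
X_out = log_tf.fit_transform(X)
```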
This looks like a very useful project.
There is also scikits-bootstraps [1]. Personally I prefer the flat package
namespace of resample (I am not a fan of the 'scikits' namespace package)
but I still think it would be great to contact the author to know if he
would be interested in joining efforts
I believe it would fit in sklearn-contrib even if it's more for statistical
inference rather than machine learning style prediction.
Others might disagree.
Anyways, joining efforts to improve documentation, CI, testing and so on is
always a good thing for your future users.
--
Olivier
Joy !
On Wed 26 Sept 2018 at 23:02, Joel Nothman wrote:
> And for those interested in what's in the pipeline, we are trying to draft
> a roadmap...
> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018
>
> But there are no doubt many features that are absent there too.
>
Indeed, i
>
>
> > I think model serialization should be a priority.
>
There is also the ONNX specification that is gaining industrial adoption
and that already includes open source exporters for several families of
scikit-learn models:
https://github.com/onnx/onnxmltools
--
Olivier
You might also want to have a look at https://github.com/onnx/onnxmltools
although I am not sure if there are RF optimized ONNX runtimes at this
point.
--
Olivier
We can also do Paris in April / May or June if that's ok with Joel and
better for Andreas.
I am teaching on Fridays from end of January to March. But I can miss half
a day of sprint to teach my class.
--
Olivier
+1 on the idea in general (and to enforce this on new classes / params).
+1 to be conservative and not break existing code.
On Tue 20 Nov 2018 at 21:09, Joris Van den Bossche
<jorisvandenboss...@gmail.com> wrote:
> On Sun 18 Nov 2018 at 11:14, Joel Nothman wrote:
>
>> I think we're all ag
Maybe a subset of the criteo TB dataset?
Congrats and welcome Adrin!
--
Olivier
They are very different statistical models from a mathematical point of
view. See the online scikit-learn documentation or reference text books
such as "Elements of Statistical Learning" for more details.
In practice, linear models tend to be faster to fit on large data,
especially when the number
You should probably just "conda update scikit-learn":
scikit-learn 0.20.1 is available on the official anaconda channel for all
supported operating systems:
https://anaconda.org/anaconda/scikit-learn
--
Olivier
> >> > ...people say that they
> >> > might be available at this time. Is it good for many people, or
> >> > should we organize a doodle?
> >> >
> >> > G
> >> >
> >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote:
I would also add generalizing early stopping options to most estimators.
This is a bit related to Joel's point on max_iter consistency in
LogisticRegression.
--
Olivier
+1
\o/
A quick bugfix release fixing a critical regression in the computation
of Euclidean distances that silently returned incorrect values.
This release also includes other bugfixes listed in the changelog:
https://scikit-learn.org/0.21/whats_new.html#version-0-21-2
The PyPI.org wheels and conda-forg
I think it's ok to do as you said.
--
Olivier
How many cores do you have on this machine?
joblib.cpu_count()
You have to use a dedicated framework to distribute the computation on a
cluster like your Cray system.
You can use MPI, or dask with dask-jobqueue, but then you also need to run
parallel algorithms that are efficient when running in a distributed setting
with a high cost of communication between distributed workers.
The core developers of Scikit-learn have recently voted to welcome
Jérémie Du Boisberranger to the team, in recognition of his efforts
and trustworthiness as a contributor. Jérémie works at Inria Saclay
and is supported by the scikit-learn initiative at Fondation Inria and
its partners.
Congratula
+1 for last Monday of each month. How about the duration? 1h max + breakout
in smaller groups on more specific topics if needed?
On Thu 18 Jul 2019 at 08:29, Adrin wrote:
>
> BTW, where was the meeting for last Monday organized? I don't think I knew it
> was happening.
I do not understand what you are referring to. My email was about the
organization of future meetings as suggested by Andreas.
I just found this planner to give it a try:
https://www.timeanddate.com/worldclock/meetingtime.html?day=29&month=7&year=2019&p1=240&p2=33&p3=37&p4=179&iv=0
(Berlin and Paris are in the same timezone so I did not add Berlin
separately).
It's going to be challenging to find a timeslot for everybody. Th
On Tue 5 Nov 2019 at 12:46, Gael Varoquaux wrote:
>
> On Mon, Nov 04, 2019 at 10:14:26PM -0700, Andreas Mueller wrote:
> > Should we re-purpose the existing twitter account or make a new one?
> > https://twitter.com/scikit_learn
>
> I think that we should repurpose it:
>
> - Make a "scikit-lea
On Fri 15 Nov 2019 at 17:31, Nicolas Hug wrote:
>
> What's the status of this? Would be great to have it for the 0.22 release :) !
>
+1 and we could also announce / thank / RT new sources of funding (CZI
and Fujitsu).
I am not sure who has the rights to manage the twitter account. I just
sent a password reset request to "sc**@a..***"
I suspect that this is Andreas but I am not so sure.
Thanks Tom, let me try to configure this.
--
Olivier
Ok, I have sent some invites.
I would like to create @sklearn_commits instead of
@scikit_learn_commits, which is too long for my taste. Any opinion?
--
Olivier
On Fri 22 Nov 2019 at 17:24, Gael Varoquaux wrote:
>
> > I would like to create @sklearn_commits instead of
> > @scikit_learn_commits that is too long to my taste. Any opinion?
>
> Some people do not make the link between "sklearn" and "scikit-learn" :)
People who are likely to follow a twitt
I have created the https://twitter.com/sklearn_commits twitter account.
I have applied to make this account a "Twitter Developer" account to
be able to use https://github.com/filearts/tweethook to register it as
a webhook for the main scikit-learn github repo.
Once ready, I will remove the old we
Alright, it seems that I can create twitter apps (and generate API
tokens) for the @sklearn_commits account; however,
https://github.com/filearts/tweethook does not work as it relies on a
third-party webtask.io service that does not accept any new
subscriptions...
I am looking for an alternative way
It might actually be possible to use github actions with
https://github.com/xorilog/twitter-action for instance. I will give it
a try with a test repo.
--
Olivier
Alright, I have configured the new github action for the tweets on
@sklearn_commits:
https://github.com/scikit-learn/scikit-learn/pull/15758
I tested it from my repo and it worked fine (I deleted the test tweet though).
We can do the switch as soon as this PR is merged.
--
Olivier
Ok the twitter accounts are now switched:
https://twitter.com/scikit_learn/status/1201794032650932224
The notifications for commits pushed to master are live:
https://twitter.com/sklearn_commits
Ready for the release :)
--
Olivier
Indeed I do not see the "circle add" button in the tweetdeck UI anymore.
But it's ok not to prepare the threads before tweeting the first
tweet. We can build the thread progressively by publishing the first
tweet and then replying one tweet after the other by hitting the reply
button of the last p
This is a minor release that includes many bug fixes and solves a
number of packaging issues with Windows wheels in particular. Here is
the full changelog:
https://scikit-learn.org/stable/whats_new/v0.22.html#version-0-22-1
The conda package will follow soon (hopefully).
Thank you very much to a
I get a message for an invalid meeting id.
--
Olivier
Congrats on the release! And thank you very much to all those who were
involved in making it happen (and Adrin in particular)!
--
Olivier