https://github.com/scikit-learn/scikit-learn/pull/2848
The current state of the implementation was explained in the PR comment below.
On Mon, Feb 10, 2014 at 3:14 PM, Olivier Grisel wrote:
> 2014-02-08 2:25 GMT-08:00 Arnaud Joly :
> >
> > I have looked a bit at your code and it's a great start. It would be
> > easier to help you if you open a pull request.
2014-02-08 2:25 GMT-08:00 Arnaud Joly :
>
> I have looked a bit at your code and it’s a great start. It would be easier
> to help you if you open a pull request.
+1. Don't hesitate to open an early PR with the "[WIP]" marker as a
title prefix to emphasize that you don't consider it finished work yet.
> 1- I removed a "with nogil" statement [2]. Is there a way to keep it?
In order to keep it, you can directly pass the numpy arrays indices, indptr
and data, along with nnz, to the splitter. Detecting the input type could be
done easily in the Python code.
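For illustration, a minimal sketch of that idea in plain Python; the actual code is Cython, and DenseSplitter, SparseSplitter and make_splitter are hypothetical names, not scikit-learn internals. The Python layer detects the input type and hands the raw CSC buffers to the splitter, whose hot loop can then run on typed memoryviews inside a "with nogil" block:

    import numpy as np
    from scipy.sparse import csc_matrix, issparse

    class DenseSplitter:
        # Hypothetical stand-in for the existing dense splitter.
        def __init__(self, X):
            self.X = X

    class SparseSplitter:
        # Hypothetical stand-in: holds the raw CSC buffers so the Cython
        # hot loop can iterate over typed memoryviews without the GIL.
        def __init__(self, data, indices, indptr, nnz):
            self.data, self.indices, self.indptr, self.nnz = data, indices, indptr, nnz

    def make_splitter(X):
        # Input-type detection stays at the Python level.
        if issparse(X):
            X = csc_matrix(X)  # splitters scan feature columns, hence CSC
            return SparseSplitter(X.data, X.indices, X.indptr, X.nnz)
        return DenseSplitter(np.ascontiguousarray(X, dtype=np.float64))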
> 2- Toy sparse input test is failing, and I think
Arnaud,
I added an issparse attribute to the Splitter base class.
By doing so, I think I managed to introduce sparse support without needing to
replicate the Splitters' business-logic code.
I'm working on this branch [1].
I have two questions:
1- I removed a "with nogil" statement [2]. Is there a way to keep it?
I think that I would go for the option that minimizes the amount of code
duplication.
I would probably start with 2. Since we no longer pickle the Splitter and the
Criterion, the constructor arguments could be used to pass the X and y
matrices.
Cheers,
Arnaud
On 04 Feb 2014, at 17:38, Felipe Eltermann wrote:
>> How much code in our current implementation depends on the data
>> representation?
> Not much actually. It now basically boils down to simply writing a new
> splitter object. Everything else remains the same. So basically, I would
> say that it amounts to ~300 lines of Cython (out of the 2300 lines in our
> current implementation).
Here are some results on the 20 newsgroups dataset:
Classifier       train-time   test-time   error-rate
5-nn                0.0047s    13.6651s       0.5916
random forest     263.3146s     3.9985s       0.2459
sgd                 0.2265s     0.0657s
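As a point of reference, here is a sketch of how such a benchmark could be run; the exact estimators and parameters behind the numbers above are not given in the thread, so the settings below are guesses, and the forest needs a scikit-learn version that includes the sparse support discussed here:

    from time import time
    from sklearn.datasets import fetch_20newsgroups_vectorized
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import SGDClassifier
    from sklearn.neighbors import KNeighborsClassifier

    train = fetch_20newsgroups_vectorized(subset="train")
    test = fetch_20newsgroups_vectorized(subset="test")

    for name, clf in [("5-nn", KNeighborsClassifier(n_neighbors=5)),
                      ("random forest", RandomForestClassifier(n_estimators=100)),
                      ("sgd", SGDClassifier())]:
        t0 = time()
        clf.fit(train.data, train.target)
        train_time = time() - t0
        t0 = time()
        error_rate = 1.0 - clf.score(test.data, test.target)
        test_time = time() - t0
        print(f"{name:<15}{train_time:>9.4f}s {test_time:>9.4f}s {error_rate:>9.4f}")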
On Wed, Jan 22, 2014 at 9:48 AM, Mathieu Blondel wrote:
>
> Something I was wondering is whether sparse support in decision trees
> would actually be useful. Do decision trees (or ensembles of them like
> random forests) work better than linear models for high-dimensional data?
>
I share your point
2014/1/31 Felipe Eltermann :
> OK, I finished reading _tree.pyx and now I understand the CSC and dense
> matrix formats.
> I have a general view of what needs to be implemented.
>
> I've never seriously used Cython. What are you guys using as a development
> environment?
Just a good text editor and a
OK, I finished reading _tree.pyx and now I understand the CSC and dense
matrix formats.
I have a general view of what needs to be implemented.
I've never seriously used Cython. What are you guys using as a development
environment? How do you easily code/compile/test?
On Thu, Jan 23, 2014 at 11:55 AM, Olivier Grisel wrote:
Hi Arnaud,
I have already synced my master branch with the upstream repository and
checked what the discrepancies were. There were some differences between
the implementations (I have reused much of the code of the forests, so it was
not that different). I couldn't implement the Cython sections of this because I
On 23 Jan 2014, at 07:18, Maheshakya Wijewardena wrote:
> Arnaud,
> I've gone through those messages and I've already started working on patches.
> Last year I did a project for a module at our university. It was to
> implement Bagging in Scikit-learn. As Gilles had already begun that, I
2014/1/23 Felipe Eltermann :
> I'm testing different classifiers for a BoW problem, and last week I was
> disappointed that I couldn't use scikit's DecisionTree.
> However, using NaiveBayes was awesome! Thanks for this great piece of
> software.
> So, if you are planning to add support for scipy
I'm testing different classifiers for a BoW problem, and last week I was
disappointed that I couldn't use scikit's DecisionTree.
However, using NaiveBayes was awesome! Thanks for this great piece of
software.
So, if you are planning to add support for scipy sparse matrices in
DecisionTree, I'd like
> How much code in our current implementation depends on the data
> representation?
Not much actually. It now basically boils down to simply writing a new
splitter object. Everything else remains the same. So basically, I would
say that it amounts to ~300 lines of Cython (out of the 2300 lines in our
current implementation).
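To make that concrete, here is a hypothetical skeleton of the splitter contract; the method names are illustrative, not the actual Cython internals:

    class BaseSplitter:
        # The tree builder only talks to this interface, so a CSC-backed
        # implementation can be swapped in without touching the rest.

        def init(self, X, y):
            # Bind the training data: a dense ndarray here, or the CSC
            # components (data, indices, indptr) in a sparse subclass.
            raise NotImplementedError

        def node_reset(self, start, end):
            # Restrict attention to the samples [start:end) that reached
            # the current node.
            raise NotImplementedError

        def node_split(self):
            # Search the best (feature, threshold) split for this node.
            raise NotImplementedError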
> I will try using sparse data on 20newsgroups data and let you know the
> results.
What I was suggesting is to densify the News20 dataset (using a subset of
the features so that it fits in memory) and try it on our current
implementation. Of course it will be really slow, but the goal is to
evaluate
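A sketch of that experiment, assuming "a subset of the features" means keeping the k most frequent ones (the top-10k figure comes from Mathieu's earlier message):

    import numpy as np
    from sklearn.datasets import fetch_20newsgroups_vectorized

    X = fetch_20newsgroups_vectorized(subset="train").data  # sparse CSR, ~130k features

    k = 10_000                         # top 10k, as suggested in the thread
    doc_freq = X.getnnz(axis=0)        # number of documents using each feature
    top_k = np.argsort(doc_freq)[-k:]  # indices of the k most frequent features
    X_dense = X[:, top_k].toarray()    # densified subset, ~0.9 GB as float64
    print(X_dense.shape, f"{X_dense.nbytes / 1e9:.2f} GB")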
2014/1/23 Maheshakya Wijewardena :
> Hi
>
> I think that using sparse data we can enhance the descriptiveness of the data
> while keeping it smaller compared to the dense data, without losing
> information.
I don't understand what you mean by "sparse data we can enhance the
descriptiveness of the
Hi
I think that using sparse data we can enhance the descriptiveness of the
data while keeping it smaller compared to the dense data, without losing
information. Isn't that what trees generally need for improved accuracy?
I will try using sparse data on 20newsgroups data and let you know the
results.
Hi all,
I am using random forests to do deep learning/feature learning with the
RandomForestEmbedding in scikit-learn. It would be cool to apply a random
forest to the learned features and induce a higher-level representation.
I have actually tried the naive approach of densifying the output
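The estimator is called RandomTreesEmbedding in scikit-learn; here is a sketch of the pipeline being described, on made-up toy data. The toarray() call is the naive densification step that sparse tree support would make unnecessary:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, RandomTreesEmbedding

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    embedding = RandomTreesEmbedding(n_estimators=10, random_state=0)
    X_leaves = embedding.fit_transform(X)  # sparse one-hot leaf indicators

    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_leaves.toarray(), y)         # densify to feed the second forest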
Mathieu,
I have no experience with forests on sparse data, nor have I seen much work
on the topic. I would be curious to investigate, however; there may be
problems for which this is useful. I know that Arnaud tried forests on
(densified) 20newsgroups and it seems to work well actually.
In particular
Hi,
Something I was wondering is whether sparse support in decision trees would
actually be useful. Do decision trees (or ensembles of them like random
forests) work better than linear models for high-dimensional data?
It would be nice to take the News20 dataset, pre-select the top 10k
features (
Hi Maheshakya,
I could be one of the mentors for this GSOC.
If you want to apply for a GSOC, I think that this message from Gael and
Mathieu is worth reading:
http://sourceforge.net/mailarchive/message.php?msg_id=31864881
Best,
Arnaud
On 22 Jan 2014, at 06:13, Maheshakya Wijewardena wrote:
Hi,
I have been using the scikit-learn OneHotEncoder for data encoding, and the
resulting array is supported by only a few models, such as logistic
regression, SVC, etc. When I convert those sparse matrices to dense matrices
with a list comprehension or the toarray() function, the resulting arrays
become too large
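A sketch of the blow-up being described, on made-up high-cardinality data:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    rng = np.random.RandomState(0)
    X = rng.randint(0, 1000, size=(10_000, 20))  # 20 columns, ~1000 categories each

    X_sparse = OneHotEncoder().fit_transform(X)  # sparse output: ~20 nonzeros per row
    print(X_sparse.shape, X_sparse.nnz)

    X_dense = X_sparse.toarray()                 # same matrix, dense
    print(f"{X_dense.nbytes / 1e9:.2f} GB")      # ~1.6 GB for this toy example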