Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-02-11 Thread Felipe Eltermann
https://github.com/scikit-learn/scikit-learn/pull/2848 The current state of implementation was explained in the PR comment. On Mon, Feb 10, 2014 at 3:14 PM, Olivier Grisel wrote: > 2014-02-08 2:25 GMT-08:00 Arnaud Joly : > > > > I have looked a bit at your code and it's a great start. It would b

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-02-10 Thread Olivier Grisel
2014-02-08 2:25 GMT-08:00 Arnaud Joly : > > I have looked a bit at your code and it’s a great start. It would be easier > to help you if you open a pull request. +1. Don't hesitate to open an early PR with the "[WIP]" marker as a title prefix to emphasize that you don't consider it finished work y

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-02-08 Thread Arnaud Joly
> 1- I removed a "with nogil" statement [2]. Is there a way to keep it? In order to keep it, you can pass the NumPy arrays indices, indptr and data, along with nnz, directly to the splitter. Detecting the input type could be done easily in the Python code. > 2- Toy sparse input test is failing, and I think
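
To make the suggestion above concrete, here is a minimal pure-Python sketch of how the three flat arrays of a CSC matrix (indices, indptr, data) describe each column. The helper names `column_nonzeros` and `column_dense` and the toy matrix are hypothetical, chosen only for illustration; they are not taken from the branch under discussion. In real code the arrays would come from a scipy.sparse CSC matrix after checking the input type on the Python side.

```python
# Hypothetical toy example: a 3 samples x 3 features matrix in CSC form.
# Dense equivalent:
#   [[1, 0, 2],
#    [0, 0, 3],
#    [4, 0, 0]]
data = [1.0, 4.0, 2.0, 3.0]   # nonzero values, stored column by column
indices = [0, 2, 0, 1]        # row (sample) index of each nonzero value
indptr = [0, 2, 2, 4]         # column j lives in data[indptr[j]:indptr[j + 1]]


def column_nonzeros(data, indices, indptr, j):
    """Return (row, value) pairs for the explicitly stored entries of column j."""
    start, end = indptr[j], indptr[j + 1]
    return list(zip(indices[start:end], data[start:end]))


def column_dense(data, indices, indptr, j, n_samples):
    """Rebuild column j as a dense list; entries not stored are zeros."""
    column = [0.0] * n_samples
    for row, value in column_nonzeros(data, indices, indptr, j):
        column[row] = value
    return column


print(column_nonzeros(data, indices, indptr, 0))
print(column_dense(data, indices, indptr, 2, n_samples=3))
```

Because these are plain arrays rather than Python objects, a Cython splitter that receives them as typed buffers would not need to touch the interpreter while scanning a column, which is the reason passing them directly is compatible with a nogil section.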

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-02-07 Thread Felipe Eltermann
Arnaud, I added an issparse attribute to the Splitter base class. Doing so, I think I managed to introduce sparse support without the need to replicate the Splitters' business logic code. I'm working on this branch [1]. I have two questions: 1- I removed a "with nogil" statement [2]. Is there a way to ke

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-02-05 Thread Arnaud Joly
I think that I would go for the option that minimizes the amount of code duplication. I would probably start with 2. Since we no longer pickle the Splitter and Criterion, the constructor arguments could be used to pass the X and the y matrix. Cheers, Arnaud On 04 Feb 2014, at 17:38, Feli

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-02-04 Thread Felipe Eltermann
>> How much code in our current implementation depends on the data representation? > Not much actually. It now basically boils down to simply writing a new splitter object. Everything else remains the same. So basically, I would say that it amounts to ~300 lines of Cython (out of the 2300 lines in o

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-31 Thread Arnaud Joly
Here, some results on the 20 newsgroups dataset:

Classifier      train-time   test-time   error-rate
5-nn              0.0047s     13.6651s     0.5916
random forest   263.3146s      3.9985s     0.2459
sgd               0.2265s      0.0657s

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-31 Thread Paolo Losi
On Wed, Jan 22, 2014 at 9:48 AM, Mathieu Blondel wrote: > > Something I was wondering is whether sparse support in decision trees > would actually be useful. Do decision trees (or ensembles of them like > random forests) work better than linear models for high-dimensional data? > I share your poin

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-31 Thread Olivier Grisel
2014/1/31 Felipe Eltermann : > OK, I finished reading _tree.pyx and now I understand CSC dense matrix > format. > I have a general view of what is necessary to be implemented. > > I've never seriously used Cython. What are you guys using as development > environment? Just a good text editor and a

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-31 Thread Felipe Eltermann
OK, I finished reading _tree.pyx and now I understand CSC dense matrix format. I have a general view of what is necessary to be implemented. I've never seriously used Cython. What are you guys using as development environment? How to easily code/compile/test? On Thu, Jan 23, 2014 at 11:55 AM, Ol

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-24 Thread Maheshakya Wijewardena
Hi Arnaud, I have already synced my master branch with the upstream repository and checked what the discrepancies were. There were some differences between implementations. (I have reused much of the code of forests so it was not that different) I couldn't implement Cython sections of this because I

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-24 Thread Arnaud Joly
On 23 Jan 2014, at 07:18, Maheshakya Wijewardena wrote: > Arnaud, > I've gone through those messages and I've already started working on patches. > Last year I've done a project of a module in our university. It was to > implement Bagging in Scikit-learn. As Gilles had already begun that, I

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-23 Thread Olivier Grisel
2014/1/23 Felipe Eltermann : > I'm testing different classifiers for a BoW problem and last week I got > disappointed that I couldn't use scikit's DecisionTree. > However, using NaiveBayes was awesome! Thanks for this great piece of > software. > So, if you are planning to add the support for scipy

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-23 Thread Felipe Eltermann
I'm testing different classifiers for a BoW problem and last week I got disappointed that I couldn't use scikit's DecisionTree. However, using NaiveBayes was awesome! Thanks for this great piece of software. So, if you are planning to add the support for scipy sparse matrix on DecisionTree, I'd lik

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-23 Thread Gilles Louppe
> How much code in our current implementation depends on the data representation? Not much actually. It now basically boils down to simply writing a new splitter object. Everything else remains the same. So basically, I would say that it amounts to ~300 lines of Cython (out of the 2300 lines in our
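
As a rough illustration of what a splitter has to do differently on sparse input, the sketch below partitions samples on one feature of a CSC matrix, treating every entry that is not stored as zero. The function name `partition_on_feature` and the toy arrays are hypothetical and only meant to show the idea; this is not the scikit-learn splitter, and for simplicity it loops over all samples, whereas an efficient implementation would try to visit only the stored nonzeros.

```python
# Hypothetical toy CSC matrix (3 samples x 3 features):
#   [[1, 0, 2],
#    [0, 0, 3],
#    [4, 0, 0]]
data = [1.0, 4.0, 2.0, 3.0]
indices = [0, 2, 0, 1]
indptr = [0, 2, 2, 4]


def partition_on_feature(data, indices, indptr, j, threshold, n_samples):
    """Split sample indices on feature j: left if value <= threshold, else right."""
    start, end = indptr[j], indptr[j + 1]
    # Map sample index -> stored value for column j; missing keys mean zero.
    explicit = dict(zip(indices[start:end], data[start:end]))
    left, right = [], []
    for sample in range(n_samples):
        value = explicit.get(sample, 0.0)
        if value <= threshold:
            left.append(sample)
        else:
            right.append(sample)
    return left, right


left, right = partition_on_feature(data, indices, indptr, j=2,
                                   threshold=2.5, n_samples=3)
print(left, right)
```

The rest of the tree-building logic (impurity criterion, tree growth, prediction) only sees the resulting sample partitions, which is consistent with the point above that mostly the splitter depends on the data representation.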

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-23 Thread Mathieu Blondel
> I will try using sparse data on 20newsgroups data and let you know the results. What I was suggesting is to densify the News20 dataset (using a subset of the features so that it fits in memory) and try it on our current implementation. Of course it will be really slow but the goal is to evaluate

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-23 Thread Olivier Grisel
2014/1/23 Maheshakya Wijewardena : > Hi > > As I think, using sparse data we can enhance the descriptiveness of the data > while keeping it smaller compared to the dense data without losing > information. I don't understand what you mean by "sparse data we can enhance the descriptiveness of the

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-22 Thread Maheshakya Wijewardena
Hi As I think, using sparse data we can enhance the descriptiveness of the data while keeping it smaller compared to the dense data without losing information. Isn't that what trees generally need for improved accuracy? I will try using sparse data on 20newsgroups data and let you know the res

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-22 Thread Caleb
Hi all, I am using random forest to do deep learning/feature learning using the RandomForestEmbedding in scikit-learn. It would be cool to apply the random forest to the learned features and induce a higher-level representation. I have actually tried the naive approach of densifying the output

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-22 Thread Gilles Louppe
Mathieu, I have no experience with forests on sparse data, nor have I seen much work on the topic. I would be curious to investigate, however; there may be problems for which this is useful. I know that Arnaud tried forests on (densified) 20newsgroups and it seems to work well actually. In partic

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-22 Thread Mathieu Blondel
Hi, Something I was wondering is whether sparse support in decision trees would actually be useful. Do decision trees (or ensembles of them like random forests) work better than linear models for high-dimensional data? It would be nice to take the News20 dataset, pre-select the top 10k features (

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-22 Thread Arnaud Joly
Hi Maheshakya, I could be one of the mentors for this GSOC. If you want to apply for a GSOC, I think that this message from Gael and Mathieu is worth reading http://sourceforge.net/mailarchive/message.php?msg_id=31864881 Best, Arnaud On 22 Jan 2014, at 06:13, Maheshakya Wijewardena wrote:

[Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-21 Thread Maheshakya Wijewardena
Hi, I have been using the scikit-learn one-hot encoder for data encoding, and the resulting sparse array is supported by only a few models such as logistic regression, SVC, etc. When I convert those sparse matrices to dense matrices with a list comprehension or the toarray() function, the resulting arrays become too large
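
To see why densifying one-hot encoded data becomes impractical, here is a back-of-the-envelope sketch comparing the storage of a dense array with a CSR matrix holding the same values. The dataset sizes are hypothetical, picked only for illustration, and the byte widths assume 8-byte values and 4-byte indices.

```python
def dense_bytes(n_samples, n_features, value_bytes=8):
    """Memory of a dense array: every cell is stored, zeros included."""
    return n_samples * n_features * value_bytes


def csr_bytes(n_samples, nnz, value_bytes=8, index_bytes=4):
    """Memory of a CSR matrix: one value and one column index per nonzero,
    plus a row-pointer array with n_samples + 1 entries."""
    return nnz * (value_bytes + index_bytes) + (n_samples + 1) * index_bytes


# Hypothetical dataset: 100,000 samples with 10 categorical columns that
# one-hot encode into 5,000 binary features in total.
n_samples = 100_000
n_categorical = 10
n_features = 5_000
nnz = n_samples * n_categorical  # one-hot: exactly one nonzero per original column

print(dense_bytes(n_samples, n_features))  # 4,000,000,000 bytes (about 4 GB)
print(csr_bytes(n_samples, nnz))           # 12,400,004 bytes (about 12 MB)
```

Under these assumptions the dense copy is roughly 300 times larger than the sparse one, which matches the experience described above of dense conversions becoming too large.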