Re: [Scikit-learn-general] Multi-target regression

2014-09-08 Thread Philipp Singer
Afaik, the multi-response regression forests in sklearn will consider the correlation between features. -- Flavio On Fri, Sep 5, 2014 at 11:03 AM, Philipp Singer kill...@gmail.com wrote: Hey! I am currently working with data having multiple outcome variables. So for example, my
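The multi-output forest mentioned above can be sketched as follows; the data, shapes, and target construction are illustrative, not from the thread:

```python
# Sketch: a single RandomForestRegressor fit on a 2-column target matrix
# models both outputs jointly (synthetic, correlated targets).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
# Two correlated targets derived from the same inputs.
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 1]])

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, Y)
pred = forest.predict(X[:3])
print(pred.shape)  # one column per target
```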

[Scikit-learn-general] Sparse Random Projection negative weights

2014-08-08 Thread Philipp Singer
Hi, I asked a question about the sparse random projection a few days ago, but thought I should start a new topic regarding my current problem. I am calculating TFIDF weights for my text documents and then calculate cosine similarity between documents for determining the similarity between

Re: [Scikit-learn-general] Sparse Random Projection negative weights

2014-08-08 Thread Philipp Singer
Just another remark regarding this: I guess I cannot circumvent the negative cosine similarity values. Maybe LSA (TruncatedSVD) is a better approach? On 08.08.2014 at 10:35, Philipp Singer kill...@gmail.com wrote: Hi, I asked a question about the sparse random projection a few days ago
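The LSA alternative raised here can be sketched as a small pipeline; the toy corpus and component count are hypothetical:

```python
# Sketch: LSA (TruncatedSVD) on tf-idf, followed by L2 normalization so that
# plain dot products between rows are cosine similarities.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

docs = ["the cat sat", "the cat ran", "dogs bark loudly", "a dog barked"]
lsa = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2, random_state=0),
                    Normalizer(copy=False))
X = lsa.fit_transform(docs)
sims = X @ X.T  # cosine similarities; small negatives can still occur
print(sims.shape)
```

Note that, unlike NMF, TruncatedSVD components are not sign-constrained, so LSA reduces but does not strictly forbid negative similarities.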

Re: [Scikit-learn-general] Sparse Random Projection negative weights

2014-08-08 Thread Philipp Singer
Arnaud Joly a.j...@ulg.ac.be wrote: Have you tried to increase the number of components, the epsilon parameter, or the density of the SparseRandomProjection? Have you tried to normalise X prior to the random projection? Best regards, Arnaud On 08 Aug 2014, at 12:19, Philipp Singer kill

[Scikit-learn-general] Sparse Random Projection Issue

2014-08-04 Thread Philipp Singer
Hi all, I am currently trying to calculate all-pairs similarity between a large number of text documents. I am using a TfidfVectorizer for feature generation and then want to calculate cosine similarity between the pairs. Hence, I am calculating X * X.T between the L2 normalized matrices. As
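The X * X.T trick described above can be sketched like this on a toy corpus; it relies on TfidfVectorizer's default L2 row normalization:

```python
# Sketch: all-pairs cosine similarity as a sparse product X @ X.T, valid
# because tf-idf rows have unit L2 norm by default (norm='l2').
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apples and oranges", "oranges and lemons", "cars and trucks"]
X = TfidfVectorizer().fit_transform(docs)  # CSR matrix, unit-norm rows
S = X @ X.T  # sparse similarity matrix; diagonal is 1.0
print(S.toarray().round(2))
```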

Re: [Scikit-learn-general] Sparse Random Projection Issue

2014-08-04 Thread Philipp Singer
On 04.08.2014 at 20:54, Lars Buitinck larsm...@gmail.com wrote: 2014-08-04 17:39 GMT+02:00 Philipp Singer kill...@gmail.com: Apart from that, does anyone know a solution for how I can efficiently calculate the resulting matrix Y = X * X.T? I am currently thinking about using PyTables

Re: [Scikit-learn-general] Sparse Random Projection Issue

2014-08-04 Thread Philipp Singer
On 04.08.2014 at 22:14, Philipp Singer kill...@gmail.com wrote: On 04.08.2014 at 20:54, Lars Buitinck larsm...@gmail.com wrote: 2014-08-04 17:39 GMT+02:00 Philipp Singer kill...@gmail.com: Apart from that, does anyone know a solution for how I can efficiently calculate the resulting

[Scikit-learn-general] TFIDF question

2013-11-29 Thread Philipp Singer
Hi there, I am currently working with the TfidfVectorizer provided by scikit learn. However, I just came up with a problem/question. In my case I have around 20 very long documents. Some terms in these documents occur much, much more frequently than others. From my pure intuition, these terms

Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Philipp Singer
Alright! By removing the +1 the results seem much more legitimate. Also, the sublinear transformation makes sense. However, why use min_df=2 if I am worried about very common words? - Original message - From: Lars Buitinck [mailto:larsm...@gmail.com] Sent: Friday, 29 November 2013
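The two vectorizer options discussed in this thread can be sketched as follows; the corpus and settings are illustrative, not prescriptive:

```python
# Sketch: damping very frequent terms with sublinear tf scaling, and dropping
# near-hapax terms with min_df.
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(sublinear_tf=True,  # use 1 + log(tf) instead of raw tf
                      min_df=2)           # ignore terms seen in < 2 documents
docs = ["common common common rare", "common term", "common term again"]
X = vec.fit_transform(docs)
print(sorted(vec.vocabulary_))  # only terms appearing in >= 2 documents
```

As the thread notes, min_df prunes rare terms rather than common ones; overly common terms are damped by sublinear tf (or removed via max_df).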

Re: [Scikit-learn-general] logsum algorithm

2013-08-29 Thread Philipp Singer
Hi, this seems to be simply the so-called log-sum trick. It is indeed used for underflow problems, as you already mention. This great video might help: http://www.youtube.com/watch?v=-RVM21Voo7Q Regards, Philipp On 29.08.2013 19:32, David Reed wrote: Hello, Was hoping someone
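A minimal version of the trick, checked against SciPy's implementation:

```python
# Sketch of the log-sum-exp trick: subtract the max before exponentiating so
# exp() never underflows to 0 (which would make the log -inf).
import numpy as np
from scipy.special import logsumexp

def logsum(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([-1000.0, -1001.0, -999.5])
print(logsum(x))                            # finite, despite exp(-1000) == 0.0
print(np.isclose(logsum(x), logsumexp(x)))  # matches scipy.special.logsumexp
```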

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Philipp Singer
Hi Christian, Some time ago I had similar problems. I.e., I wanted to add extra features to my lexical features, and simple concatenation didn't work that well for me, even though both feature sets performed pretty well on their own. You can follow the discussion about my problem here [1]

Re: [Scikit-learn-general] Fit functions

2013-04-05 Thread Philipp Singer
Dictionaries cannot have duplicate keys (labels). You could only make a list of datawithLabelX for each key label. But what is the benefit of this? Philipp On 05.04.2013 11:37, Bill Power wrote: i know this is going to sound a little silly, but I was thinking there that it might be nice to

Re: [Scikit-learn-general] Multiple training instances in the HMM library

2013-03-18 Thread Philipp Singer
Well, you can quite easily append multiple sequences to each other by introducing a RESET state that you append to the first sequence, then you add the next, and so on. As the HMM afaik only supports first-order transitions, this should work quite well. Regards, Philipp On 18.03.2013 21:42,

Re: [Scikit-learn-general] Multiple training instances in the HMM library

2013-03-18 Thread Philipp Singer
On Mon, Mar 18, 2013 at 1:49 PM, Philipp Singer kill...@gmail.com wrote: Well, you can quite easily append multiple sequences to each other by introducing a RESET state that you append to the first sequence and then you add the next and so on. As the HMM afaik only

Re: [Scikit-learn-general] Data format

2013-03-08 Thread Philipp Singer
Why do you want to convert libsvm to another structure? I don't quite get it. If you want to use examples: scikit-learn includes datasets that can be loaded directly. I think this section should help: http://scikit-learn.org/stable/datasets/index.html On 08.03.2013 18:44, Mohamed

Re: [Scikit-learn-general] Get every package once and for all

2013-03-07 Thread Philipp Singer
Well, the reason may be that EPD does not include the newest scikit-learn release. Afaik AdaBoost is only included as of 0.14, which is the current development version, which you have to install by hand. Regards, Philipp On 07.03.2013 19:55, Mohamed Radhouane Aniba wrote: Hello I am

Re: [Scikit-learn-general] Imbalance in scikit-learn

2013-02-25 Thread Philipp Singer
Hey! One simple solution that often works wonders is to set the class_weight parameter of a classifier (if available) to 'auto' [1]. If you have enough data, it often also makes sense to balance the data beforehand. [1] http://scikit-learn.org/dev/modules/svm.html#unbalanced-problems On
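A sketch of the class_weight suggestion on synthetic data; the string was 'auto' at the time of this thread and was renamed 'balanced' in later scikit-learn releases:

```python
# Sketch: counter class imbalance by weighting classes inversely proportional
# to their frequency in the training data.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = np.array([0] * 180 + [1] * 20)        # 9:1 imbalance
clf = LinearSVC(class_weight='balanced')  # 'auto' in the versions discussed here
clf.fit(X, y)
print(clf.classes_)
```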

[Scikit-learn-general] named entity extraction

2013-02-23 Thread Philipp Singer
Hey guys! I currently have the problem of doing named entity extraction on relatively short sparse textual input. I have a predefined set of concepts and training and test data. As I have no real experience with such a thing, I wanted to ask if you can recommend any technique, preferable

Re: [Scikit-learn-general] Multilabel questions

2013-01-24 Thread Philipp Singer
Yep, I know that. The PR looks promising; I will look into it. Just another question: if the OVR predicts multiple labels for a sample, are they somehow ranked? I know it is just the one-vs-rest approach, but maybe there is some kind of confidence involved. Because then the evaluation would be

[Scikit-learn-general] Multilabel questions

2013-01-23 Thread Philipp Singer
Hey guys! I am currently trying to do multilabel prediction using textual features (e.g., tfidf). My data has a different number of labels per sample: one sample can have just one label and another can have 10 labels. I now simply built a list of tuples for my y vector. So for example: (19,

Re: [Scikit-learn-general] Multilabel questions

2013-01-23 Thread Philipp Singer
On 23.01.2013 16:33, Andreas Mueller wrote: Hi Philipp. LinearSVC cannot cope with multilabel problems. It seems it is not doing enough input validation. You have to use OneVsRestClassifier together with LinearSVC to do that afaik. Cheers, Andy On 23.01.2013 16:27, Philipp Singer wrote: Hey
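The OneVsRestClassifier + LinearSVC combination suggested here, sketched with a toy corpus; in current scikit-learn the variable-length label tuples are first converted to an indicator matrix with MultiLabelBinarizer:

```python
# Sketch: multilabel text classification via one-vs-rest.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["python code", "python snake", "snake venom", "code review"]
labels = [(0,), (0, 1), (1,), (0,)]           # variable number of labels per doc
Y = MultiLabelBinarizer().fit_transform(labels)  # (4, 2) indicator matrix
X = TfidfVectorizer().fit_transform(docs)
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
print(clf.predict(X).shape)
```

Regarding the ranking question in the thread: clf.decision_function(X) gives a per-label confidence score that can be used to order the predicted labels.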

Re: [Scikit-learn-general] ANN: scikit-learn 0.13 released!

2013-01-22 Thread Philipp Singer
Great work as always, guys! Eager to try out the new features, especially the feature hashing. On 22.01.2013 00:02, Andreas Mueller wrote: Hi all. I am very happy to announce the release of scikit-learn 0.13. New features in this release include feature hashing for text processing,

Re: [Scikit-learn-general] does anyone do dot( sparse vec, sparse vec ) ?

2012-12-27 Thread Philipp Singer
On 27.12.2012 18:32, Olivier Grisel wrote: 2012/12/27 denis denis-bz...@t-online.de: Olivier Grisel olivier.grisel@... writes: 2012/12/27 denis denis-bz-gg@...: Folks, does any module in scikit-learn do dot( sparse vec, sparse vec ) a lot ? I wanted to try out a fast dot_sparse_vec

[Scikit-learn-general] Get classification report inside grid search or cv

2012-12-06 Thread Philipp Singer
Hey! Is it possible to somehow get detailed prediction information inside grid search or cross validation for individual folds or grids? So i.e., I want to know how my classes perform for each of the folds I am doing in GridSearchCV. I can read the average scores using grid_scores_, and this is
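GridSearchCV only exposes aggregated scores, so one sketch is an explicit fold loop; this uses the modern model_selection API rather than the cross_validation module of the era, and the data is synthetic:

```python
# Sketch: per-fold classification reports via an explicit CV loop.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(60, 3)
y = rng.randint(0, 2, 60)

reports = []
for train, test in StratifiedKFold(n_splits=3).split(X, y):
    clf = LogisticRegression().fit(X[train], y[train])
    reports.append(classification_report(y[test], clf.predict(X[test])))
print(len(reports))  # one detailed per-class report per fold
```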

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
It's probably better to train a linear classifier on the text features alone and a second (potentially non-linear) classifier, such as GBRT or ExtraTrees, on the predict_proba outcome of the text classifier + your additional low-dim features. This is some kind of stacking method (a sort of
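The stacking recipe described above, sketched on toy data; for a real setup the probabilities should come from held-out folds (cross-validated predictions) to avoid leakage:

```python
# Sketch: a linear text classifier's predicted probabilities become features
# for a second, non-linear model, alongside extra low-dimensional features.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

docs = ["good movie", "bad movie", "great film", "terrible film"] * 10
y = np.array([1, 0, 1, 0] * 10)
extra = np.array([[0.2], [0.9], [0.1], [0.8]] * 10)  # low-dim side features

X_text = TfidfVectorizer().fit_transform(docs)
text_clf = LogisticRegression().fit(X_text, y)
proba = text_clf.predict_proba(X_text)   # ideally out-of-fold in practice
X_stacked = np.hstack([proba, extra])
stacker = GradientBoostingClassifier(n_estimators=20, random_state=0)
stacker.fit(X_stacked, y)
print(stacker.score(X_stacked, y))
```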

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
On 04.12.2012 12:26, Andreas Mueller wrote: On 04.12.2012 12:20, Olivier Grisel wrote: 2012/12/4 Philipp Singer kill...@gmail.com: It's probably better to train a linear classifier on the text features alone and a second (potentially non linear classifier such as GBRT or ExtraTrees

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
Have you scaled your additional features to the [0-1] range, like the probability features from the text classifier? Until now I performed Scaler() (I'm on 0.12 atm) on the new feature space. Should I do this on my appended features only? But well, they are not exactly between 0 and 1 then. I

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
On 04.12.2012 15:15, Olivier Grisel wrote: 2012/12/4 Philipp Singer kill...@gmail.com: Have you scaled your additional features to the [0-1] range as the probability features from the text classifier? Until now I performed Scaler() (I'm on 0.12 atm) on the new feature space. Should I do

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-03 Thread Philipp Singer
Thanks to Andreas I got it working now using a custom estimator for the pipeline. I am still struggling a bit to combine textual features (e.g., tfidf) with other features that work well on their own. At the moment, I am just concatenating them -- enlarging the vector. The problem now is,
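The concatenation itself can be done sparsely with scipy.sparse.hstack; rescaling the appended features is one hedge against the scale mismatch discussed later in the thread (data illustrative):

```python
# Sketch: concatenating sparse tf-idf features with a dense extra-feature
# block, after crudely rescaling the extra features toward [0, 1].
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["one two", "two three", "three four"]
X_text = TfidfVectorizer().fit_transform(docs)   # sparse, L2-normalized rows
extra = np.array([[3.0], [5.0], [4.0]])
extra = extra / np.abs(extra).max()              # crude rescaling
X = sp.hstack([X_text, sp.csr_matrix(extra)]).tocsr()
print(X.shape)
```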

[Scikit-learn-general] Potential problem with Leave-one-out and f1_score

2012-11-30 Thread Philipp Singer
Hey! First of all: thanks for the hints on my last post. I decided to stick with Leave-One-Out for now, and I'm doing grid search with cross validation using Leave-One-Out. As I am interested in retrieving the F1 score, I am using it as score_func. The problem now is that the following error

[Scikit-learn-general] Append additional data in pipeline

2012-11-30 Thread Philipp Singer
Hey again! Today is my posting day, I hope you don't mind, but I just stumbled upon a further problem. I currently use a grid search / StratifiedKFold approach that works on textual data. So I use a pipeline that does tfidf vectorization as well. The thing now is that I want to append

Re: [Scikit-learn-general] Append additional data in pipeline

2012-11-30 Thread Philipp Singer
On 30.11.2012 17:31, Andreas Mueller wrote: On 30.11.2012 16:58, Philipp Singer wrote: Hey again! Today is my posting day, I hope you don't mind, but I just stumbled upon a further problem. I currently use a grid search / StratifiedKFold approach that works on textual data. So I use

[Scikit-learn-general] Cross validation iterator - leave one out per class

2012-11-29 Thread Philipp Singer
Hey! I have the following scenario: I have, e.g., three different classes. For class 0 I may have 6 samples, for class 1 ten, and for class 2 four. I now want to do cross validation ten times, but in my case I want to train on all samples of a class except one, which I want to use as test
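One way to sketch this scheme is a small generator yielding train/test index arrays; leave_one_per_class is a hypothetical helper, not an existing scikit-learn splitter:

```python
# Sketch: each split holds out exactly one randomly chosen sample per class
# and trains on all the rest.
import numpy as np

def leave_one_per_class(y, n_iter, seed=0):
    # Hypothetical helper, written for illustration only.
    rng = np.random.RandomState(seed)
    y = np.asarray(y)
    for _ in range(n_iter):
        test = np.array([rng.choice(np.where(y == c)[0]) for c in np.unique(y)])
        train = np.setdiff1d(np.arange(len(y)), test)
        yield train, test

y = np.array([0] * 6 + [1] * 10 + [2] * 4)   # class sizes from the post
for train, test in leave_one_per_class(y, n_iter=10):
    print(len(train), len(test))  # 17 3 on every iteration
```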

Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-27 Thread Philipp Singer
representations, so if I could find any faster solution for my problem this would be awesome. Regards, Philipp On Fri, Oct 26, 2012 at 3:31 PM, Philipp Singer kill...@gmail.com wrote: Am 26.10.2012 15:35, schrieb Olivier Grisel: BTW, in the mean time you could encode your coocurrences as text

[Scikit-learn-general] All-pairs-similarity calculation

2012-10-26 Thread Philipp Singer
Hey there! Currently I am working on very large sparse vectors and have to calculate similarity between all pairs of them. I have now looked into the available code in scikit-learn and also at corresponding literature. So I stumbled upon this paper [1] and the corresponding implementation [2].

Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-26 Thread Philipp Singer
On 26.10.2012 14:27, Olivier Grisel wrote: 2012/10/26 Philipp Singer kill...@gmail.com: Hey there! Currently I am working on very large sparse vectors and have to calculate similarity between all pairs of them. How many features? Are they sparse? If so, which sparsity level? In detail: I

Re: [Scikit-learn-general] How to save an array of models

2012-10-18 Thread Philipp Singer
On 17.10.2012 20:57, Kenneth C. Arnold wrote: import cPickle as pickle # faster on Py2.x, default on Py3. with open(filename, 'wb') as f: pickle.dump(obj, f, -1) The -1 at the end chooses the latest file format version, which is more compact. What exactly does -1 do? I guess that's
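To answer the question in the thread: -1 selects the highest pickle protocol available, a compact binary format rather than the old ASCII default. A minimal sketch using the Python 3 pickle module:

```python
# Sketch: dumping and reloading an object with the highest pickle protocol.
import os
import pickle
import tempfile

obj = {"weights": list(range(1000))}
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(obj, f, -1)   # -1 is equivalent to pickle.HIGHEST_PROTOCOL
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored == obj)  # True
```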

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
- Original message - From: Philipp Singer kill...@gmail.com To: scikit-learn-general@lists.sourceforge.net Sent: Friday, 14 September 2012 13:47:30 Subject: [Scikit-learn-general] Combining TFIDF and LDA features Hey there! I have seen in the past some few research papers that combined

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
...@ais.uni-bonn.de wrote: I'd be interested in the outcome. Let us know when you get it to work :) - Original message - From: Philipp Singer kill...@gmail.com To: scikit-learn-general@lists.sourceforge.net

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
, this would be something I could look into. I have already tried to do feature selection with chi2 but have not actually looked at the specific statistics. Cheers, Andy Regards, Philipp - Original message - From: Philipp Singer kill...@gmail.com To: scikit-learn-general

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
Hey! On 14.09.2012 15:10, Peter Prettenhofer wrote: I totally agree - I had such an issue in my research as well (combining word presence features with SVD embeddings). I followed Blitzer et al. 2006 and normalized** both feature groups separately - e.g. you could normalize word presence

Re: [Scikit-learn-general] how to pickle CountVectorizer

2012-08-08 Thread Philipp Singer
On 08.08.2012 14:53, David Montgomery wrote: So... does it make sense to pickle CountVectorizer? I just did not want to fit CountVectorizer every time I wanted to score an SVM model. It makes sense to pickle the fitted vectorizer. In this case you are just trying to pickle the plain object.

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
On 18.07.2012 15:32, Peter Prettenhofer wrote: In this case I would fit one MultinomialNB for the foreground model and one for the background model. But how would I do the feature extraction (I have text documents) in this case? Would I fit (e.g., tfidf) on the whole corpus (foreground +

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
On 20.07.2012 11:47, Lars Buitinck wrote: 2012/7/20 Philipp Singer kill...@gmail.com: Everything works fine now. The sad thing though is that I still can't really improve the classification results. The only thing I can achieve is a higher recall for the classes that work well

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
On 20.07.2012 11:47, Lars Buitinck wrote: Well, since Gael already mentioned semi-supervised training using label propagation: I have an old PR which has still not been merged, mostly for API reasons, that implements semi-supervised training of Naive Bayes using an EM algorithm:

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
On 20.07.2012 15:34, Lars Buitinck wrote: 2012/7/20 Philipp Singer kill...@gmail.com: I just tried out your implementation of semi-supervised MultinomialNB. The code works flawlessly, but unfortunately the performance of the algorithm drops extremely when I try to incorporate my

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-11 Thread Philipp Singer
On 11.07.2012 10:11, Olivier Grisel wrote: LinearSVC is based on the liblinear C++ library, which AFAIK does not support sample weights. Well, that's true. You should have a look at SGDClassifier instead: http://scikit-learn.org/stable/modules/sgd.html I have already tried approaches

[Scikit-learn-general] Incorporation of extra training examples

2012-07-09 Thread Philipp Singer
Hey! I am currently doing text classification. I have the following setup: 78 classes, max 1500 training examples per class, around 90,000 training examples overall, and the same number of test examples. I am pretty happy with the classification results (~52% F1 score), which is fine for my task. But now I have

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-09 Thread Philipp Singer
On 09.07.2012 13:59, Vlad Niculae wrote: Another (hackish) idea to try would be to keep the labels of the extra data but give it a sample_weight low enough not to override your good training data. That's actually a great and simple idea. Would I do that similarly to this example:
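A sketch of the sample_weight idea with SGDClassifier, which accepts per-sample weights at fit time; the data and weight values are illustrative:

```python
# Sketch: down-weighting extra, noisier training examples so they cannot
# override the trusted core training set.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = (X[:, 0] > 0).astype(int)   # linearly separable toy labels
w = np.ones(100)
w[80:] = 0.1                    # treat the last 20 rows as less trusted extras
clf = SGDClassifier(random_state=0).fit(X, y, sample_weight=w)
print(clf.score(X, y))
```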

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-09 Thread Philipp Singer
On 09.07.2012 13:47, Peter Prettenhofer wrote: Hi, Hey! some quick thoughts: - if you use a multinomial Naive Bayes classifier (aka a language model) you can fit a background model on the large dataset and use that to smooth the model fitted on the smaller dataset. That's a nice idea.

Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation

2012-06-01 Thread Philipp Singer
In terms of accuracy; runtime is not the problem. Philipp On 01.06.2012 18:58, Andreas Mueller wrote: Hi Philipp. Do you mean it performs worse in terms of accuracy or in terms of runtime? Cheers, Andy On 01.06.2012 18:57, Philipp Singer wrote: Hey! So I have tried adding

Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation

2012-05-30 Thread Philipp Singer
Hey Andy! Yep, I am using it successfully ;) The idea of adding epsilon sounds legit. I will definitely try it out. I think it would be nice if you could add it to your code; it would also make it easier to work with sparse matrices. Regards, Philipp Hi Philipp. Great to hear that someone is

[Scikit-learn-general] Porter Stemmer

2012-05-25 Thread Philipp Singer
Hey! Is it possible to easily include stemming in text feature extraction in scikit-learn? I know that nltk has an implementation of the Porter stemmer, but I do not want to change my whole text feature extraction process to nltk if possible. It would be nice if I could include that somehow
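One common pattern is to override build_analyzer on the vectorizer; here a toy suffix-stripper stands in for NLTK's PorterStemmer, which can be dropped in the same way:

```python
# Sketch: plugging a stemming step into CountVectorizer's analyzer so the
# rest of the feature-extraction pipeline stays unchanged.
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(word):
    # Illustrative only: strip a plural "s". Replace with
    # nltk.stem.PorterStemmer().stem for real stemming.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [toy_stem(w) for w in analyzer(doc)]

vec = StemmedCountVectorizer()
vec.fit(["cats and dogs", "cat and dog"])
print(sorted(vec.vocabulary_))  # 'cats'/'cat' and 'dogs'/'dog' collapse
```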

[Scikit-learn-general] Classificator for probability features

2012-05-14 Thread Philipp Singer
Hey there! I am currently trying to classify a dataset which has the following format: Class1 0.3 0.5 0.2 / Class2 0.9 0.1 0.0 ... So the features are probabilities that always sum to exactly 1. I have tried several linear classifiers but I am now wondering if there is maybe some better way

Re: [Scikit-learn-general] Classificator for probability features

2012-05-14 Thread Philipp Singer
that ;) Regards, Philipp On 14.05.2012 21:18, David Warde-Farley wrote: On Mon, May 14, 2012 at 05:00:54PM +0200, Philipp Singer wrote: Thanks, that sounds really promising. Is there an implementation of KL divergence in scikit-learn? If so, how can I directly use that? I don't believe

[Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Philipp Singer
Hey! I am currently using sklearn.feature_extraction.text.Vectorizer (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.Vectorizer.html) for feature extraction of text documents I have. I am now curious and don't quite understand how the TFIDF calculation is

Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Philipp Singer
The IDF statistics are computed once on the whole training corpus as passed to the `fit` method and then reused on each call to the `transform` method. For a train/test split, one typically calls fit_transform on the train split (to compute the IDF vector on the train split only) and reuses those
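The fit/transform split described above, as a minimal sketch on a toy corpus:

```python
# Sketch: IDF statistics are learned from the training split only and then
# reused, unchanged, on the test split.
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["spam spam ham", "ham eggs", "eggs and spam"]
test_docs = ["spam eggs"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learns vocabulary + IDF weights
X_test = vec.transform(test_docs)        # reuses them; no refitting
print(X_train.shape, X_test.shape)       # same number of columns
```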

Re: [Scikit-learn-general] Best classification for very sparse and skewed feature matrix

2012-01-24 Thread Philipp Singer
On 15.01.2012 19:45, Gael Varoquaux wrote: On Sun, Jan 15, 2012 at 07:39:00PM +0100, Philipp Singer wrote: The problem is that my representation is very sparse, so I have a huge number of zeros. That's actually good: some of our estimators are able to use a sparse representation to speed up