Re: [Scikit-learn-general] logsum algorithm

2013-08-29 Thread Philipp Singer
Hi, It seems this is simply the so-called "logsum trick". It's used to avoid underflow problems, as you already mention. This great video might help: http://www.youtube.com/watch?v=-RVM21Voo7Q Regards, Philipp On 29.08.2013 19:32, David Reed wrote: Hello, Was hoping someone co
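The trick mentioned above can be sketched in a few lines (a minimal NumPy version; SciPy also ships this as scipy.special.logsumexp): subtract the maximum before exponentiating so nothing underflows.

```python
import numpy as np

def logsumexp(log_p):
    """Numerically stable log(sum(exp(log_p))): shift by the max before
    exponentiating so that exp() never underflows to zero."""
    m = np.max(log_p)
    return m + np.log(np.sum(np.exp(log_p - m)))

# Naively, exp(-1000) underflows to 0.0 and the log becomes -inf;
# the shifted version recovers the correct value.
print(logsumexp(np.array([-1000.0, -1000.0])))  # ≈ -999.31
```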

[Scikit-learn-general] TFIDF question

2013-11-29 Thread Philipp Singer
Hi there, I am currently working with the TfidfVectorizer provided by scikit learn. However, I just came up with a problem/question. In my case I have around 20 very long documents. Some terms in these documents occur much, much more frequently than others. From my pure intuition, these terms s

Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Philipp Singer
Alright! By removing the +1 the results seem much more legit. Also, the sublinear transformation makes sense. However, why use min_df=2 if I am worried about very common words? -----Original message----- From: Lars Buitinck [mailto:larsm...@gmail.com] Sent: Friday, 29 November 2013
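For reference, a hedged sketch of the knobs discussed in this thread as they appear in current scikit-learn releases (names may differ from the 2013 version): sublinear_tf damps very frequent terms, smooth_idf controls the "+1" idf smoothing, and max_df, not min_df (which prunes rare terms), is the option aimed at overly common words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# sublinear_tf=True replaces tf with 1 + log(tf), damping very frequent terms;
# smooth_idf=False drops the "+1" smoothing in the idf formula;
# max_df removes terms appearing in more than 90% of documents ("the" here).
vec = TfidfVectorizer(sublinear_tf=True, smooth_idf=False, max_df=0.9)
X = vec.fit_transform(docs)
print(X.shape)
```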

[Scikit-learn-general] Sparse Random Projection Issue

2014-08-04 Thread Philipp Singer
Hi all, I am currently trying to calculate all-pairs similarity between a large number of text documents. I am using a TfidfVectorizer for feature generation and then want to calculate cosine similarity between the pairs. Hence, I am calculating X * X.T between the L2 normalized matrices. As m
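The computation described here can be sketched as follows, with a random sparse matrix standing in for the tfidf matrix; after L2 normalization, `X @ X.T` is exactly the pairwise cosine similarity. For corpora too large to hold the full result, the same product can be computed in row chunks.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.preprocessing import normalize

# Toy stand-in for a tfidf matrix: 100 docs, 500 features, mostly zeros.
X = sparse_random(100, 500, density=0.05, format="csr", random_state=0)

Xn = normalize(X)   # L2-normalize each row
S = Xn @ Xn.T       # cosine similarity of every pair; result stays sparse
print(S.shape)
```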

Re: [Scikit-learn-general] Sparse Random Projection Issue

2014-08-04 Thread Philipp Singer
Am 04.08.2014 um 20:54 schrieb Lars Buitinck : > 2014-08-04 17:39 GMT+02:00 Philipp Singer : >> Apart from that, does anyone know a solution of how I can efficiently >> calculate the resulting matrix Y = X * X.T? I am currently thinking about >> using PyTables with

Re: [Scikit-learn-general] Sparse Random Projection Issue

2014-08-04 Thread Philipp Singer
Am 04.08.2014 um 22:14 schrieb Philipp Singer : > > Am 04.08.2014 um 20:54 schrieb Lars Buitinck : > >> 2014-08-04 17:39 GMT+02:00 Philipp Singer : >>> Apart from that, does anyone know a solution of how I can efficiently >>> calculate the resulting matrix Y =

[Scikit-learn-general] Sparse Random Projection negative weights

2014-08-08 Thread Philipp Singer
Hi, I asked a question about the sparse random projection a few days ago, but thought I should start a new topic regarding my current problem. I am calculating TFIDF weights for my text documents and then calculate cosine similarity between documents for determining the similarity between docum

Re: [Scikit-learn-general] Sparse Random Projection negative weights

2014-08-08 Thread Philipp Singer
Just another remark regarding this: I guess I can not circumvent the negative cosine similarity values. Maybe LSA is a better approach? (TruncatedSVD) Am 08.08.2014 um 10:35 schrieb Philipp Singer : > Hi, > > I asked a question about the sparse random projection a few days ago, but

Re: [Scikit-learn-general] Sparse Random Projection negative weights

2014-08-08 Thread Philipp Singer
schrieb Arnaud Joly : > Have you tried to increase the number of components or epsilon parameter and > density of the SparseRandomProjection? > Have you tried to normalise X prior the random projection? > > Best regards, > Arnaud > > On 08 Aug 2014, at 12:19, Philipp S

[Scikit-learn-general] Multi-target regression

2014-09-05 Thread Philipp Singer
Hey! I am currently working with data having multiple outcome variables. So for example, my outcome I want to predict can be of multiple dimension. One line of the data could look like the following: y = [10, 15] x = [13, 735478, 0.555, …] So I want to predict all dimensions of the outcome.

Re: [Scikit-learn-general] Multi-target regression

2014-09-08 Thread Philipp Singer
faik, the multi response regression forests in sklearn > will consider the correlation between features. > -- > Flavio > > > On Fri, Sep 5, 2014 at 11:03 AM, Philipp Singer wrote: >> Hey! >> >> I am currently working with data having multiple outcome variab

Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-30 Thread Philipp Singer
Am 23.03.2012 13:58, schrieb Olivier Grisel: > Le 23 mars 2012 13:27, Philipp Singer a écrit : >> The IDF statistics is computed once on the whole training corpus as >> passed to the `fit` method and then reused on each call to the >> `transform` method. >> >

[Scikit-learn-general] TFIDF for short text

2012-05-03 Thread Philipp Singer
Hey! I am currently using Tweets crawled for Twitter and try to make text classification on them. My first idea was to use TFIDF for this case. But when thinking more about it, that doesn't really make sense for short texts which are limited to 140 characters, because the TF value will nearly al

[Scikit-learn-general] Classificator for probability features

2012-05-14 Thread Philipp Singer
Hey there! I am currently trying to classify a dataset which has the following format: Class1 0.3 0.5 0.2 Class2 0.9 0.1 0.0 ... So the features are probabilities that always sum up to exactly 1. I have tried several linear classifiers but I am now wondering if there is maybe some better way t

Re: [Scikit-learn-general] Classificator for probability features

2012-05-14 Thread Philipp Singer
Thanks, that sounds really promising. Is there an implementation of KL divergence in scikit-learn? If so, how can I directly use that? Regards, Philipp > Hi Philipp, > > you could try a nearest neighbors approach and use KL-divergence as > your "distance metric"** > > best, > Peter > > ** KL-
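scikit-learn has no built-in KL divergence metric, but `KNeighborsClassifier` accepts any callable as `metric` (brute-force search only). A toy sketch, with an `eps` guard for zero probabilities; note that KL is not symmetric, so it is a "distance" only loosely:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def kl_divergence(p, q, eps=1e-10):
    """D_KL(p || q); eps guards against log(0) for zero probabilities."""
    p = p + eps
    q = q + eps
    return np.sum(p * np.log(p / q))

# Rows are probability distributions that sum to 1, as in the post.
X = np.array([[0.3, 0.5, 0.2],
              [0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.8, 0.2, 0.0]])
y = np.array([0, 1, 0, 1])

# A user-supplied metric requires brute-force neighbor search.
clf = KNeighborsClassifier(n_neighbors=1, metric=kl_divergence,
                           algorithm="brute").fit(X, y)
print(clf.predict([[0.85, 0.15, 0.0]]))
```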

Re: [Scikit-learn-general] Classificator for probability features

2012-05-14 Thread Philipp Singer
that ;) Regards, Philipp Am 14.05.2012 21:18, schrieb David Warde-Farley: > On Mon, May 14, 2012 at 05:00:54PM +0200, Philipp Singer wrote: >> Thanks, that sounds really promising. >> >> Is there an implementation of KL divergence in scikit-learn? If so, how can >> I dire

[Scikit-learn-general] Porter Stemmer

2012-05-25 Thread Philipp Singer
Hey! Is it possible to easily include stemming in text feature extraction in scikit-learn? I know that nltk has an implementation of the Porter stemmer, but I do not want to change my whole text feature extraction process to nltk if possible. Would be nice if I could include that somehow easily
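A common pattern is to override `build_analyzer` so stemming plugs into the existing extraction pipeline. The `stem` function below is a hypothetical stand-in; with NLTK installed you would call `nltk.stem.PorterStemmer().stem` there instead:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy stemmer: strips a few common suffixes. Swap in the real
# nltk.stem.PorterStemmer().stem for production use.
def stem(token):
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the stock analyzer (tokenization, lowercasing, n-grams)
        # and stem each token it produces.
        analyzer = super().build_analyzer()
        return lambda doc: [stem(t) for t in analyzer(doc)]

vec = StemmedCountVectorizer()
vec.fit(["walking walked walks"])
print(sorted(vec.vocabulary_))
```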

[Scikit-learn-general] Additive Chi2 Kernel Approximation

2012-05-30 Thread Philipp Singer
Hey! I am currently using this kernel approximation method followed by a linear SVC. It works pretty well, but today I hopped into a problem: It seems like all the feature values need to be strict positive (when I look into the code exception >= 0). Why is this the case? Is it somehow possible

Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation

2012-05-30 Thread Philipp Singer
Hey Andy! Yep I am using it successfully ;) The idea with adding epsilon sounds legit. I will definitely try it out. I think it would be nice if you could add it to your code. It would also make it easier to work with sparse matrices. Regards, Philipp > Hi Philipp. > Great to hear that someone is

Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation

2012-05-30 Thread Philipp Singer
make them dense. I haven't really looked at it but I think > it should somehow be possible to use this approximation also > on sparse matrices. > Cheers, > Andy > > Am 30.05.2012 15:45, schrieb Philipp Singer: >> Hey Andy! >> >> Yep I am using it successfully

Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation

2012-06-01 Thread Philipp Singer
Mueller: > Hi Philipp. > The problem with using sparse matrices is that adding an epsilon > would make them dense. I haven't really looked at it but I think > it should somehow be possible to use this approximation also > on sparse matrices. > Cheers, > Andy > > Am 3

Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation

2012-06-01 Thread Philipp Singer
In terms of accuracy. Runtime is not the problem. Philipp Am 01.06.2012 18:58, schrieb Andreas Mueller: > Hi Philipp. > Do you mean it performs worse in terms of accuracy or in terms of runtime? > Cheers, > Andy > > Am 01.06.2012 18:57, schrieb Philipp Singer: >> Hey! &

[Scikit-learn-general] Incorporation of extra training examples

2012-07-09 Thread Philipp Singer
Hey! I am currently doing text classification. I have the following setup: 78 classes, max 1500 train examples per class, overall around 90,000 train examples, and the same number of test examples. I am pretty happy with the classification results (~52% f1 score) which is fine for my task. But now I have

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-09 Thread Philipp Singer
e switch to another method. SVC unfortunately has a very long runtime compared to LinearSVC, but maybe an SGDClassifier would work. Regards, Philipp > > On 09.07.2012, at 12:43, Philipp Singer wrote: > >> Hey! >> >> I am currently doing text classification. I have the follo

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-09 Thread Philipp Singer
Am 09.07.2012 13:47, schrieb Peter Prettenhofer: > Hi, Hey! > > some quick thoughts: > > - if you use a multinomial Naive Bayes classifier (aka a language > model) you can fit a background model on the large dataset and use > that to smooth the model fitted on the smaller dataset. That's a nice i

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-11 Thread Philipp Singer
Am 10.07.2012 22:57, schrieb Andreas Mueller: > You can use SVC with kernel="linear". That shouldn't be much slower than > LinearSVC. > Thanks for the hint. Unfortunately, the LinearSVC implementation is much faster than the SVC implementation with a linear kernel. -

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-11 Thread Philipp Singer
Am 11.07.2012 10:02, schrieb Olivier Grisel: > 2012/7/11 Philipp Singer: >> Am 10.07.2012 22:57, schrieb Andreas Mueller: >> >>> You can use SVC with kernel="linear". That shouldn't be much slower than >>> LinearSVC. >>> >> >>

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-11 Thread Philipp Singer
Am 11.07.2012 10:11, schrieb Olivier Grisel: > > LinearSVC is based on the liblinear C++ library which AFAIK does not > support sample weight. Well, that's true. You should better have a look at SGDClassifier: > > http://scikit-learn.org/stable/modules/sgd.html > I have already tried approach
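As discussed in the thread, `SGDClassifier.fit` accepts `sample_weight`, so the extra examples can be downweighted rather than dropped. A minimal sketch on synthetic data (the 0.3 weight is an arbitrary illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
# 100 trusted examples followed by 100 noisier "extra" examples.
X = rng.randn(200, 5)
y = (X[:, 0] > 0).astype(int)

# Downweight the extra block so it refines rather than dominates the model.
weights = np.concatenate([np.ones(100), np.full(100, 0.3)])

clf = SGDClassifier(loss="hinge", random_state=0)
clf.fit(X, y, sample_weight=weights)
print(clf.score(X, y))
```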

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-11 Thread Philipp Singer
Am 11.07.2012 10:17, schrieb Vlad Niculae: > > On Jul 11, 2012, at 10:14 , Philipp Singer wrote: > >> I have already tried approaches like SGDC or Multinomial Naive Bayes. I >> can improve these two classifiers with sample weighting, but the thing >> is that LinearS

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-18 Thread Philipp Singer
Am 09.07.2012 14:44, schrieb Peter Prettenhofer: > 2012/7/9 Philipp Singer : >> Am 09.07.2012 13:47, schrieb Peter Prettenhofer: >>> Hi, >> >> Hey! >>> >>> some quick thoughts: >>> >>> - if you use a multinomial Naive Bayes class

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-18 Thread Philipp Singer
Am 18.07.2012 15:32, schrieb Peter Prettenhofer: >>> In this case I would fit one MultinomialNB for the foreground model and >>> one for the background model. But how would I do the feature extraction >>> (I have text documents) in this case? Would I fit (e.g., tfidf) on the >>> whole corpus (foreg

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
Am 18.07.2012 15:32, schrieb Peter Prettenhofer: >>> In this case I would fit one MultinomialNB for the foreground model and >>> one for the background model. But how would I do the feature extraction >>> (I have text documents) in this case? Would I fit (e.g., tfidf) on the >>> whole corpus (foreg

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
Am 20.07.2012 11:47, schrieb Lars Buitinck: > 2012/7/20 Philipp Singer : >> Everything works fine now. The sad thing though is that I still can't >> really improve the classification results. The only thing I can achieve >> is to get a higher recall for the classes work

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
Am 20.07.2012 11:47, schrieb Lars Buitinck: > > Well, since Gael already mentioned semi-supervised training using > label propagation: I have an old PR which has still not been merged, > mostly because of API reasons, that implements semi-supervised > training of Naive Bayes using an EM algorithm:

Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
On 20.07.2012 15:34, Lars Buitinck wrote: > 2012/7/20 Philipp Singer : >> I just have tried out your implementation of semi-supervised >> MultinomialNB. The code works flawlessly, but unfortunately the >> performance of the algorithm drops extremely when I try to incorporate

Re: [Scikit-learn-general] how to pickel CountVectorizer

2012-08-08 Thread Philipp Singer
Am 08.08.2012 14:53, schrieb David Montgomery: > > So...does it make sense to pickel CountVectorizer? I just did not > want to fit CountVectorizer every time I wanted to score a svm model. > > It makes sense to pickle the fitted Vectorizer. In this case you are just trying to pickle the plain obj

Re: [Scikit-learn-general] how to pickel CountVectorizer

2012-08-08 Thread Philipp Singer
Am 08.08.2012 15:41, schrieb David Montgomery: > oh..but I want to run the below. The reason why I want to pickle. I > do picke the output of vec.fit though. So, I just want to load up a > saved vec pickle and create an array based on the fit so I can score a > svm model. > > vectorizer.transfo

Re: [Scikit-learn-general] Preparing text for GBClassifier

2012-08-08 Thread Philipp Singer
Hey! The problem seems to be the following: With the TfidfVectorizer you get back a sparse array representation. I think the GradientBoostingClassifier can't directly work with sparse matrices, whereas the first three can. So you can try it again with: training_set.toarray() HTH Philipp Am

[Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
Hey there! I have seen in the past some few research papers that combined tfidf based features with LDA topic model features and they could increase their accuracy by some useful extent. I now wanted to do the same. As a simple step I just attended the topic features to each train and test sam

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
fic statistics. > > Cheers, > Andy Regards, Philipp > > > - Original message - > From: "Philipp Singer" > To: scikit-learn-general@lists.sourceforge.net > Sent: Friday, 14 September 2012 13:47:30 > Subject: [Scikit-learn-general] Combining TFIDF

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
dreas Müller > mailto:amuel...@ais.uni-bonn.de>> wrote: > > I'd be interested in the outcome. > Let us know when you get it to work :) > > > - Original message - > From: "Philipp Singer" mailto:kill...@gmail.com>> > To: sci

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
- Original message ----- > From: "Philipp Singer" > To: scikit-learn-general@lists.sourceforge.net > Sent: Friday, 14 September 2012 14:00:48 > Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features > > On 14.09.2012 14:53, Andreas Müller wrote: >> Hi Philipp.

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
Hey! Am 14.09.2012 15:10, schrieb Peter Prettenhofer: > > I totally agree - I had such an issue in my research as well > (combining word presence features with SVD embeddings). > I followed Blitzer et. al 2006 and normalized** both feature groups > separately - e.g. you could normalize word presen

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
Am 14.09.2012 15:28, schrieb Philipp Singer: > Okay, so I did a fast chi2 check and it seems like some LDA features > have high p-values, so they should be helpful at least. Oh, sorry. We want the lowest p-values, right? But that's the same case. There are many with low p-val

Re: [Scikit-learn-general] How to save an array of models

2012-10-18 Thread Philipp Singer
Am 17.10.2012 20:57, schrieb Kenneth C. Arnold: > > import cPickle as pickle # faster on Py2.x, default on Py3. > with open(filename, 'wb') as f: >pickle.dump(obj, f, -1) > > The -1 at the end chooses the latest file format version, which is more > compact. What exactly does "-1" do? I guess
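To illustrate the answer: `-1` selects `pickle.HIGHEST_PROTOCOL`, the newest (binary, more compact) format, rather than the backwards-compatible default. A quick sketch:

```python
import pickle

obj = {"weights": [0.1, 0.2], "label": "model"}

# -1 is shorthand for pickle.HIGHEST_PROTOCOL: the newest format this
# Python supports, generally more compact and faster to load.
data_compact = pickle.dumps(obj, -1)
data_text = pickle.dumps(obj, 0)  # oldest, ASCII-based protocol, for contrast

print(len(data_compact), len(data_text))
```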

[Scikit-learn-general] All-pairs-similarity calculation

2012-10-26 Thread Philipp Singer
Hey there! Currently I am working on very large sparse vectors and have to calculate similarity between all pairs of them. I have now looked into the available code in scikit-learn and also at corresponding literature. So I stumbled upon this paper [1] and the corresponding implementation [2].

Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-26 Thread Philipp Singer
Am 26.10.2012 14:27, schrieb Olivier Grisel: > 2012/10/26 Philipp Singer : >> Hey there! >> >> Currently I am working on very large sparse vectors and have to >> calculate similarity between all pairs of them. > How many features? Are they sparse? If so which sparsi

Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-26 Thread Philipp Singer
Am 26.10.2012 15:35, schrieb Olivier Grisel: > BTW, in the mean time you could encode your coocurrences as text > identifiers use either Lucene/Solr in Java using the sunburnt python > client or woosh [1] in python as a way to do efficient sparse lookups > in such a sparse matrix to be able to q

Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-27 Thread Philipp Singer
e I have already done all my previous experiments on the complete representations, so if I could find any faster solution for my problem this would be awesome. Regards, Philipp > > On Fri, Oct 26, 2012 at 3:31 PM, Philipp Singer wrote: >> Am 26.10.2012 15:35, schrieb Olivier Grisel:

[Scikit-learn-general] Cross validation iterator - leave one out per class

2012-11-29 Thread Philipp Singer
Hey! I have the following scenario: I have e.g., three different classes. For class 0 I may have 6 samples, for class 1 ten and for class 2 four. I now want to do cross validation ten times, but in my case I want to train on all samples for a class except one which I want to use as test data.
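scikit-learn has no ready-made iterator for "leave one sample per class out", but a small generator yielding `(train_idx, test_idx)` pairs can be passed wherever a CV iterable is accepted. A sketch (the function name is made up):

```python
import numpy as np

def leave_one_per_class(y, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) where the test fold holds exactly one
    randomly chosen sample of every class; all remaining samples train."""
    rng = np.random.RandomState(seed)
    y = np.asarray(y)
    classes = np.unique(y)
    for _ in range(n_splits):
        test = np.array([rng.choice(np.where(y == c)[0]) for c in classes])
        train = np.setdiff1d(np.arange(len(y)), test)
        yield train, test

# 6 samples of class 0, 10 of class 1, 4 of class 2, as in the post.
y = [0] * 6 + [1] * 10 + [2] * 4
for train, test in leave_one_per_class(y, n_splits=3):
    print(len(train), len(test))
```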

[Scikit-learn-general] Potential problem with Leave-one-out and f1_score

2012-11-30 Thread Philipp Singer
Hey! First of all: thanks for the hints for my last post. I decided to stick with Leave-one-out for now and I'm doing grid search with cross validation using Leave-one-out. As I am interested in retrieving the F1 score I am using it as score_func. The problem now is that the following error messa

Re: [Scikit-learn-general] Potential problem with Leave-one-out and f1_score

2012-11-30 Thread Philipp Singer
Am 30.11.2012 13:56, schrieb Gael Varoquaux: > On Fri, Nov 30, 2012 at 01:33:42PM +0100, Philipp Singer wrote: >> I decided to stick around Leave-one-Out for now and Im doing grid search >> with cross validation using Leave-one-out. > > Don't. This is not a good model sel

Re: [Scikit-learn-general] Potential problem with Leave-one-out and f1_score

2012-11-30 Thread Philipp Singer
Am 30.11.2012 14:22, schrieb Gael Varoquaux: > On Fri, Nov 30, 2012 at 02:05:27PM +0100, Gael Varoquaux wrote: >> (there is a bit of literature on this, but it's really hard to find, >> and it's more folk knowledge in machine learning. > > Got a kick in the butt and did my homework: > > https://git

[Scikit-learn-general] Append additional data in pipeline

2012-11-30 Thread Philipp Singer
Hey again! Today is my posting day, hope you don't mind, but I just stumbled upon a further problem. I currently use a grid search stratified k-fold approach that works on textual data. So I use a pipeline that does tfidf vectorization as well. The thing now is that I want to append additiona

Re: [Scikit-learn-general] Append additional data in pipeline

2012-11-30 Thread Philipp Singer
Am 30.11.2012 17:31, schrieb Andreas Mueller: > Am 30.11.2012 16:58, schrieb Philipp Singer: >> Hey again! >> >> Today is my posting day, hope you don't bother, but I just stumbled upon >> a further problem. >> >> I currently use a grid search strtaified

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-03 Thread Philipp Singer
Thanks to Andreas I got it working now using a custom estimator for the pipeline. I am still struggling a bit to combine textual features (e.g., tfidf) with other features that work well on their own. At the moment, I am just concatenating them --> enlarging the vector. The problem now is, tha

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
> It's probably better to train a linear classifier on the text features > alone and a second (potentially non linear classifier such as GBRT or > ExtraTrees) on the predict_proba outcome of the text classifier + your > additional low dim features. > > This is some kind of stacking method (a sort
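The stacking idea described above can be sketched as follows on toy data (all names hypothetical). Note that for honest stacking, the stage-1 probabilities fed to stage 2 should come from out-of-fold predictions rather than, as here for brevity, from the model's own training data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

texts = ["good movie", "bad movie", "great film", "awful film"] * 10
extra = np.random.RandomState(0).rand(40, 2)   # low-dim extra features
y = np.array([1, 0, 1, 0] * 10)

# Stage 1: linear model on the high-dimensional text features alone.
X_text = TfidfVectorizer().fit_transform(texts)
text_clf = LogisticRegression().fit(X_text, y)

# Stage 2: non-linear model on the text model's class probabilities
# concatenated with the extra low-dimensional features.
stacked = np.hstack([text_clf.predict_proba(X_text), extra])
final_clf = GradientBoostingClassifier(random_state=0).fit(stacked, y)
print(final_clf.score(stacked, y))
```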

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
Am 04.12.2012 12:26, schrieb Andreas Mueller: > Am 04.12.2012 12:20, schrieb Olivier Grisel: >> 2012/12/4 Philipp Singer : >>>> It's probably better to train a linear classifier on the text features >>>> alone and a second (potentially non linear classifier s

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
> Have you scaled your additional features to the [0-1] range as the > probability features from the text classifier? > Until now I performed Scaler() (I'm on 0.12 atm) on the new feature space. Should I do this on my appended features only? But well, they are not exactly between 0 and 1 then. I

Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
Am 04.12.2012 15:15, schrieb Olivier Grisel: > 2012/12/4 Philipp Singer : >> >>> Have you scaled your additional features to the [0-1] range as the >>> probability features from the text classifier? >>> >> >> Until now I performed Scaler() (im on 0.

[Scikit-learn-general] Get classification report inside grid search or cv

2012-12-06 Thread Philipp Singer
Hey! Is it possible to somehow get detailed prediction information inside grid search or cross validation for individual folds or grids. So i.e., I want to know how my classes perform for each of my folds I am doing in GridSearchCV. I can read the average scores using grid_scores_ and this is
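In later scikit-learn releases (0.16+) this became straightforward: `cross_val_predict` returns one out-of-fold prediction per sample, and `classification_report` then breaks performance down per class, which averaged grid scores cannot. A sketch:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)

# One out-of-fold prediction per sample across 5 folds; the report then
# shows precision/recall/F1 for every class, not just an average.
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
print(classification_report(y, pred))
```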

Re: [Scikit-learn-general] does anyone do dot( sparse vec, sparse vec ) ?

2012-12-27 Thread Philipp Singer
Am 27.12.2012 18:32, schrieb Olivier Grisel: > 2012/12/27 denis : >> Olivier Grisel writes: >> >>> 2012/12/27 denis : Folks, does any module in scikit-learn do dot( sparse vec, sparse vec ) a lot ? I wanted to try out a fast dot_sparse_vec (time ~ nnz, space ~ n) but so far

Re: [Scikit-learn-general] ANN: scikit-learn 0.13 released!

2013-01-22 Thread Philipp Singer
Great work as always guys! Eager to try out the new features, especially the feature hashing. Am 22.01.2013 00:02, schrieb Andreas Mueller: > Hi all. > I am very happy to announce the release of scikit-learn 0.13. > New features in this release include feature hashing for text processing, > passi

[Scikit-learn-general] Multilabel questions

2013-01-23 Thread Philipp Singer
Hey guys! I am currently trying to do multilabel prediction using textual features (e.g., tfidf). My data has a varying number of labels per sample. One sample can have just one label and another can have 10 labels. I now simply built a list of tuples for my y vector. So for example: (19, 8,

Re: [Scikit-learn-general] Multilabel questions

2013-01-23 Thread Philipp Singer
p Am 23.01.2013 16:33, schrieb Andreas Mueller: > Hi Philipp. > LinearSVC can not cope with multilabel problems. > It seems it is not doing enough input validation. > You have to use OneVsRestClassifier together with LinearSVC > to do that afaik. > Cheers, > Andy > > Am 2

Re: [Scikit-learn-general] Multilabel questions

2013-01-24 Thread Philipp Singer
Am 23.01.2013 18:39, schrieb Lars Buitinck: > 2013/1/23 Andreas Mueller : >> Am 23.01.2013 16:47, schrieb Philipp Singer: >>> That's what I originally thought, but then I tried it with just using >>> LinearSVC and it magically worked for my sample dataset, reall

Re: [Scikit-learn-general] Multilabel questions

2013-01-24 Thread Philipp Singer
Yep, I know that. The PR looks promising, will look into it. Just another question: If the OVR predicts multiple labels for a sample, are they somehow ranked? I know it is just the one vs rest approach, but maybe there is some kind of confidence involved. Because then the evaluation would be i
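A sketch of the confidence idea on toy data: with `OneVsRestClassifier`, `decision_function` returns one margin per label, and sorting those margins ranks the predicted labels. `MultiLabelBinarizer` turns the variable-length label tuples into the indicator matrix that current releases expect:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Variable-length label tuples per sample, as in the post.
labels = [(0,), (1, 2), (0, 2), (1,), (0, 1), (2,)] * 5
X = np.random.RandomState(0).rand(30, 4)

Y = MultiLabelBinarizer().fit_transform(labels)   # (30, 3) indicator matrix

clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)

# One margin per label; higher means more confident, so sorting the
# margins yields a ranking over the labels.
scores = clf.decision_function(X[:1])
ranked = np.argsort(scores[0])[::-1]
print(ranked)
```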

[Scikit-learn-general] named entity extraction

2013-02-23 Thread Philipp Singer
Hey guys! I currently have the problem of doing named entity extraction on relatively short sparse textual input. I have a predefined set of concepts and training and test data. As I have no real experience with such a thing, I wanted to ask if you can recommend any technique, preferable worki

Re: [Scikit-learn-general] Problem in text feature extraction (sklearn.feature_extraction.text)

2013-02-25 Thread Philipp Singer
I guess the tokenizer starts a new word after a dot, and the word before it ("2") is not two characters long. On 25.02.2013 08:21, amuel...@ais.uni-bonn.de wrote: > the missing 2 in tokenizing 2.50 is indeed a bit weird, though. > > > > Tom Fawcett wrote: > > First, thanks for all your grea

Re: [Scikit-learn-general] Imbalance in scikit-learn

2013-02-25 Thread Philipp Singer
Hey! One simple solution that often works wonders is to set the class_weight parameter of a classifier (if available) to 'auto' [1]. If you have enough data, it often also makes sense to balance the data beforehand. [1] http://scikit-learn.org/dev/modules/svm.html#unbalanced-problems Am 25.02
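A minimal sketch of the suggestion on a made-up 9:1 imbalanced set; note that in releases from 0.17 on, the parameter value is spelled 'balanced' instead of 'auto':

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# 180 majority samples around (0, 0), 20 minority samples around (2, 2).
X = np.vstack([rng.randn(180, 2), rng.randn(20, 2) + 2])
y = np.array([0] * 180 + [1] * 20)

# 'balanced' reweights each class inversely to its frequency, so the
# minority class is not drowned out by the majority.
clf = LinearSVC(class_weight="balanced").fit(X, y)
print(clf.predict(X)[180:].mean())
```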

Re: [Scikit-learn-general] Get every package once and for all

2013-03-07 Thread Philipp Singer
Well, the reason may be that EPD does not include the newest scikit-learn release. Afaik AdaBoost is only included in 0.14, which is the current development version, which you have to install by hand. Regards, Philipp On 07.03.2013 19:55, Mohamed Radhouane Aniba wrote: > Hello > > I

Re: [Scikit-learn-general] Data format

2013-03-08 Thread Philipp Singer
Why do you want to convert libsvm to another structure? I don't quite get it. If you want to use examples: scikit learn has included datasets that can be directly loaded. I think this section should help: http://scikit-learn.org/stable/datasets/index.html Am 08.03.2013 18:44, schrieb Mohamed Ra

Re: [Scikit-learn-general] Multiple training instances in the HMM library

2013-03-18 Thread Philipp Singer
Well, you can quite easily append multiple sequences to each other by introducing a RESET state that you append to the first sequence, and then you add the next, and so on. As the HMM afaik only supports first-order dependencies, this should work quite well. Regards, Philipp On 18.03.2013 21:42, Leo wrote

Re: [Scikit-learn-general] Multiple training instances in the HMMlibrary

2013-03-18 Thread Philipp Singer
even nastier. On Mon, Mar 18, 2013 at 1:49 PM, Philipp Singer <mailto:kill...@gmail.com>> wrote: Well, you can quite easily append multiple sequences to each other by introducing a RESET state that you append to the first sequence and then you add the next and so on. As the HMM afaik

Re: [Scikit-learn-general] Fit functions

2013-04-05 Thread Philipp Singer
Dictionaries do not have duplicate keys (labels). You could only make a list of datawithLabelX for each key label. But what is the benefit of this? Philipp Am 05.04.2013 11:37, schrieb Bill Power: > i know this is going to sound a little silly, but I was thinking there > that it might be nice to

Re: [Scikit-learn-general] tfidfvectorizer for new data

2013-04-13 Thread Philipp Singer
Yep, If I understand you correctly, you just need to call the transform method on your new data using the fitted TfidfVectorizer. Regards, Philipp Am 13.04.2013 19:13, schrieb Alex Kopp: Suppose I used tfidfvectorizer to create features, trained a classifier, did cross-validation, etc.. Let's
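A minimal sketch of that fit/transform split: `fit_transform` learns the vocabulary and idf weights from the training documents, and `transform` reuses them so new data lands in the same feature space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["spam spam ham", "ham eggs", "spam eggs ham"]
new_docs = ["eggs and spam"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)   # learns vocabulary + idf weights
X_new = vec.transform(new_docs)           # reuses them; no refitting

# The new data is mapped into the *same* feature space as the training set;
# unseen tokens ("and") are simply ignored.
print(X_train.shape[1] == X_new.shape[1])
```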

Re: [Scikit-learn-general] Sparse Matrix Formats

2013-04-14 Thread Philipp Singer
Afaik scikit learn works with csr matrices internally as many mathematical operations are just possible for csr matrices. Am 14.04.2013 20:01, schrieb Alex Kopp: Is there a sparse matrix format that is most efficient for sklearn? (COO vs CSR vs LIL) Thanks --

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Philipp Singer
Hi Christian, Some time ago I had similar problems. I.e., I wanted to use additional features to my lexical features and simple concatanation didn't work that well for me even though both feature sets on their own performed pretty well. You can follow the discussion about my problem here [1] i

[Scikit-learn-general] Best classification for very sparse and skewed feature matrix

2012-01-15 Thread Philipp Singer
Hey guys! I am currently trying to use the best possible classifier for my task. In my case I have regularly slightly more features than training examples and overall about 5000 features. The problem is that my representation is very sparse so I have a huge amount of zeros. The labels range fr

Re: [Scikit-learn-general] Best classification for very sparse and skewed feature matrix

2012-01-24 Thread Philipp Singer
Am 15.01.2012 19:45, schrieb Gael Varoquaux: > On Sun, Jan 15, 2012 at 07:39:00PM +0100, Philipp Singer wrote: >> The problem is that my representation is very sparse so I have a huge >> amount of zeros. > That's actually good: some of our estimators are able to use a spa

[Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Philipp Singer
Hey! I am currently using sklearn.feature_extraction.text.Vectorizer for feature extraction of text documents I have. I am now curious and don't quite understand how the TFIDF calculation is don

Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Philipp Singer
The IDF statistics are computed once on the whole training corpus as passed to the `fit` method and then reused on each call to the `transform` method. For a train / test split one typically calls fit_transform on the train split (to compute the IDF vector on the train split only) and reuses those ID