Re: [Scikit-learn-general] Multi-target regression

2014-09-08 Thread Philipp Singer
Is there a description of this somewhere? I can't find it in the documentation.

Thanks!

On 05.09.2014, at 18:40, Flavio Vinicius flavio...@gmail.com wrote:

 In the case of LinearRegression, independent models are being fit for
 each response. But this is not the case for every multi-response
 estimator. Afaik, the multi-response regression forests in sklearn
 will consider the correlations between the responses.
 --
 Flavio
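
A minimal sketch illustrating both behaviours (toy data and estimator choices are mine, not from the thread):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 4)                                  # 100 samples, 4 features
y = np.column_stack([X[:, 0] + X[:, 1],               # 2 targets per sample
                     X[:, 2] - X[:, 3]])

# LinearRegression accepts a 2-D y and fits one coefficient row per
# target, which is equivalent to fitting each target independently.
lr = LinearRegression().fit(X, y)
print(lr.coef_.shape)           # (2, 4)

# Regression forests also accept a 2-D y, but each tree predicts all
# targets jointly, so splits can exploit correlated responses.
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(rf.predict(X[:3]).shape)  # (3, 2)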
 
 
 On Fri, Sep 5, 2014 at 11:03 AM, Philipp Singer kill...@gmail.com wrote:
 Hey!
 
 I am currently working with data that has multiple outcome variables, i.e. the
 outcome I want to predict can be multi-dimensional. One line of the data could
 look like the following:
 
 y = [10, 15]  x = [13, 735478, 0.555, …]
 
 So I want to predict all dimensions of the outcome.
 
 I have seen that some algorithms can predict such multiple targets. I have 
 tried it with LinearRegression and it seems to work fine.
 
 I have not found a clear description of how this works, though. Does it fit
 one regression separately for each outcome variable?
 
 Best,
 Philipp


[Scikit-learn-general] Sparse Random Projection negative weights

2014-08-08 Thread Philipp Singer
Hi,

I asked a question about the sparse random projection a few days ago, but 
thought I should start a new topic regarding my current problem.

I am computing tf-idf weights for my text documents and then calculating the
cosine similarity between documents to determine their similarity.
For dimensionality reduction I am using the SparseRandomProjection class.

My current process looks like the following:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.random_projection import SparseRandomProjection

docs = [text1, text2, …]            # raw document strings
vec = TfidfVectorizer(max_df=0.8)   # ignore terms in more than 80% of the docs
X = vec.fit_transform(docs)
proj = SparseRandomProjection()
X2 = proj.fit_transform(X)
X2 = normalize(X2)                  # L2 normalization, row-wise
sim = X2 * X2.T                     # pairwise cosine similarities

It works reasonably well. However, I found out that the sparse random
projection sets many weights to a negative value; hence, many similarity
scores also end up being negative. Given that tf-idf weights should never be
negative, and that the corresponding cosine similarity scores should therefore
always lie between zero and one, I do not know whether this is an appropriate
approach for my task.

Hope someone has some advice. Maybe I am also doing something wrong here.

Best,
Philipp
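
For what it's worth, the negatives come from the projection matrix itself; a quick check (my sketch, with an explicit n_components so no data-dependent sizing is needed):

import numpy as np
from sklearn.random_projection import SparseRandomProjection

proj = SparseRandomProjection(n_components=100, random_state=0)
proj.fit(np.zeros((1, 10000)))        # fit only needs the input dimensionality
# The sparse projection matrix mixes positive and negative entries, so
# projected tf-idf vectors (and their cosines) can turn negative.
print(proj.components_.min() < 0)     # True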



Re: [Scikit-learn-general] Sparse Random Projection negative weights

2014-08-08 Thread Philipp Singer
Just another remark regarding this:

I guess I cannot circumvent the negative cosine similarity values. Maybe LSA
(TruncatedSVD) is a better approach?
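
A possible LSA variant of the pipeline above (my sketch; note the cosines can still come out slightly negative, since the SVD basis also has mixed signs):

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

svd = TruncatedSVD(n_components=300)
X2 = svd.fit_transform(X)      # X: the tf-idf matrix from the previous mail
X2 = normalize(X2)             # L2-normalize rows
sim = X2.dot(X2.T)             # dense cosine-similarity matrix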



Re: [Scikit-learn-general] Sparse Random Projection negative weights

2014-08-08 Thread Philipp Singer
I always normalize X prior to the random projection, as I have observed that
this consistently produces more accurate results (same for LSA/SVD).

I have not tried increasing eps, as this would lead to far fewer components and
more error. I am also not sure how I should alter the density parameter; I feel
safer leaving it at the 'auto' value, which computes it according to the Li et
al. paper. Could you recommend some value?
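
For reference, two quick checks (my sketch): the 'auto' density documented by scikit-learn is 1 / sqrt(n_features) following Li et al., and johnson_lindenstrauss_min_dim shows how strongly eps drives the number of components:

import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim

n_features = 2526183                      # vocabulary size from the earlier thread
print(1.0 / np.sqrt(n_features))          # 'auto' density: roughly 6e-4

n_samples = 350363
print(johnson_lindenstrauss_min_dim(n_samples, eps=0.1))  # on the order of 10**4
print(johnson_lindenstrauss_min_dim(n_samples, eps=0.2))  # a few thousand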

I think I will be better off with LSA for now. Are there any specific
recommendations for the number of components? I chose 300 for now.

Best,
Philipp

On 08.08.2014, at 13:14, Arnaud Joly a.j...@ulg.ac.be wrote:

 Have you tried increasing the number of components, the epsilon parameter, or
 the density of the SparseRandomProjection?
 Have you tried normalising X prior to the random projection?
 
 Best regards,
 Arnaud
 


[Scikit-learn-general] Sparse Random Projection Issue

2014-08-04 Thread Philipp Singer
Hi all,

I am currently trying to calculate all-pairs similarity between a large number 
of text documents. I am using a TfidfVectorizer for feature generation and then 
want to calculate cosine similarity between the pairs. Hence, I am calculating 
X * X.T between the L2 normalized matrices.

As my data is very large (X.shape = (350363, 2526183)), I thought about
reducing the dimensionality first. I am using the SparseRandomProjection for
this task with the default parameters. I do not normalize the tf-idf features
first; I perform the random projection and then L2-normalize the resulting
data before I multiply the matrix with its transpose. Unfortunately, the
resulting similarity scores fall outside the expected 10% error bound; the
error is mostly somewhere around 20%.

Does anyone know what I am doing wrong?

Apart from that, does anyone know how I can efficiently calculate the
resulting matrix Y = X * X.T? I am currently thinking about using PyTables
with some sort of chunked calculation algorithm. Unfortunately, this is not the
most efficient way of doing it in terms of speed, but it solves the memory
bottleneck. I need the raw similarity scores between all documents in the end.

Thanks!
Best,
Philipp


Re: [Scikit-learn-general] Sparse Random Projection Issue

2014-08-04 Thread Philipp Singer

On 04.08.2014, at 20:54, Lars Buitinck larsm...@gmail.com wrote:

 2014-08-04 17:39 GMT+02:00 Philipp Singer kill...@gmail.com:
 Apart from that, does anyone know a solution of how I can efficiently 
 calculate the resulting matrix Y = X * X.T? I am currently thinking about 
 using PyTables with some sort of chunked calculation algorithm. 
 Unfortunately, this is not the most efficient way of doing it in terms of 
 speed but solves the memory bottleneck. I need the raw similarity scores 
 between all documents in the end.
 
 Just decompose it:
 
 for i in range(0, X.shape[0], K):
Y_K = X * X[i:i+K].T
store_on_a_big_disk(Y_K)
 

This may work. Interesting that scipy can handle this "dimension mismatch". Do
you know how to do this with numpy arrays?

Would you suggest to store the result in a PyTable or memmap or maybe something 
else?

 (You can also use batches of rows instead of batches of columns, just
 make sure you have a 1TB disk available.)
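
One way to cash that out with a memmap (my sketch; dtype, file path and chunk size are placeholders):

import numpy as np

n = X.shape[0]
K = 1000                          # chunk size; tune to available RAM
out = np.memmap('sim.dat', dtype=np.float32, mode='w+', shape=(n, n))
for i in range(0, n, K):
    # One sparse-times-sparse product per chunk of columns; densify
    # the (n, K) block and write it straight to disk.
    out[:, i:i+K] = (X * X[i:i+K].T).toarray()
out.flush()
# At n = 350363 and float32, this file ends up around 0.5 TB.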
 




Re: [Scikit-learn-general] Sparse Random Projection Issue

2014-08-04 Thread Philipp Singer


Please, forget my comment about dimension mismatch. 

 




[Scikit-learn-general] TFIDF question

2013-11-29 Thread Philipp Singer
Hi there,

 

I am currently working with the TfidfVectorizer provided by scikit-learn.
However, I have just run into a problem/question. In my case I have around 20
very long documents. Some terms in these documents occur much, much more
frequently than others. From pure intuition, these terms should get
penalized heavily (weights close to zero) by the tf-idf procedure.

 

Nevertheless, when I look up the top tf-idf terms for each document, such
high-frequency terms are at the top of the list even though they occur in
every single document. I took a deeper look at the specific values, and it
appears that all these terms - which occur in _every_ document - receive idf
values of 1. However, shouldn't these be zero? Because if they are one, the
extremely high term-frequency (tf) counts overrule what idf is supposed to
provide, and rank such terms to the top.

 

I think this is done in the TfidfTransformer in this line:

# avoid division by zeros for features that occur in all documents

idf = np.log(float(n_samples) / df) + 1.0

 

Why is this specifically done? I thought the division by zero is already
covered by the smoothing, and from my understanding no additional offset is
necessary, because in the end you only calculate tf * idf.
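
A tiny worked example of the effect (my sketch): for a term that occurs in all documents, df == n_samples, so the log term vanishes and the +1 leaves idf at exactly 1:

import numpy as np

n_samples, df = 20, 20
idf = np.log(float(n_samples) / df) + 1.0
print(idf)   # 1.0, so tf * idf == tf and high raw counts dominate the ranking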

 

Hope someone can help me out.

 

Cheers,

Philipp

 



Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Philipp Singer
Alright! By removing the +1 the results seem much more legit.

Also, the sublinear transformation makes sense. However, why use min_df=2 if I 
am worried about very common words?

-----Original Message-----
From: Lars Buitinck [mailto:larsm...@gmail.com]
Sent: Friday, 29 November 2013, 14:08


 I think this is done in the TfidfTransformer in this line:

 # avoid division by zeros for features that occur in all documents

 idf = np.log(float(n_samples) / df) + 1.0

 Why is this specifically done? I thought the division by zero is 
 already covered by the smoothing. There seems to be no additional 
 division necessary from my understanding, because finally you only calculate 
 tf * idf.

I think this is a workaround for a bug in a previous iteration of tfidf. You 
can try turning it off and maybe we should turn it off in master, or replace it 
with log(n_samples / (df + 1.)).

Anyway, if you're worried about very common words, try setting min_df=2, and if 
you have a few long documents, try sublinear_tf=True.
That replaces tf with 1 + log(tf) so repeated occurrences of a word get 
penalized.
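
A sketch of both suggestions combined (parameter values as named in the mail):

from sklearn.feature_extraction.text import TfidfVectorizer

# sublinear_tf replaces tf with 1 + log(tf), damping huge raw counts;
# min_df=2 drops terms that appear in fewer than two documents.
vec = TfidfVectorizer(min_df=2, sublinear_tf=True)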





Re: [Scikit-learn-general] logsum algorithm

2013-08-29 Thread Philipp Singer

Hi,

This seems to be simply the so-called log-sum-exp trick.

It is indeed used to avoid underflow and overflow problems, as you already mention.

This great video might help:
http://www.youtube.com/watch?v=-RVM21Voo7Q

Regards,
Philipp
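
A quick numeric illustration of why the max is subtracted (my sketch):

import numpy as np

X = np.array([1000.0, 1000.0])
print(np.log(np.sum(np.exp(X))))           # inf: exp(1000) overflows

m = X.max()
print(np.log(np.sum(np.exp(X - m))) + m)   # 1000.693..., computed safely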

On 29.08.2013 19:32, David Reed wrote:

Hello,

Was hoping someone could shed some light on the added complexity of 
subtracting maxv and then adding it back in at the end:


@cython.boundscheck(False)
def _logsum(int N, np.ndarray[dtype_t, ndim=1] X):
    cdef int i
    cdef double maxv, Xsum
    Xsum = 0.0
    maxv = X.max()
    for i in xrange(N):
        Xsum += exp(X[i] - maxv)
    return log(Xsum) + maxv

I'm pretty sure it's to mitigate underflow or overflow errors, but it seems
like those could still be issues.







Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Philipp Singer
Hi Christian,

Some time ago I had similar problems. I.e., I wanted to use additional
features alongside my lexical features, and simple concatenation didn't work
that well for me, even though both feature sets performed pretty well on
their own.

You can follow the discussion about my problem here [1] if you scroll
down - ignore the starting discussion. The best solution I ended up with was
the one suggested by Olivier: you basically train a linear classifier on
your lexical features and then use the predict_proba outcome together with
your additional categorical features to train a second classifier - for
example random forests; see the sketch below. It was also helpful to perform
leave-one-out when training the probabilities (if you have few samples).

[1]
http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K5F2BJ_ms51a-61HwmNrAyRTb1W0KK7ziBPzGAcdiBRqQ%40mail.gmail.com&forum_name=scikit-learn-general
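
A rough sketch of that stacking setup (estimator choices and variable names are mine; I use LogisticRegression in stage one since LinearSVC has no predict_proba):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stage 1: a linear model on the lexical features alone.
vec = TfidfVectorizer()
X_text = vec.fit_transform(texts_train)      # texts_train, y_train assumed given
clf1 = LogisticRegression().fit(X_text, y_train)
P = clf1.predict_proba(X_text)               # (n_samples, n_classes)
# NB: ideally compute P on held-out folds (e.g. leave-one-out, as
# mentioned above) rather than on the training data, to avoid leakage.

# Stage 2: a forest on [class probabilities + categorical features].
X_stack = np.hstack([P, X_cat_train])        # X_cat_train: dense categorical matrix
clf2 = RandomForestClassifier(n_estimators=100).fit(X_stack, y_train)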

If you find out anything else, let us know ;)

Regards,
Philipp

On 01.06.2013 20:30, Christian Jauvin wrote:
 Hi,

 I asked a (perhaps too vague?) question about the use of Random
 Forests with a mix of categorical and lexical features on two ML
 forums (stats.SE and MetaOp), but since it has received no attention,
 I figured that it might work better on this list (I'm using sklearn's
 RF of course):

 I'm working on a binary classification problem for which the dataset
 is mostly composed of categorical features, but also a few lexical
 ones (i.e. article titles and abstracts). I'm experimenting with
 Random Forests, and my current strategy is to build the training set
 by appending the k best lexical features (chosen with univariate
 feature selection, and weighted with tf-idf) to the full set of
 categorical features. This works reasonably well, but as I cannot find
 explicit references to such a strategy of using hybrid features for
 RF, I have doubts about my approach: does it make sense? Am I
 diluting the power of the RF by doing so, and should I rather try to
 combine two classifiers specializing on both types of features?

 http://stats.stackexchange.com/questions/60162/random-forest-with-a-mix-of-categorical-and-lexical-features

 Thanks,

 Christian





Re: [Scikit-learn-general] Fit functions

2013-04-05 Thread Philipp Singer
Dictionaries cannot have duplicate keys (labels). You could only make a
list of dataWithLabelX for each key label. But what is the benefit of this?

Philipp
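
For what it's worth, unpacking such a dict into the usual (X, y) pair is a tiny helper (my sketch; the dict values are assumed to be 2-D arrays of samples):

import numpy as np

def dict_to_Xy(data):
    # data: {label: array of samples carrying that label}
    X = np.vstack(list(data.values()))
    y = np.concatenate([np.full(len(v), k) for k, v in data.items()])
    return X, y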

On 05.04.2013 11:37, Bill Power wrote:
 I know this is going to sound a little silly, but I was thinking that
 it might be nice to be able to do this with scikit-learn:

 clf = sklearn.anyClassifier()
 clf.fit({0: dataWithLabel0,
          1: dataWithLabel1})

 instead of having to separate the data/labels manually. I guess fit
 would do that internally, but it might be nice to have this.

 bill
 bill







Re: [Scikit-learn-general] Multiple training instances in the HMM library

2013-03-18 Thread Philipp Singer
Well, you can quite easily append multiple sequences to each other by
introducing a RESET state: you append it to the first sequence, then
add the next sequence, and so on. As the HMM afaik only supports
first-order models, this should work quite well.

Regards,
Philipp

On 18.03.2013 21:42, Leon Palafox wrote:
Yes, I meant that. I think it is a very important functionality, since it is
the one that would allow us to put together nice speech recognition examples
as well as other niceties.



On Mon, Mar 18, 2013 at 1:34 PM, Lars Buitinck l.j.buiti...@uva.nl wrote:

2013/3/18 Leon Palafox leonoe...@gmail.com:
 I know the HMM library is in a so-so case, but I was wondering whether it
 has the capability of learning from multiple training examples, since the
 examples in the site all focus on single trial cases.

You mean multiple sequences? Last time I checked it couldn't.

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam






--
Leon Palafox, M.Sc
PhD Candidate
Iba Laboratory
+81-3-5841-8436
University of Tokyo
Tokyo, Japan.







Re: [Scikit-learn-general] Multiple training instances in the HMM library

2013-03-18 Thread Philipp Singer
To be honest, I am not that familiar with Hidden Markov Models yet, but I
apply Markov chain models very frequently. There, training the model on
multiple independent sequences at once is a standard technique.


So let's assume we work with first-order Markov chains and have two
independent sequences given:


a - b - c
d - b - a

Then I would introduce a generic reset state denoted R and concatenate
the paths the following way (the first R may or may not make sense, depending
on what you want to achieve, but generally I would include it):


(R) - a - b - c - R - d - b - a - R

You train your MM (HMM) with this sequence; with the first-order
property this is no problem, because the memorylessness assumption implies
that we forget everything before a reset state.


Then for a test sequence you may have for example:

R - b - b - d - R

As mentioned, I have not tested this with HMMs, but for Markov chains 
this makes sense and works fine.
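
A toy first-order Markov-chain version of the trick (pure-Python sketch, not the sklearn HMM API):

from collections import Counter

seqs = [['a', 'b', 'c'], ['d', 'b', 'a']]

# Concatenate with a reset state R between (and around) the sequences.
chain = ['R']
for s in seqs:
    chain += s + ['R']

# First-order transition counts: no pair ever spans two sequences,
# because R always sits in between.
transitions = Counter(zip(chain, chain[1:]))
print(transitions[('c', 'R')], transitions[('c', 'd')])   # 1 0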


Regards,
Philipp

On 18.03.2013 21:59, Didier Vila wrote:


Any code for your example is more than welcome.

Didier Vila, PhD | Risk | CapQuest Group Ltd | Fleet 27 | Rye
Close | Fleet | Hampshire | GU51 2QQ | Tel: 0871 574 7989 | Fax: 0871
574 2992 | Email: dv...@capquestco.com


From: Leon Palafox [mailto:leonoe...@gmail.com]
Sent: 18 March 2013 20:57
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] Multiple training instances in
the HMM library


But I agree, it is one hack that can be done outside of the code.

On Mon, Mar 18, 2013 at 1:56 PM, Leon Palafox leonoe...@gmail.com wrote:


Yeah, but wouldn't that defeat the whole point of training an HMM to
learn batches of data of length N?


If I'm following you, you would append K sequences of length N, ending up
with a whole sequence of size K*N. And when you have a new
observation of length N, in order to predict, you would have to tile
it so it fits the shape of the whole model, and each of the training
examples can evaluate the new observation?


Sounds even nastier.


Re: [Scikit-learn-general] Data format

2013-03-08 Thread Philipp Singer
Why do you want to convert libsvm to another structure?

I don't quite get it.

If you want to use the examples: scikit-learn includes datasets that can
be loaded directly. I think this section should help:
http://scikit-learn.org/stable/datasets/index.html
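
For completeness, the function in question is itself only one line to use (the file path is a placeholder):

from sklearn.datasets import load_svmlight_file

# X is a scipy.sparse CSR matrix, y a numpy array; both plug straight
# into any estimator's fit(X, y).
X, y = load_svmlight_file('data.libsvm')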

On 08.03.2013 18:44, Mohamed Radhouane Aniba wrote:
 Hello !

 I am wondering if someone has developed a snippet or a script that converts
 the libsvm format into a format directly usable by scikit-learn, without the
 need to use load_svmlight_file.

 The reason is that I am trying to use the examples provided on the website, 
 but all of them are written in a format that is not a libsvm one.

 Thanks

 Rad








Re: [Scikit-learn-general] Get every package once and for all

2013-03-07 Thread Philipp Singer
Well, the reason may be that EPD does not include the newest scikit-learn
distribution.

Afaik, AdaBoost is only included as of 0.14, which is the current development
version and has to be installed by hand.

Regards,
Philipp
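
A quick way to check what you are actually running (my sketch):

import sklearn
print(sklearn.__version__)   # AdaBoostClassifier needs the 0.14 development line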

On 07.03.2013 19:55, Mohamed Radhouane Aniba wrote:
 Hello

 I am just starting to use scikit-learn, as you might guess by now, and
 something is really frustrating about it.

 I am trying to run examples from the website to get used to the kit; some
 just work fine, some others are not working because of missing libraries.

 For example, I am trying to get plot_classifier_comparison.py to work, but I
 get an error message saying:

 ImportError: cannot import name AdaBoostClassifier

 Other classifiers work fine; why are some not recognized?

 Can someone point me to the way to get everything working once and for all,
 even those packages we will not necessarily use?

 I am using a MacBook Pro with the EPD kit (Python).

 Thanks

 Rad






Re: [Scikit-learn-general] Imbalance in scikit-learn

2013-02-25 Thread Philipp Singer
Hey!

One simple solution that often works wonders is to set the class_weight 
parameter of a classifier (if available) to 'auto' [1].

If you have enough data, it often also makes sense to balance the data 
beforehand.

[1] http://scikit-learn.org/dev/modules/svm.html#unbalanced-problems
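
A minimal sketch of the class_weight suggestion (the estimator choice is mine; note that 'auto' was later renamed 'balanced' in newer scikit-learn releases):

from sklearn.svm import SVC

# 'auto' reweights classes inversely proportional to their frequencies.
clf = SVC(kernel='linear', class_weight='auto')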

On 25.02.2013 14:02, Maor Hornstein wrote:
 I'm using scikit-learn in my Python program in order to perform some
 machine-learning operations. The problem is that my data-set has severe
 imbalance issues.

 Does anyone know a solution for imbalance in scikit-learn or in python
 in general?


 Thanks :)








[Scikit-learn-general] named entity extraction

2013-02-23 Thread Philipp Singer
Hey guys!

I currently have the problem of doing named entity extraction on
relatively short, sparse textual input.

I have a predefined set of concepts as well as training and test data.

As I have no real experience with such a task, I wanted to ask if you
can recommend any technique, preferably one working via scikit-learn.

Thanks and many regards,
Philipp



Re: [Scikit-learn-general] Multilabel questions

2013-01-24 Thread Philipp Singer
Yep, I know that.

The PR looks promising, will look into it.

Just another question: if the OVR predicts multiple labels for a sample,
are they somehow ranked? I know it is just the one-vs-rest approach, but
maybe there is some kind of confidence involved. The evaluation would then
be interesting, by looking at rankings.

Regards,
Philipp

On 24.01.2013 09:56, Arnaud Joly wrote:
 You should also be aware that the current metrics module doesn't handle
 multi-label problems correctly.

 The following PR might interest you:
 https://github.com/scikit-learn/scikit-learn/pull/1606
 It adds multi-label support for some metrics.

 Best regards,
 Arnaud Joly

 On 23/01/2013 18:44, Andreas Mueller wrote:
 On 23.01.2013 18:39, Lars Buitinck wrote:
 if you want more predictions or something...
 More in detail: OneVsRestClassifier exports an object called
 label_binarizer_, which is used to transform decision function values
 D back to class labels. By default, it picks all the classes for which
 D > 0, but its threshold argument can be used to change that.

 So, if clf is an OvR classifier and

D = clf.decision_function(x)

 for a *single sample* x contains no positive values, then

# untested, may contain mistakes
clf.label_binarizer_.inverse_transform(D, threshold=(D.max() - epsilon))

 will predict at least one class label for x, namely the one with the
 highest value according to the decision_function. The epsilon is
 needed because inverse_transform compares values using >, not >=; set
 it to a small value. Doing this for batches of samples is a bit more
 involved.
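
A possible batch variant (my sketch, untested against the 0.13 API, producing a binary indicator matrix rather than label tuples):

import numpy as np

D = clf.decision_function(X)   # shape (n_samples, n_classes)
eps = 1e-9
# A label is predicted if its score is positive OR it is the row maximum,
# so every sample gets at least one label.
Y = (D > 0) | (D >= D.max(axis=1)[:, np.newaxis] - eps)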

 Of course, you can set the threshold to any value. Whether any of this
 makes sense depends on your problem.

 [I used to be opposed to exporting the LabelBinarizer object on OvR
 estimators, but I guess I should give up the struggle now -- this is
 actually useful.]

 I didn't even realize this possibility existed. I would have done it by
 hand.
 Thanks for the instructions.








[Scikit-learn-general] Multilabel questions

2013-01-23 Thread Philipp Singer
Hey guys!

I am currently trying to do multilabel prediction using textual features 
(e.g., tfidf).

My data has a varying number of labels per sample. One sample can have just
one label and another can have 10 labels.

I now simply built a list of tuples for my y vector.

So for example:
(19, 8, 7, 5)
(8, 22, 23, 6, 18, 3)
(22,)
...

I have decided to use LinearSVC as a first step. When I train the
classifier with about 10,000 samples, all works fine and the
prediction output looks fine as well.

But as soon as I use all my samples (~300,000), my python.exe crashes on
Windows. So I tried it on my Linux server, and I get a segfault there.

I have some more questions regarding multilabel classification, but 
let's stick to this first ;)

Many Regards,
Philipp



Re: [Scikit-learn-general] Multilabel questions

2013-01-23 Thread Philipp Singer
Hey,

That's what I originally thought, but then I tried it just using
LinearSVC and it magically worked for my sample dataset - really
interesting. I think it is working properly now.

What I am asking myself is how exactly the decision is made for the 
multilabel prediction. Is there some way of influencing it? For example 
sometimes it predicts zero classes and sometimes several.

Is it also possible to pass a MultinomialNB to the OVR classifier? Or 
would I just use the predict_proba output and then decide myself how 
many and which labels I would predict?

Regards,
Philipp

On 23.01.2013 16:33, Andreas Mueller wrote:
 Hi Philipp.
 LinearSVC cannot cope with multilabel problems.
 It seems it is not doing enough input validation.
 You have to use OneVsRestClassifier together with LinearSVC
 to do that, afaik.
 Cheers,
 Andy
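
A sketch of that OvR setup (X is assumed to be the tf-idf matrix from the original mail; the label-tuple input shown here was how that era of scikit-learn took multilabel targets, later replaced by binary indicator matrices):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

y = [(19, 8, 7, 5), (8, 22, 23, 6, 18, 3), (22,)]   # one tuple per sample
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, y)            # OvR binarizes the label tuples internally
pred = clf.predict(X)    # tuples of predicted labels, possibly empty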



Re: [Scikit-learn-general] ANN: scikit-learn 0.13 released!

2013-01-22 Thread Philipp Singer
Great work as always guys!

Eager to try out the new features, especially the feature hashing.
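
For anyone else curious, a minimal FeatureHasher example (my sketch of the new 0.13 feature):

from sklearn.feature_extraction import FeatureHasher

# Hash (feature, value) pairs into a fixed-width sparse matrix, with no
# vocabulary to fit or store.
h = FeatureHasher(n_features=2**20, input_type='dict')
X = h.transform([{'word': 1, 'another': 2}, {'word': 3}])
print(X.shape)   # (2, 1048576)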

On 22.01.2013 00:02, Andreas Mueller wrote:
 Hi all.
 I am very happy to announce the release of scikit-learn 0.13.
 New features in this release include feature hashing for text processing,
 passive-aggressive classifiers, faster random forests and many more.

 There have also been countless improvements in stability, consistency and
 usability.

 Details can be found on the what's new page:
 http://scikit-learn.org/stable/whats_new.html

 Sources and windows binaries are available on sourceforge,
 through pypi (http://pypi.python.org/pypi/scikit-learn/0.13) or
 can be installed directly using pip:

pip install -U scikit-learn

 A big thank you to all the contributors who made this release possible!

 In parallel to the release, we started a small survey to get to know our
 user base a bit more:
 https://docs.google.com/spreadsheet/viewform?formkey=dFdyeGNhMzlCRWZUdldpMEZlZ1B1YkE6MQ#gid=0
 If you are using scikit-learn, it would be great if you could give us
 your input.

 Best,
 Andy







Re: [Scikit-learn-general] does anyone do dot( sparse vec, sparse vec ) ?

2012-12-27 Thread Philipp Singer
On 27.12.2012 18:32, Olivier Grisel wrote:
 2012/12/27 denis denis-bz...@t-online.de:
 Olivier Grisel olivier.grisel@... writes:

 2012/12/27 denis denis-bz-gg@...:
 Folks,
does any module in scikit-learn do dot( sparse vec, sparse vec ) a lot ?
 I wanted to try out a fast dot_sparse_vec (time ~ nnz, space ~ n)
 but so far I see only safe_sparse_dot( big sparse array, numpy array )
 e.g. for RandomPCA.
 The speed of the sparse matrix dot sparse matrix depends on the
 actual implementation of the scipy.sparse matrices.
 Olivier,
 sorry, I wasn't clear: I want to try out my fast NEW implementation of
 dot( sparse vec, sparse vec )
 and am looking for a test case in scikit-learn that does a lot of those,
 to measure the speedup.
 cheers
 Alright. AFAIK we don't have a use case in scikit-learn for that kind
 of operation yet.

 Computing k-nn queries using cosine similarity on a pre-normalized
 sparse vector corpus + query might be a valid use case though.
I agree. You could do something like all-pairs cosine similarity using a
large sparse matrix.
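
Which, with pre-normalized rows, is a couple of lines (my sketch):

from sklearn.preprocessing import normalize

Xn = normalize(X)    # L2-normalize the rows of a sparse CSR matrix
sim = Xn * Xn.T      # all-pairs cosine similarity; one sparse-sparse dot per pair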


 --
 Olivier
 http://twitter.com/ogrisel - http://github.com/ogrisel





[Scikit-learn-general] Get classification report inside grid search or cv

2012-12-06 Thread Philipp Singer
Hey!

Is it possible to somehow get detailed prediction information inside
grid search or cross-validation, for individual folds or grid points?

I.e., I want to know how my classes perform on each of the folds I am
doing in GridSearchCV. I can read the average scores using grid_scores_,
and this is fine, but I want to see information one step deeper. It
would be enough to get y_true and y_predicted for each fold.

Regards,
Philipp



Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer

 It's probably better to train a linear classifier on the text features
 alone and a second (potentially non-linear classifier such as GBRT or
 ExtraTrees) on the predict_proba outcome of the text classifier + your
 additional low-dimensional features.

 This is some kind of stacking method (a sort of ensemble method). It
 should make the text features not overwhelm the final classifier if
 the other features are informative.

Hey Olivier!

Thanks for the hints. I just tried it, but unfortunately the results are
much worse than just using my textual features alone.

Just to be sure I am doing it right:

At first I create my textual features using a vectorizer. Then I fit a
linear SVC on these features (training data, of course) and use predict_proba
on my training samples again, resulting in a probability distribution of
dimension 7 (I have 7 classes).

Then I append my additional features (there are 15 of them) and fit another
classifier on the new data. (I tried several scaling/normalizing ideas
without improvement.)

I do the same procedure for the test data. (Btw, I do cross-validation.)

While I get an F1 score of 0.85 using the textual data alone, the combined
approach yields only 0.4.
Philipp




Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
On 04.12.2012 12:26, Andreas Mueller wrote:
 On 04.12.2012 12:20, Olivier Grisel wrote:
 2012/12/4 Philipp Singer kill...@gmail.com:
 Have you scaled your additional features to the [0-1] range as the
 probability features from the text classifier?

 If you do a full grid search of the SVC hyperparameters (e.g. kernel
 linear or rbf, and C + gamma for RBF only), there is no reason that the
 stacked model should be worse than the original text classifier (unless
 you have very few samples and the additional features are pure
 noise).
 Can't the stacked model be worse because of overfitting issues?
 I guess if you include a linear SVM, it might be able to learn the identity
 and be as good as the original classifier. With only RBF-SVM,
 I'm not sure this is possible.

 But testing just a linear SVM should definitely not make things worse
 if the grid search is done correctly.

I use a linear SVM for learning the probabilities for my samples (I
used grid search to determine the optimal parameters). Then I append
the additional features and, as suggested, do gradient boosting or extra
trees. What do you mean by testing just a linear SVM? On my
new feature space?

Btw, I just have 64 samples. I will try to append the probability 
features using leave-one-out now.




Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer

 Have you scaled your additional features to the [0-1] range as the
 probability features from the text classifier?


Until now I applied Scaler() (I'm on 0.12 at the moment) to the new feature
space. Should I do this on my appended features only? But then they are
not exactly between 0 and 1. I would probably need the MinMaxScaler from
0.13, which I can't access at the moment.
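
For reference, the 0.13 version would look like this (my sketch; X_extra names the 15 appended columns):

from sklearn.preprocessing import MinMaxScaler   # new in 0.13

# Scale only the appended features to [0, 1], leaving the probability
# features (already in [0, 1]) untouched.
X_extra = MinMaxScaler().fit_transform(X_extra)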




Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-04 Thread Philipp Singer
On 04.12.2012 15:15, Olivier Grisel wrote:
 2012/12/4 Philipp Singer kill...@gmail.com:

 Have you scaled your additional features to the [0-1] range as the
 probability features from the text classifier?


 Until now I performed Scaler() (im on 0.12 atm) on the new feature
 space. Should I do this on my appended features only? But well, they are
 not exactly between 0 or 1 then. I would probably need MinMaxScaler from
 0.13 which I cant access atm.

 Variance based scaling should be good enough.


Interestingly, with an ExtraTreesClassifier I get worse results when I 
scale than if I just leave the features as they are 
(i.e., probability features between 0 and 1 and the rest something 
else). Normalizing along axis 0 doesn't help either.

Regarding the low number of samples: I agree, but I can't change that at the moment :(



Re: [Scikit-learn-general] Append additional data in pipeline

2012-12-03 Thread Philipp Singer
Thanks to Andreas I got it working now using a custom estimator for the 
pipeline.

I am still struggling a bit to combine textual features (e.g., tfidf) 
with other features that work well on their own.

At the moment, I am just concatenating them -- enlarging the vector 
(see the sketch below). The problem now is that the few added features 
do not seem to have any impact on the classifier, as the accuracy is 
exactly the same as if I used only textual features. They just seem to 
be overwhelmed by the huge number of textual features.

Is there some clever way of combining both feature types? Perhaps 
using composite/multiple kernels?

Maybe someone has an idea about that. This is something I have been 
struggling with for a while now, and I still haven't found a clever way of solving it.
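
For concreteness, the plain concatenation I mean -- a minimal sketch 
with toy data:

  import numpy as np
  import scipy.sparse as sp
  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = ["first example document", "second example document"]
  extra = np.array([[0.5, 1.0], [0.25, 0.0]])   # hand-crafted features

  X_text = TfidfVectorizer().fit_transform(docs)         # sparse, high-dim
  X = sp.hstack([X_text, sp.csr_matrix(extra)]).tocsr()  # enlarged vectors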

Regards,
Philipp



[Scikit-learn-general] Potential problem with Leave-one-out and f1_score

2012-11-30 Thread Philipp Singer
Hey!

First of all: thanks for the hints for my last post.

I decided to stick with Leave-One-Out for now, and I'm doing grid search 
with cross validation using Leave-One-Out.

As I am interested in retrieving the F1 score, I am using it as 
score_func. The problem now is that the following error message comes up:

ValueError: pos_label=1 is not a valid label: array([ 0.,  3.])

The problem seems to be that the score_func assumes a binary 
classification and needs a pos_label that matches the labels, in this 
case 0 or 3. However, it is a multiclass classification. Passing 
pos_label=None doesn't work in this case either.

Does anyone have a hint what I am doing wrong?
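
What I am after is the macro-averaged multiclass F1. With a scikit-learn 
recent enough to have make_scorer and the model_selection module, a 
sketch would be (data omitted):

  from sklearn.metrics import f1_score, make_scorer
  from sklearn.model_selection import GridSearchCV, LeaveOneOut
  from sklearn.svm import LinearSVC

  # Macro-averaged F1 sidesteps the binary pos_label assumption.
  macro_f1 = make_scorer(f1_score, average='macro')
  grid = GridSearchCV(LinearSVC(), {'C': [0.1, 1, 10]},
                      scoring=macro_f1, cv=LeaveOneOut())
  # grid.fit(X, y) would then select C by macro F1 over the LOO folds.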

Thanks
Philipp



[Scikit-learn-general] Append additional data in pipeline

2012-11-30 Thread Philipp Singer
Hey again!

Today is my posting day, hope you don't mind, but I just stumbled upon 
a further problem.

I currently use a grid search / StratifiedKFold approach that works on 
textual data, so I use a pipeline that does tfidf vectorization as well. 
The thing now is that I want to append additional features that are not 
textual to the feature data.

Is there some way of doing so in the pipeline? Of course, I could do the 
tfidf transformations etc. beforehand and append the additional features 
there, but then the whole idea of fitting on the training data 
only is lost.

Regards,
Philipp



Re: [Scikit-learn-general] Append additional data in pipeline

2012-11-30 Thread Philipp Singer
On 30.11.2012 17:31, Andreas Mueller wrote:
 On 30.11.2012 16:58, Philipp Singer wrote:
 Hey again!

 Today is my posting day, hope you don't bother, but I just stumbled upon
 a further problem.

 I currently use a grid search / StratifiedKFold approach that works on
 textual data. So I use a pipeline that does tfidf vectorization as well.
 The thing now is, that I want to append additional features that are not
 textual to the feature data.
 This kind of (but not really) sounds like a job for FeatureUnion:
 http://scikit-learn.sourceforge.net/dev/modules/pipeline.html#featureunion-combining-feature-extractors

 Feature union applies to different transformers to the same input data.
 But you already start with two kinds of data, right?
Yep exactly. One with textual data and the other with other kind of 
features.
 I guess you could make your data be a list of tuples (text, non-text).
 Then you would still need a transformer that projects to the components,
 though.
 This might not be ideal.
I thought about building a custom transformer that I can pass to the 
pipeline and that somehow appends the features to the train and test data. But 
the problem is that I don't know exactly which data is used for the 
splits (i.e., which samples). How would you do it with a list of tuples?
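
One hedged sketch of the list-of-tuples idea with FeatureUnion (the 
selector classes below are hypothetical helpers, not scikit-learn API):

  import numpy as np
  from sklearn.base import BaseEstimator, TransformerMixin
  from sklearn.pipeline import Pipeline, FeatureUnion
  from sklearn.feature_extraction.text import TfidfVectorizer

  class FieldSelector(BaseEstimator, TransformerMixin):
      # Pick one component out of (text, extra) tuples.
      def __init__(self, index):
          self.index = index
      def fit(self, X, y=None):
          return self
      def transform(self, X):
          return [row[self.index] for row in X]

  class ToArray(BaseEstimator, TransformerMixin):
      # Turn the list of extra-feature rows into a 2D array.
      def fit(self, X, y=None):
          return self
      def transform(self, X):
          return np.asarray(X)

  union = FeatureUnion([
      ('text', Pipeline([('pick', FieldSelector(0)),
                         ('tfidf', TfidfVectorizer())])),
      ('extra', Pipeline([('pick', FieldSelector(1)),
                          ('arr', ToArray())])),
  ])

  data = [("some document", [0.1, 3.0]), ("another document", [0.7, 1.0])]
  X = union.fit_transform(data)   # tfidf block hstacked with the extras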

 Though I have no better idea.

 Cheers,
 Andy
Thanks, Philipp




[Scikit-learn-general] Cross validation iterator - leave one out per class

2012-11-29 Thread Philipp Singer
Hey!

I have the following scenario:

I have, e.g., three different classes. For class 0 I may have 6 samples, 
for class 1 ten, and for class 2 four.

I now want to do cross validation ten times, but in my case I want to 
train on all samples of each class except one, which I want to use as test 
data. I know that there is a Leave-One-Out mechanism in scikit-learn, but 
it just leaves one example out in total; I want to leave one out per 
class.

Does this even make sense? ;) If so, is there some easy way of doing so 
in scikit-learn?
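
A hand-rolled sketch of what I mean (a hypothetical helper, not a 
scikit-learn API): each fold holds out one random sample per class and 
trains on the rest.

  import numpy as np

  def leave_one_out_per_class(y, n_iter=10, seed=0):
      rng = np.random.RandomState(seed)
      y = np.asarray(y)
      for _ in range(n_iter):
          # One held-out index per class ...
          test = np.array([rng.choice(np.where(y == c)[0])
                           for c in np.unique(y)])
          # ... and everything else is training data.
          train = np.setdiff1d(np.arange(len(y)), test)
          yield train, test

  y = [0] * 6 + [1] * 10 + [2] * 4
  for train_idx, test_idx in leave_one_out_per_class(y):
      pass  # fit on train_idx, evaluate on test_idx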

Regards,
Philipp



Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-27 Thread Philipp Singer
On 27.10.2012 23:43, Joseph Turian wrote:
 If you only care about near matches and not the full n^2 matrix:

 +1 to OG's suggestion to use pylucene.

 You can use pylucene to generate candidates, and then compute the
 exact tf*idf cosine distance on the shortlist.

Yes, exactly. I would only need the most similar matches.

The problem with the Lucene solution is that I do not need tfidf. I 
really have to compute plain cosine similarity on my existing vectors.

So, e.g., my matrix (vectors) looks like this:

[[1 2 5]
 [3 1 0]]

Now I want the cosine similarity between rows one and two, or in this 
case the row most similar to row one, using plain cosine similarity 
without any further variations. As already mentioned, I have the data in sparse form.
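
For reference, a minimal sketch of the exact computation on sparse 
rows -- L2-normalize, then the dot product is the cosine:

  import numpy as np
  import scipy.sparse as sp
  from sklearn.preprocessing import normalize

  X = sp.csr_matrix(np.array([[1, 2, 5], [3, 1, 0]], dtype=float))
  Xn = normalize(X)      # rows now have unit L2 norm
  sim = Xn.dot(Xn.T)     # sim[i, j] = cosine(row i, row j)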

 I assume this will be n log n.

 Another option for fast all-pairs is to use locality sensitive
 hashing. (I didn't read the papers or see if that's what they do.)
 It is not clear what the accuracy will be, but it will probably be the 
 fastest.
Yeah, some kind of dimensionality reduction is another option, but it 
would be hard for me because I have already done all my 
previous experiments on the complete representations, so any 
faster solution for my exact problem would be awesome.

Regards,
Philipp

 On Fri, Oct 26, 2012 at 3:31 PM, Philipp Singer kill...@gmail.com wrote:
 On 26.10.2012 15:35, Olivier Grisel wrote:
 BTW, in the mean time you could encode your coocurrences as text
 identifiers use either Lucene/Solr in Java using the sunburnt python
 client or woosh [1] in python as a way to do efficient sparse lookups
 in such a sparse matrix to be able to quickly compute the non zero
 cosine similarities between all pairs. Solr also as MoreLikeThis
 queries that can be used to truncate the search to the top most
 similar samples in the set of samples in the case you have some very
 frequent non zero features that would mostly break the sparsity of the
 cosine similarity matrix. As Trey Grainger says in his talk Building
 a real time, solr-powered recommendation engine: A Lucene index is a
 multi-dimensional sparse matrix… with very fast and powerful lookup
 capabilities. [1] http://packages.python.org/Whoosh/quickstart.html
 [2]
 http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine
 Thanks, this looks promising. What exactly do you mean by encoding
 co-occurrences as text identifiers? How would I handle my sparse vectors
 then?

 I know the MoreLikeThis functionality, but does it do exact cosine
 similarity? The thing is that I need this relatedness measure for my
 studies.

 Philipp








[Scikit-learn-general] All-pairs-similarity calculation

2012-10-26 Thread Philipp Singer
Hey there!

Currently I am working on very large sparse vectors and have to 
calculate similarity between all pairs of them.

I have now looked into the available code in scikit-learn and also at 
corresponding literature.
So I stumbled upon this paper [1] and the corresponding implementation [2].

I was now wondering whether this would be a potential improvement / help for 
scikit-learn when working with very large feature files where it is still 
necessary to calculate the pair-wise similarity of vectors for different 
classifiers or other tasks. So the goal would be to speed this whole 
thing up.

I am by far no expert in this area, but just wanted to ask you guys 
for your opinion ;)

Regards,
Philipp

[1] http://www.bayardo.org/ps/www2007.pdf
[2] http://code.google.com/p/google-all-pairs-similarity-search/



Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-26 Thread Philipp Singer
On 26.10.2012 14:27, Olivier Grisel wrote:
 2012/10/26 Philipp Singer kill...@gmail.com:
 Hey there!

 Currently I am working on very large sparse vectors and have to
 calculate similarity between all pairs of them.
 How many features? Are they sparse? If so which sparsity level?

In detail: I have a large co-occurrence matrix with a shape of around 
3.7 million x 3.7 million. Yes, the vectors are sparse; I can't tell you the exact 
sparsity level right now, but they should be very sparse, 
because a single element only has co-occurrence counts with a small 
number of other elements in my case.

The problem is that I need cosine similarity, so I also 
can't use the specialized distance implementations available 
in numpy, scipy or scikit-learn; I just pass a callable 
function that does the job. (Currently, I am using a completely custom 
implementation for this, because it is just impossible to calculate the 
all-pairs similarity for my large data at the moment.)

 I have now looked into the available code in scikit-learn and also at
 corresponding literature.
 So I stumbled upon this paper [1] and the corresponding implementation [2].

 I was now thinking, if this would be a potential improvement / help for
 scikit-learn for working with very large feature files where it is still
 necessary to calculate the pair-wise similarity of vectors for different
 classificators or other tasks. So the goal would be to speed this whole
 thing up.

 I am by far no expert in this thing, but just wanted to ask you guys
 about your opinion ;)
 Computing the sparse cosine similarity matrix of a large dataset (both
 in n_samples and n_features) is really lacking in scikit-learn and I
 wanted to implement some tools to do this efficiently when working on
 my power iteration clustering pull request some time ago but never
 found the time to do it.

 My idea was to use an in-memory inverted index structure, similar to
 fulltext indexer such as lucene but using integer feature indices
 rather than string feature names / tokens.

 Such a data structure would also be interesting for the
 sklearn.neighbors to do efficient k-nearest neighbors multiclass or
 multilabel classification on high dimensional sparse data (which we
 don't address efficiently with the current BallTree datastructure that
 is optimal for less than 100 dense features).
That would be awesome, as I already had the impression that k-nearest 
neighbors is very slow for large data in scikit-learn; that is 
also the link to classification I made above, for which this would be 
helpful too.

 [1] http://www.bayardo.org/ps/www2007.pdf
 [2] http://code.google.com/p/google-all-pairs-similarity-search/
 Thanks for the links, added them to my reading list.
Perfect ;)

Regards,
Philipp





Re: [Scikit-learn-general] How to save an array of models

2012-10-18 Thread Philipp Singer
On 17.10.2012 20:57, Kenneth C. Arnold wrote:


 import cPickle as pickle # faster on Py2.x, default on Py3.
 with open(filename, 'wb') as f:
      pickle.dump(obj, f, -1)

 The -1 at the end chooses the latest file format version, which is more
 compact.

What exactly does -1 do? I guess that's the protocol. I have always 
used 2 in this case; I didn't know about -1.
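
For the record, a quick sketch of what -1 resolves to:

  import pickle

  # protocol=-1 selects pickle.HIGHEST_PROTOCOL, i.e. the newest and
  # most compact format the running Python supports.
  obj = {"model": [1, 2, 3]}
  with open("models.pickle", "wb") as f:
      pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)   # same as protocol=-1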

Regards,
Philipp


 -Ken


 On Wed, Oct 17, 2012 at 1:31 PM, Niall Twomey twom...@gmail.com
 mailto:twom...@gmail.com wrote:

 Hi all.

 I want to save an array of models trained on lots of data to file. I
 have tried the following code (roughly speaking anyway)

 models = []
 # Populate the list of models with dict items containing one
 number and PCA and GMM models
 import pickle
 pickle.dump( models.pickle, models )


 but I get errors saying:

 AttributeError: 'list' object has no attribute 'write'.

 which presumably referrs to the models list.

 Saving them to file is crucial for me, but I have no idea how to
 proceed from here. Any advice will be appreciated.

 Thanks.


 









Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
On 14.09.2012 14:53, Andreas Müller wrote:
 Hi Philipp.

Hey Andreas!
 First, you should ensure that the features all have approximately the same 
 scale.
 For example they should all be between zero and one - if the LDA features
 are much smaller than the other ones, then they will probably not be weighted 
 much.

The LDA features sum to 1 for each sample, because they describe the 
probability of a sample belonging to each of the topics (in this 
case 500). So basically, they are between 0 and 1.

 Which LDA package did you use?

We used Mallet's LDA implementation, because from experience they have 
the most established smoothing processes. http://mallet.cs.umass.edu/

If we just train on the LDA features we btw get reasonable results, a 
bit worse than pure TFIDF.

 I am not very experienced with this kind of model, but maybe it would be 
 helpful
 to look at some univariate statistics, like ``feature_selection.chi2``, to see
 if the LDA features are actually helpful.

Yeah, this would be something I could look into. I have already tried to 
to feature selection with chi2 but not actually looked at the specific 
statistics.

 Cheers,
 Andy

Regards,
Philipp


 - Original Message -
 From: Philipp Singer kill...@gmail.com
 To: scikit-learn-general@lists.sourceforge.net
 Sent: Friday, 14 September 2012 13:47:30
 Subject: [Scikit-learn-general] Combining TFIDF and LDA features

 Hey there!

 I have seen in the past some few research papers that combined tfidf
 based features with LDA topic model features and they could increase
 their accuracy by some useful extent.

 I now wanted to do the same. As a simple step I just appended the topic
 features to each train and test sample alongside the existing tfidf features
 and performed my standard LinearSVC - oh btw, thanks that the confusion
 with dense and sparse is now resolved in 0.12 ;) - on it.

 The problem now is that the overall results are exactly the same. Some
 classes perform better and some worse.

 I am not exactly sure if this is a data problem, or comes from my lack
 of understanding of such feature extension techniques.

 Is it possible that the huge number of tfidf features somehow overrules
 the rather small number of topic features? Do I maybe have to do some
 feature modification - because tfidf and LDA features are of a different
 nature?

 Maybe it is also due to the classifier and I need something else?

 Would be happy if someone could shed a little light on my problems ;)

 Regards,
 Philipp






Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
On 14.09.2012 15:10, amir rahimi wrote:
 Have you done tests using some other classifiers such as gradient
 boosting which has a kind of internal feature selection?

Actually not, but I wanted to try that out, if the runtime allows it.

 On Fri, Sep 14, 2012 at 5:36 PM, Andreas Müller
 amuel...@ais.uni-bonn.de mailto:amuel...@ais.uni-bonn.de wrote:

 I'd be interested in the outcome.
 Let us know when you get it to work :)


 - Original Message -
 From: Philipp Singer kill...@gmail.com mailto:kill...@gmail.com
 To: scikit-learn-general@lists.sourceforge.net
 mailto:scikit-learn-general@lists.sourceforge.net
 Sent: Friday, 14 September 2012 14:00:48
 Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features

  On 14.09.2012 14:53, Andreas Müller wrote:
   Hi Philipp.

 Hey Andreas!
   First, you should ensure that the features all have approximately
 the same scale.
   For example they should all be between zero and one - if the LDA
 features
   are much smaller than the other ones, then they will probably not
 be weighted much.

 LDA features sum up to 1 for one sample, because they describe the
 probability of one sample to belong to the different topics (in this
 case 500). So basically, they are between 0 and 1.
  
   Which LDA package did you use?

 We used Mallet's LDA implementation, because from experience they have
 the most established smoothing processes. http://mallet.cs.umass.edu/

 If we just train on the LDA features we btw get reasonable results, a
 bit worse than pure TFIDF.
  
   I am not very experienced with this kind of model, but maybe it
 would be helpful
   to look at some univariate statistics, like
 ``feature_selection.chi2``, to see
   if the LDA features are actually helpful.

 Yeah, this would be something I could look into. I have already tried to
 to feature selection with chi2 but not actually looked at the specific
 statistics.
  
   Cheers,
   Andy

 Regards,
 Philipp
  
  
   - Original Message -
   From: Philipp Singer kill...@gmail.com mailto:kill...@gmail.com
   To: scikit-learn-general@lists.sourceforge.net
 mailto:scikit-learn-general@lists.sourceforge.net
   Sent: Friday, 14 September 2012 13:47:30
   Subject: [Scikit-learn-general] Combining TFIDF and LDA features
  
   Hey there!
  
   I have seen in the past some few research papers that combined tfidf
   based features with LDA topic model features and they could increase
   their accuracy by some useful extent.
  
   I now wanted to do the same. As a simple step I just appended the
 topic
   features to each train and test sample with the existing tfidf
 features
   and performed my standard LinearSVC - oh btw thanks that the
 confusion
   with dense and sparse is now resolved in 0.12 ;) - on it.
  
   The problem now is, that the results are overall exactly similar.
 Some
   classes perform better and some worse.
  
   I am not exactly sure if this is a data problem, or comes from my
 lack
   of understanding of such feature extension techniques.
  
   Is it possible that the huge amount of tfidf features somehow
 overrules
   the rather small number of topic features? Do I maybe have to do some
   feature modification - because tfidf and LDA features are of
 different
   nature?
  
   Maybe it is also due to the classifier and I need something else?
  
   Would be happy if someone could shed a little light on my problems ;)
  
   Regards,
   Philipp
  
  
 

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
Okay, so I did a quick chi2 check, and it seems like some LDA features 
have high chi2 scores (low p-values), so they should be helpful at least.

On 14.09.2012 15:06, Andreas Müller wrote:
 I'd be interested in the outcome.
 Let us know when you get it to work :)


 - Original Message -
 From: Philipp Singer kill...@gmail.com
 To: scikit-learn-general@lists.sourceforge.net
 Sent: Friday, 14 September 2012 14:00:48
 Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features

 On 14.09.2012 14:53, Andreas Müller wrote:
 Hi Philipp.

 Hey Andreas!
 First, you should ensure that the features all have approximately the same 
 scale.
 For example they should all be between zero and one - if the LDA features
 are much smaller than the other ones, then they will probably not be 
 weighted much.

 LDA features sum up to 1 for one sample, because they describe the
 probability of one sample to belong to the different topics (in this
 case 500). So basically, they are between 0 and 1.

 Which LDA package did you use?

 We used Mallet's LDA implementation, because from experience they have
 the most established smoothing processes. http://mallet.cs.umass.edu/

 If we just train on the LDA features we btw get reasonable results, a
 bit worse than pure TFIDF.

 I am not very experienced with this kind of model, but maybe it would be 
 helpful
 to look at some univariate statistics, like ``feature_selection.chi2``, to 
 see
 if the LDA features are actually helpful.

 Yeah, this would be something I could look into. I have already tried to
 to feature selection with chi2 but not actually looked at the specific
 statistics.

 Cheers,
 Andy

 Regards,
 Philipp


 - Original Message -
 From: Philipp Singer kill...@gmail.com
 To: scikit-learn-general@lists.sourceforge.net
 Sent: Friday, 14 September 2012 13:47:30
 Subject: [Scikit-learn-general] Combining TFIDF and LDA features

 Hey there!

 I have seen in the past some few research papers that combined tfidf
 based features with LDA topic model features and they could increase
 their accuracy by some useful extent.

 I now wanted to do the same. As a simple step I just appended the topic
 features to each train and test sample with the existing tfidf features
 and performed my standard LinearSVC - oh btw thanks that the confusion
 with dense and sparse is now resolved in 0.12 ;) - on it.

 The problem now is, that the results are overall exactly similar. Some
 classes perform better and some worse.

 I am not exactly sure if this is a data problem, or comes from my lack
 of understanding of such feature extension techniques.

 Is it possible that the huge amount of tfidf features somehow overrules
 the rather small number of topic features? Do I maybe have to do some
 feature modification - because tfidf and LDA features are of different
 nature?

 Maybe it is also due to the classifier and I need something else?

 Would be happy if someone could shed a little light on my problems ;)

 Regards,
 Philipp


Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
Hey!

On 14.09.2012 15:10, Peter Prettenhofer wrote:

 I totally agree - I had such an issue in my research as well
 (combining word presence features with SVD embeddings).
 I followed Blitzer et. al 2006 and normalized** both feature groups
 separately - e.g. you could normalize word presence features such that
 L1 norm equals 1 and do the same for the SVD embeddings.

Isn't the normalization already part of the tfidf transformation?
So basically the word presence tfidf features are already L2 normalized, 
but maybe I misunderstand this completely.

 In my work I had the impression though, that L1|L2 normalization was
 inferior to simply scale the embeddings by a constant alpha such that
 the average L2 norm is 1.[1]

Ah, I see. How exactly would I do that? Isn't that the same thing the 
normalizer in scikit-learn is doing?
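
My reading of the constant-alpha scaling, as a toy sketch: one scalar 
for the whole embedding block, chosen so the *average* row L2 norm is 1, 
rather than normalizing each row individually:

  import numpy as np

  X_embed = np.random.RandomState(0).randn(5, 3)   # toy SVD/LDA block
  row_norms = np.sqrt((X_embed ** 2).sum(axis=1))
  alpha = 1.0 / row_norms.mean()
  X_scaled = alpha * X_embed   # rows now have average L2 norm 1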

 ** normalization here means row level normalization - similar do
 document length normalization in TF/IDF.

 HTH,
   Peter

Regards,
Philipp

 Blitzer et al. 2006, Domain Adaptation using Structural Correspondence
 Learning, http://john.blitzer.com/papers/emnlp06.pdf

 [1] This is also described here:
 http://scikit-learn.org/dev/modules/sgd.html#tips-on-practical-use




Re: [Scikit-learn-general] how to pickel CountVectorizer

2012-08-08 Thread Philipp Singer
On 08.08.2012 14:53, David Montgomery wrote:

 So... does it make sense to pickle CountVectorizer?  I just did not
 want to fit CountVectorizer every time I wanted to score an SVM model.


It makes sense to pickle the fitted vectorizer. In your case you are 
just trying to pickle the plain, unfitted object.
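
A minimal sketch of that:

  import pickle
  from sklearn.feature_extraction.text import CountVectorizer

  vec = CountVectorizer()
  vec.fit(["some training documents", "go here"])   # fit first ...
  with open("vectorizer.pickle", "wb") as f:        # ... then pickle
      pickle.dump(vec, f, pickle.HIGHEST_PROTOCOL)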

Regards,
Philipp



Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
On 18.07.2012 15:32, Peter Prettenhofer wrote:
 In this case I would fit one MultinomialNB for the foreground model and
 one for the background model. But how would I do the feature extraction
 (I have text documents) in this case? Would I fit (e.g., tfidf) on the
 whole corpus (foreground + background) and then transform both datasets
 on the fitted infos and the test dataset as well?

 Personally, I'd start without using IDF; otherwise, wrap both
 estimators using a Pipeline and add a TfidfTransformer (see [1]).

 best,
   Peter

 [1] 
 http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html



Everything works fine now. The sad thing though is that I still can't 
really improve the classification results. The only thing I can achieve 
is a higher recall for the classes that work well in the background 
model, but precision drops at the same time. Overall I stay at 
about the same average score when incorporating the background model.

If anyone has any further ideas, please let me know ;)

Regards,
Philipp



Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
On 20.07.2012 11:47, Lars Buitinck wrote:
 2012/7/20 Philipp Singer kill...@gmail.com:
 Everything works fine now. The sad thing though is that I still can't
 really improve the classification results. The only thing I can achieve
 is a higher recall for the classes that work well in the background
 model, but precision drops at the same time. Overall I stay at
 about the same average score when incorporating the background model.

 If anyone has any further ideas, please let me know ;)

 Well, since Gael already mentioned semi-supervised training using
 label propagation: I have an old PR which has still not been merged,
 mostly because of API reasons, that implements semi-supervised
 training of Naive Bayes using an EM algorithm:

  https://github.com/scikit-learn/scikit-learn/pull/430

 I've seen improvements in F1 score when doing text classification with
 this algorithm. It may take some work to get this up to speed with the
 latest scikit-learn, though.

Hey Lars,

Thanks, this looks awesome. I will try it out. The reason why I haven't 
used label propagation techniques yet is that I could not achieve a 
fast enough runtime, because I have huge amounts of unlabeled/background 
data available.

 (Just out of curiosity, which topic models did you try? I'm looking
 into these for my own projects.)

We have been using Mallet's LDA based Parallel Topic Model.

Philipp





Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
On 20.07.2012 11:47, Lars Buitinck wrote:


 Well, since Gael already mentioned semi-supervised training using
 label propagation: I have an old PR which has still not been merged,
 mostly because of API reasons, that implements semi-supervised
 training of Naive Bayes using an EM algorithm:

  https://github.com/scikit-learn/scikit-learn/pull/430

 I've seen improvements in F1 score when doing text classification with
 this algorithm. It may take some work to get this up to speed with the
 latest scikit-learn, though.

Hey again!

I have just tried out your implementation of semi-supervised 
MultinomialNB. The code works flawlessly, but unfortunately the 
performance of the algorithm drops sharply when I try to incorporate 
my additional data.

I am starting to think that my additional data is useless :/

Just for the record:

Training on my 96,000 labeled samples with MultinomialNB gets me an f1-score 
of 0.47. Adding around 2,000,000 unlabeled samples via your 
semi-supervised code achieves an f1-score of 0.39.

Regards,
Philipp



Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-20 Thread Philipp Singer
On 20.07.2012 15:34, Lars Buitinck wrote:
 2012/7/20 Philipp Singer kill...@gmail.com:
 I have just tried out your implementation of semi-supervised
 MultinomialNB. The code works flawlessly, but unfortunately the
 performance of the algorithm drops sharply when I try to incorporate
 my additional data.

 I am starting to think that my additional data is useless :/

 Just for the record:

 Training on my 96,000 labeled samples with MultinomialNB gets me an f1-score
 of 0.47. Adding around 2,000,000 unlabeled samples via your
 semi-supervised code achieves an f1-score of 0.39.
 Hmm, too bad. Is the extra data from a very different source?

Not very different, but documents produced by a different kind of user.

I really thought that this data could somehow improve the whole 
classification process, because fitting a model on the extra data alone 
leads to an f1-score of 0.27, which is pretty good for that data.



Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-11 Thread Philipp Singer
On 11.07.2012 10:11, Olivier Grisel wrote:


 LinearSVC is based on the liblinear C++ library which AFAIK does not
 support sample weight.

Well, that's true.

  You should better have a look at SGDClassifier:

 http://scikit-learn.org/stable/modules/sgd.html


I have already tried approaches like SGDClassifier or multinomial naive Bayes. I 
can improve these two classifiers with sample weighting, but the thing 
is that LinearSVC without the incorporated data still outperforms the 
other algorithms.

But I guess I will play around a bit more ;)




[Scikit-learn-general] Incorporation of extra training examples

2012-07-09 Thread Philipp Singer
Hey!

I am currently doing text classification. I have the following setup:

78 classes
max 1500 train examples per class
overall around 90,000 train examples
same amount of test examples

I am pretty happy with the classification results (~52% f1 score), which 
is fine for my task.

But now I have another scenario. I have around 2,000,000 extra training 
examples available which are produced by a certain set of users and do not 
_directly_ correspond to the classes, but I still know the labels of 
this data. If I train the classifier on this extra data alone (without 
the correct one) I can achieve an F1-score of ~25%. So this somehow tells 
me that there is information available that I now want to 
incorporate into my existing data. For a few classes this data even 
works slightly better, or at least similarly.

I have simply tried combining both datasets (90,000 + 2,000,000), but 
this makes the results worse (the amount of test data always stays the same). 
This is not surprising, because a lot of noise is added to the data and I 
think that the huge amount of extra data somehow overrules the existing one.

My question now is, how I can incorporate this data the best in order to 
achieve better classification results than with my first dataset. Maybe 
someone has an idea or there are some techniques for that.

Just for the record: I use Tf-Idf with an SVC, which works best. I have 
also tried a different approach using topic models.

Thanks and many regards,
Philipp



Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-09 Thread Philipp Singer
On 09.07.2012 13:59, Vlad Niculae wrote:
 Another (hackish) idea to try would be to keep the labels of the extra
 data bit give it a sample_weight low enough not to override your good
 training data.

That's actually a great and simple idea. Would I do that similarly to this 
example: 
http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html

So, for example, using a 10 times higher weight for the corresponding samples 
as a starting point?

I see that the fit method of LinearSVC doesn't have a sample_weight 
parameter, so I guess I would have to switch to another method. SVC 
unfortunately has a very long runtime compared to LinearSVC, but maybe 
SGDClassifier would work.
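
A sketch of the down-weighting idea with SGDClassifier, whose fit method 
does accept sample_weight (the toy data and the 0.1 factor are just 
placeholders):

  import numpy as np
  from sklearn.linear_model import SGDClassifier

  X = np.random.RandomState(0).randn(30, 5)
  y = np.arange(30) % 3
  is_extra = np.arange(30) >= 20      # pretend the last 10 rows are extra data
  w = np.where(is_extra, 0.1, 1.0)    # extra samples count 10x less

  clf = SGDClassifier(loss="hinge", random_state=0)
  clf.fit(X, y, sample_weight=w)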

Regards,
Philipp


 On 09.07.2012, at 12:43, Philipp Singer kill...@gmail.com wrote:

 Hey!

 I am currently doing text classification. I have the following setup:

 78 classes
 max 1500 train examples per class
 overall around 90,000 train examples
 same amount of test examples

 I am pretty happy with the classification results (~52% f1 score) which
 is fine for my task.

 But now I have another scenario. I have around 2,000,000 extra training
 examples available which are produced by a certain set of users and do not
 _directly_ correspond to the classes, but I still know the labels of
 this data. If I train the classifier on this extra data alone (without
 the correct one) I can achieve an F1-score of ~25%. So this somehow tells
 me that there is information available that I now want to
 incorporate into my existing data. For a few classes this data even
 works slightly better, or at least similarly.

 I have simply tried combining both datasets (90,000 + 2,000,000), but
 this makes the results worse (the amount of test data always stays the same).
 This is not surprising, because a lot of noise is added to the data and I
 think that the huge amount of extra data somehow overrules the existing one.

 My question now is, how I can incorporate this data the best in order to
 achieve better classification results than with my first dataset. Maybe
 someone has an idea or there are some techniques for that.

 Just for the record: I use Tf-Idf with a SVC which works best. I have
 also tried a different approach using topic models.

 Thanks and many regards,
 Philipp





Re: [Scikit-learn-general] Incorporation of extra training examples

2012-07-09 Thread Philipp Singer
On 09.07.2012 13:47, Peter Prettenhofer wrote:
 Hi,

Hey!

 some quick thoughts:

 - if you use a multinomial Naive Bayes classifier (aka a language
 model) you can fit a background model on the large dataset and use
 that to smooth the model fitted on the smaller dataset.

That's a nice idea. Is there a simple way to try this out fast in 
scikit-learn?
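
One hedged reading of the background smoothing, as a toy sketch: fit two 
MultinomialNB models and linearly interpolate their per-class word 
distributions. This pokes at the fitted feature_log_prob_ attribute 
directly, which is a hack rather than a supported API:

  import numpy as np
  from sklearn.naive_bayes import MultinomialNB

  rng = np.random.RandomState(0)
  X_small, y_small = rng.poisson(1.0, (40, 20)), rng.randint(0, 2, 40)
  X_big, y_big = rng.poisson(1.0, (400, 20)), rng.randint(0, 2, 400)

  fg = MultinomialNB().fit(X_small, y_small)   # foreground model
  bg = MultinomialNB().fit(X_big, y_big)       # background model

  lam = 0.8   # weight on the foreground model
  p = (lam * np.exp(fg.feature_log_prob_)
       + (1 - lam) * np.exp(bg.feature_log_prob_))
  fg.feature_log_prob_ = np.log(p)   # fg now predicts with smoothed probabilities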


 - you should look at the domain adaptation / multi-task learning
 literature - this might fit your setting better than traditional
 semi-supervised learning.

Thanks, I will look into that.

 best,
   Peter

Regards,
Philipp

 2012/7/9 Gael Varoquauxgael.varoqu...@normalesup.org:
 Hi,

 You can try setting this as a semi-supervised learning problem and using
 label propagation:

 http://scikit-learn.org/stable/modules/label_propagation.html#label-propagation

 HTH,

 G








Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation

2012-06-01 Thread Philipp Singer
In terms of accuracy. Runtime is not the problem.

Philipp

On 01.06.2012 18:58, Andreas Mueller wrote:
 Hi Philipp.
 Do you mean it performs worse in terms of accuracy or in terms of runtime?
 Cheers,
 Andy

 On 01.06.2012 18:57, Philipp Singer wrote:
 Hey!

So I have now tried it, adding epsilon to my entries. My first impression was 
that it performs pretty similarly to my old dense version.
But apparently I just hit cases where this method performs much 
worse :(

 Any hints on that?

 Regards,
 Philipp

  On 30.05.2012 15:52, Andreas Mueller wrote:
 Hi Philipp.
 The problem with using sparse matrices is that adding an epsilon
 would make them dense. I haven't really looked at it but I think
 it should somehow be possible to use this approximation also
 on sparse matrices.
 Cheers,
 Andy

  On 30.05.2012 15:45, Philipp Singer wrote:
 Hey Andy!

 Yep I am using it successfully ;)

  The idea of adding epsilon sounds legit. I will definitely try it out.

  I think it would be nice if you could add it to your code. It would also
  make it easier to work with sparse matrices.

 Regards,
 Philipp

 Hi Philipp.
 Great to hear that someone is using that :)

 The problem is that the approximation uses a log.
 Afaik even the exact kernel is not defined if two features are compared
 that are both exactly zero.
 Usually I just work around that by adding an epsilon.
 I was considering adding that to the code. What do you think?

 Cheers,
 Andy


Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation

2012-05-30 Thread Philipp Singer
Hey Andy!

Yep, I am using it successfully ;)

The idea of adding epsilon sounds legit. I will definitely try it out.

I think it would be nice if you could add it to your code. It would also 
make it easier to work with sparse matrices.
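
For reference, a minimal sketch of the epsilon workaround with 
AdditiveChi2Sampler (the eps value is a guess, and note that adding it 
densifies a sparse matrix):

  import numpy as np
  from sklearn.kernel_approximation import AdditiveChi2Sampler
  from sklearn.svm import LinearSVC

  X = np.array([[0.3, 0.5, 0.2], [0.9, 0.1, 0.0]])
  y = np.array([0, 1])

  eps = 1e-8   # keep every entry strictly positive for the log
  X_feat = AdditiveChi2Sampler(sample_steps=2).fit_transform(X + eps)
  clf = LinearSVC().fit(X_feat, y)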

Regards,
Philipp

 Hi Philipp.
 Great to hear that someone is using that :)

 The problem is that the approximation uses a log.
 Afaik even the exact kernel is not defined if two features are compared
 that are both exactly zero.
 Usually I just work around that by adding an epsilon.
 I was considering adding that to the code. What do you think?

 Cheers,
 Andy





[Scikit-learn-general] Porter Stemmer

2012-05-25 Thread Philipp Singer
Hey!

Is it possible to easily include stemming in the text feature extraction in 
scikit-learn?

I know that nltk has an implementation of the Porter stemmer, but I do 
not want to change my whole
text feature extraction process to nltk if possible. It would be nice if I 
could include that somehow easily.
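
A minimal sketch of what I mean, assuming NLTK is installed: plug a 
stemming analyzer into CountVectorizer via its analyzer parameter.

  from nltk.stem import PorterStemmer
  from sklearn.feature_extraction.text import CountVectorizer

  stemmer = PorterStemmer()
  base_analyzer = CountVectorizer().build_analyzer()

  def stemmed_analyzer(doc):
      # Tokenize as usual, then stem every token.
      return [stemmer.stem(token) for token in base_analyzer(doc)]

  vec = CountVectorizer(analyzer=stemmed_analyzer)
  X = vec.fit_transform(["running runners run", "the runner ran"])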

Thanks,
Philipp



[Scikit-learn-general] Classificator for probability features

2012-05-14 Thread Philipp Singer
Hey there!

I am currently trying to classify a dataset which has the following format:

Class1 0.3 0.5 0.2
Class2 0.9 0.1 0.0
...

So the features are probabilities that always sum to exactly 1.

I have tried several linear classifiers, but I am now wondering if there 
is maybe some better way to classify such data and achieve better results.

Maybe someone has some ideas.

Thanks and regards,
Philipp



Re: [Scikit-learn-general] Classificator for probability features

2012-05-14 Thread Philipp Singer
Thanks a lot for the explanation.

So do I see this right, that I would need to calculate the KL divergence 
for each pair of feature vectors?

I have already tried a pipeline of an additive chi-squared kernel 
approximation followed by a linear SVC. This boosts my results a bit, but I am 
still stuck at an f1 score of 0.25 and I want to improve this if 
possible. Is this the right way to do it?
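
Roughly this pipeline (the parameter values are just my guesses):

  from sklearn.pipeline import Pipeline
  from sklearn.kernel_approximation import AdditiveChi2Sampler
  from sklearn.svm import LinearSVC

  clf = Pipeline([
      ('chi2', AdditiveChi2Sampler(sample_steps=2)),
      ('svc', LinearSVC(C=1.0)),
  ])
  # clf.fit(X_train, y_train) on the probability features, then predict.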

Maybe some tweaks are needed, like changing the parameters etc.

Sorry for the dumb questions, but I haven't used one of these methods 
until now. Still excited to learn more about that ;)

Regards,
Philipp

On 14.05.2012 21:18, David Warde-Farley wrote:
 On Mon, May 14, 2012 at 05:00:54PM +0200, Philipp Singer wrote:
 Thanks, that sounds really promising.

 Is there an implementation of KL divergence in scikit-learn? If so, how can 
 I directly use that?
 I don't believe there is, but it's quite simple to do yourself. Many
 algorithms in scikit-learn can take a precomputed distance matrix.

 Given two points, p and q, on the simplex, the KL divergence between the two
 discrete distributions they represent is simply (p * np.log(p / q)).sum(). Note
 that this is in general not defined if they do not share the same support
 (i.e. if there is a zero at one spot in one but not in the other). In
 practice, if there are any zeros at all, you will need to deal with them
 explicitly, as the logarithm and/or the division will misbehave.

 Note that the grandparent's point that the KL divergence is not a metric is
 not a minor concern: the KL divergence, for example, is _not_ symmetric
 (KL(p, q) != KL(q, p)).  You can of course take the average of KL(p, q) and
 KL(q, p) to symmetrize it, but you still may run into problems with
 algorithms that assume that distances obey the triangle inequality (KL
 divergences do not).

 Personally I would recommend trying Andy's suggestion re: an SVM with a
 chi-squared kernel. For small instances you can precompute the kernel
 matrix and pass it to SVC yourself. If you have a lot of data (or if you want
 to try it out quickly) the kernel approximations module plus a linear SVM
 is a good bet.

 David
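
A minimal sketch of the smoothing-plus-symmetrization approach described
above, assuming p and q are 1-d numpy arrays on the simplex; the eps value
is an arbitrary choice:

  import numpy as np

  def kl_divergence(p, q, eps=1e-10):
      # Smooth both distributions to avoid log(0) and division by zero,
      # then renormalize so each still sums to one.
      p = (p + eps) / (p + eps).sum()
      q = (q + eps) / (q + eps).sum()
      return (p * np.log(p / q)).sum()

  def symmetric_kl(p, q, eps=1e-10):
      # KL(p, q) != KL(q, p) in general, so average the two directions.
      return 0.5 * (kl_divergence(p, q, eps) + kl_divergence(q, p, eps))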




[Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Philipp Singer

Hey!

I am currently using sklearn.feature_extraction.text.Vectorizer
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.Vectorizer.html)
for feature extraction of the text documents I have.


I am now curious and don't quite understand how the TFIDF calculation is 
done. Is it done separately for each document or based on all documents? 
It can't be done for each class of documents, because information about 
the labels is not available.


Hope you can give me some explanations regarding this.

Thanks!

Philipp


Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Philipp Singer

The IDF statistics are computed once on the whole training corpus as
passed to the `fit` method and then reused on each call to the
`transform` method.

For a train / test split, one typically calls fit_transform on the train
split (to compute the IDF vector on the train split only) and reuses
those IDF values on the test split by calling transform only:

  from sklearn.feature_extraction.text import TfidfVectorizer

  vec = TfidfVectorizer()
  tfidf_train = vec.fit_transform(documents_train)
  tfidf_test = vec.transform(documents_test)

The TF-IDF feature extraction per se is unsupervised (it does not need
the labels). You can then train a supervised classifier on its output,
using the class labels of the documents, and pipeline both to get a
document classifier.
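
A minimal sketch of such a pipeline, with LinearSVC standing in as an
arbitrary choice of classifier and labels_train assumed to hold the class
labels:

  from sklearn.pipeline import Pipeline
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.svm import LinearSVC

  text_clf = Pipeline([
      ('tfidf', TfidfVectorizer()),  # unsupervised feature extraction
      ('clf', LinearSVC()),          # supervised classifier on top
  ])
  text_clf.fit(documents_train, labels_train)
  predicted = text_clf.predict(documents_test)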

The new documentation is here:

   
http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction

Here is a sample pipeline:

   
http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html

Alright, thanks for the heads up. That's exactly the way I am using it.

Okay, so the tfidf values are based on the whole corpus.

Wouldn't it make sense to treat just the documents belonging to one
class as the corpus for the calculation?

Regards,
Philipp
 


Re: [Scikit-learn-general] Best classification for very sparse and skewed feature matrix

2012-01-24 Thread Philipp Singer
Am 15.01.2012 19:45, schrieb Gael Varoquaux:
 On Sun, Jan 15, 2012 at 07:39:00PM +0100, Philipp Singer wrote:
 The problem is that my representation is very sparse, so I have a huge
 number of zeros.
 That's actually good: some of our estimators are able to use a sparse
 representation to speed up computation.

 Furthermore the dataset is skewed: one class accounts for a huge share
 of the labels and another one is also pretty large.
 I have successfully used logistic regression and I could achieve a
 recall of about (in the best case dataset) 65%. I am pretty happy with
 that result. But when looking at the confusion matrix the problem is
 that many examples get mapped to the large class.
 Use class_weight='auto' in the logistic regression to counter the
 effect of unbalanced classes.

 For SVMs, the following example shows the trick:
 http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html

 HTH,

 Gael
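
 A sketch of the suggested call; class_weight='auto' derives the weights
 from the training labels, so no manual tuning is needed (X_train and
 y_train are placeholder names):

   from sklearn.linear_model import LogisticRegression

   # 'auto' reweights each class inversely to its frequency,
   # so the large class no longer dominates the fit.
   clf = LogisticRegression(class_weight='auto')
   clf.fit(X_train, y_train)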

Thanks a lot for the help! This helped quite a bit, but I am still not
entirely happy with the results. Any further ideas?

Thanks a lot
Philipp
