Re: [Scikit-learn-general] Multi-target regression
Is there a description of this somewhere? I can't find it in the documentation. Thanks!

On 05.09.2014 at 18:40, Flavio Vinicius <flavio...@gmail.com> wrote:

> In the case of LinearRegression, independent models are being fit for each response. But this is not the case for every multi-response estimator. AFAIK, the multi-response regression forests in sklearn do take the correlations between the outputs into account.
>
> -- Flavio

On Fri, Sep 5, 2014 at 11:03 AM, Philipp Singer <kill...@gmail.com> wrote:

> Hey! I am currently working with data that has multiple outcome variables, i.e. the outcome I want to predict can be multi-dimensional. One line of the data could look like the following:
>
>     y = [10, 15]
>     x = [13, 735478, 0.555, ...]
>
> I want to predict all dimensions of the outcome. I have seen that some algorithms can predict such multiple targets, and I have tried it with LinearRegression, where it seems to work fine. I have not found a clear description of how this works, though. Does it fit one regression separately for each outcome variable?
>
> Best, Philipp

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
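Philipp's closing question can be checked empirically: for ordinary least squares, fitting a two-column target matrix is mathematically identical to fitting each target column separately, which is why LinearRegression handles multiple targets transparently. A small NumPy sketch (toy random data; the intercept is omitted for brevity):

```python
import numpy as np

# Toy data: 6 samples, 3 features, 2 targets.
rng = np.random.RandomState(0)
X = rng.rand(6, 3)
Y = rng.rand(6, 2)

# One least-squares fit against the full 2-column target matrix...
W_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# ...versus an independent least-squares fit per target column.
W_split = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(Y.shape[1])]
)

print(np.allclose(W_joint, W_split))  # True: the columns are fit independently
```

This is just the linear-algebra fact behind the answer; estimators such as multi-output regression forests do not decompose this way.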
[Scikit-learn-general] Sparse Random Projection negative weights
Hi,

I asked a question about sparse random projections a few days ago, but thought I should start a new topic for my current problem. I compute TF-IDF weights for my text documents and then calculate the cosine similarity between documents to determine their similarity. For dimensionality reduction I am using the SparseRandomProjection class. My current process looks like the following:

    docs = [text1, text2, ...]
    vec = TfidfVectorizer(max_df=0.8)
    X = vec.fit_transform(docs)
    proj = SparseRandomProjection()
    X2 = proj.fit_transform(X)
    X2 = normalize(X2)  # L2 normalization
    sim = X2 * X2.T

It works reasonably well. However, I found out that the sparse random projection sets many weights to negative values; hence, many similarity scores also end up negative. Given the original intention of TF-IDF weights (which are never negative) and the corresponding cosine similarity scores (which should then always range between zero and one), I do not know whether this is an appropriate approach for my task. I hope someone has some advice; maybe I am doing something wrong here.

Best, Philipp
Re: [Scikit-learn-general] Sparse Random Projection negative weights
Just another remark regarding this: I guess I cannot circumvent the negative cosine similarity values. Maybe LSA (TruncatedSVD) is a better approach?

On 08.08.2014 at 10:35, Philipp Singer <kill...@gmail.com> wrote:

> [original message quoted in full; see the previous post]
Re: [Scikit-learn-general] Sparse Random Projection negative weights
I always normalize X prior to the random projection, as I have observed that this consistently produces more accurate results (same for LSA/SVD). I have not tried increasing eps, as this would lead to far fewer components and more error. I am also not sure how I should alter the density parameter; I feel safer leaving it at the auto value, which is computed according to the Li et al. paper. Could you recommend a value?

I think I will be better off with LSA for now. Are there any specific recommendations for the number of components? I chose 300 for now.

Best, Philipp

On 08.08.2014 at 13:14, Arnaud Joly <a.j...@ulg.ac.be> wrote:

> Have you tried increasing the number of components, the epsilon parameter, or the density of the SparseRandomProjection? Have you tried normalising X prior to the random projection?
>
> Best regards, Arnaud
>
> On 08 Aug 2014, at 12:19, Philipp Singer <kill...@gmail.com> wrote:
>
> > [earlier messages in this thread quoted in full]
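The negative values are expected: the projection matrix itself has entries of both signs (the sparse Achlioptas-style construction draws from {-1, 0, +1}), so projected vectors have signed coordinates even when the inputs are nonnegative, and cosine similarities can dip below zero. A self-contained NumPy sketch of the effect, using a hand-rolled projection rather than sklearn's class (all sizes and the density are illustrative):

```python
import numpy as np

rng = np.random.RandomState(42)

# Nonnegative "tf-idf-like" data: 50 docs, 1000 terms, ~5% nonzero.
X = rng.rand(50, 1000) * (rng.rand(50, 1000) < 0.05)

# Achlioptas-style sparse random projection to 100 dimensions:
# entries +1/-1 each with probability 1/(2s), 0 with probability 1-1/s.
s = 3.0
R = rng.choice([1.0, 0.0, -1.0], size=(1000, 100),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
R *= np.sqrt(s / 100)

Z = X @ R                                        # signed coordinates
Z /= np.linalg.norm(Z, axis=1, keepdims=True)    # L2 normalization
sim = Z @ Z.T                                    # cosine similarities

print((Z < 0).any())      # True: negative coordinates appear
print(sim.min())          # some pairwise similarities go below zero
```

The projection approximately preserves distances, not the nonnegativity of the inner products, which is why near-orthogonal document pairs scatter to small similarities of either sign.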
[Scikit-learn-general] Sparse Random Projection Issue
Hi all,

I am currently trying to calculate the all-pairs similarity between a large number of text documents. I am using a TfidfVectorizer for feature generation and then want to calculate the cosine similarity between the pairs; hence, I am computing X * X.T between the L2-normalized matrices. As my data is very large (X.shape = (350363, 2526183)), I thought about reducing the dimensionality first. I am using SparseRandomProjection for this task with the default parameters: I do not normalize the TF-IDF features first, then perform the random projection, and then L2-normalize the resulting data before I multiply the matrix with its transpose. Unfortunately, the resulting similarity scores fall outside the expected 10% error, mostly landing somewhere around 20%. Does anyone know what I am doing wrong?

Apart from that, does anyone know a solution for how I can efficiently calculate the resulting matrix Y = X * X.T? I am currently thinking about using PyTables with some sort of chunked calculation algorithm. Unfortunately, this is not the most efficient way of doing it in terms of speed, but it solves the memory bottleneck. I need the raw similarity scores between all documents in the end.

Thanks!

Best, Philipp
Re: [Scikit-learn-general] Sparse Random Projection Issue
On 04.08.2014 at 20:54, Lars Buitinck <larsm...@gmail.com> wrote:

> 2014-08-04 17:39 GMT+02:00 Philipp Singer <kill...@gmail.com>:
> > Apart from that, does anyone know a solution for how I can efficiently calculate the resulting matrix Y = X * X.T? I am currently thinking about using PyTables with some sort of chunked calculation algorithm. [...] I need the raw similarity scores between all documents in the end.
>
> Just decompose it:
>
>     for i in range(0, X.shape[0], K):
>         Y_K = X * X[i:i+K].T
>         store_on_a_big_disk(Y_K)
>
> (You can also use batches of rows instead of batches of columns, just make sure you have a 1 TB disk available.)

This may work. Interesting that scipy can handle this "dimension mismatch". Do you know how to do this with numpy arrays? Would you suggest storing the result in a PyTable or a memmap, or maybe something else?
Re: [Scikit-learn-general] Sparse Random Projection Issue
On 04.08.2014 at 22:14, Philipp Singer <kill...@gmail.com> wrote:

> This may work. Interesting that scipy can handle this "dimension mismatch". Do you know how to do this with numpy arrays? Would you suggest storing the result in a PyTable or a memmap, or maybe something else?

Please forget my comment about the "dimension mismatch".

[earlier messages in this thread quoted in full]
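To answer the memmap question: Lars's chunked loop works just as well with plain NumPy arrays if each row block of Y = X @ X.T is written into a disk-backed np.memmap, so the full Gram matrix never has to fit in RAM. A sketch with a small toy matrix (the file path is a throwaway temp file, and the chunk size K is arbitrary):

```python
import os
import tempfile
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(500, 40)    # stand-in for the (projected) document matrix
n = X.shape[0]
K = 128                  # rows of the Gram matrix computed per chunk

# Disk-backed result matrix.
path = os.path.join(tempfile.mkdtemp(), "gram.dat")
Y = np.memmap(path, dtype="float64", mode="w+", shape=(n, n))

for i in range(0, n, K):
    # similarities of one block of documents against all documents
    Y[i:i + K] = X[i:i + K] @ X.T
Y.flush()

# Spot-check a block against the in-memory product.
print(np.allclose(Y[:K], (X @ X.T)[:K]))  # True
```

At the sizes discussed in the thread (350k x 350k doubles, roughly 1 TB) the memmap file itself becomes the bottleneck, which is why Lars's disk-space warning applies.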
[Scikit-learn-general] TFIDF question
Hi there,

I am currently working with the TfidfVectorizer provided by scikit-learn. However, I just ran into a problem/question. In my case I have around 20 very long documents. Some terms in these documents occur much, much more frequently than others. My intuition says such terms should get penalized heavily (weighted close to zero) by the TF-IDF procedure. Nevertheless, when I look up the top TF-IDF terms for each document, these high-frequency terms are at the top of the list even though they occur in every single document. I took a deeper look into the specific values, and it turns out that all these terms - which occur in _every_ document - receive idf values of 1. Shouldn't these be zero? Because if they are one, the extremely high term frequency (tf) counts overrule the dampening that idf should provide, and rank them to the top.

I think this is done in the TfidfTransformer in this line:

    # avoid division by zeros for features that occur in all documents
    idf = np.log(float(n_samples) / df) + 1.0

Why is this done? I thought division by zero was already covered by the smoothing. There seems to be no additional division necessary from my understanding, because in the end you only calculate tf * idf.

Hope someone can help me out.

Cheers, Philipp
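The observation is easy to reproduce by hand: with idf = log(n_samples/df) + 1, a term that occurs in every document (df == n_samples) gets idf 1 rather than 0, so its large raw tf passes through the weighting unscathed, whereas the plain textbook idf would zero it out. A minimal sketch of both variants:

```python
from math import log

n_samples = 20  # e.g. 20 long documents

def idf_plus_one(df):
    # the TfidfTransformer line under discussion: log(n/df) + 1
    return log(float(n_samples) / df) + 1.0

def idf_plain(df):
    # textbook idf without the +1
    return log(float(n_samples) / df)

df = n_samples  # a term that occurs in every document
print(idf_plus_one(df))  # 1.0 -> tf * 1.0 keeps the raw counts
print(idf_plain(df))     # 0.0 -> the term would be suppressed entirely
```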
Re: [Scikit-learn-general] TFIDF question
Alright! By removing the +1, the results seem much more legitimate. The sublinear transformation also makes sense. However, why use min_df=2 if I am worried about very common words?

-----Original Message-----
From: Lars Buitinck [mailto:larsm...@gmail.com]
Sent: Friday, 29 November 2013, 14:08

> > I think this is done in the TfidfTransformer in this line:
> >
> >     # avoid division by zeros for features that occur in all documents
> >     idf = np.log(float(n_samples) / df) + 1.0
> >
> > Why is this done? I thought division by zero was already covered by the smoothing.
>
> I think this is a workaround for a bug in a previous iteration of tf-idf. You can try turning it off; maybe we should turn it off in master, or replace it with log(n_samples / (df + 1.)). Anyway, if you're worried about very common words, try setting min_df=2, and if you have a few long documents, try sublinear_tf=True. That replaces tf with 1 + log(tf), so repeated occurrences of a word are penalized.
Re: [Scikit-learn-general] logsum algorithm
Hi,

This seems to be the so-called log-sum(-exp) trick. It is indeed used to avoid underflow/overflow problems, as you already suspected. This video might help: http://www.youtube.com/watch?v=-RVM21Voo7Q

Regards, Philipp

On 29.08.2013 at 19:32, David Reed wrote:

> Hello,
>
> I was hoping someone could shed some light on the added complexity of subtracting maxv and then adding it back in at the end:
>
>     @cython.boundscheck(False)
>     def _logsum(int N, np.ndarray[dtype_t, ndim=1] X):
>         cdef int i
>         cdef double maxv, Xsum
>         Xsum = 0.0
>         maxv = X.max()
>         for i in xrange(N):
>             Xsum += exp(X[i] - maxv)
>         return log(Xsum) + maxv
>
> I'm pretty sure it's there to mitigate underflow or overflow errors, but it seems like those could still be issues.
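A pure-Python sketch of the same trick shows why the shift works and why overflow stops being an issue afterwards: once the maximum is subtracted, the largest value ever exponentiated is exp(0) = 1, and the underflow of far-smaller terms merely rounds them to 0, which is harmless inside the sum:

```python
from math import exp, log

def logsumexp(xs):
    # Stable log(sum(exp(x))): factor out the maximum so that the
    # largest exponentiated value is exp(0) = 1.
    m = max(xs)
    return m + log(sum(exp(x - m) for x in xs))

xs = [1000.0, 1000.0]

# The naive formula overflows, since exp(1000) exceeds the double range.
try:
    naive = log(sum(exp(x) for x in xs))
except OverflowError:
    naive = None
print(naive)          # None: the naive version overflowed

print(logsumexp(xs))  # 1000.6931..., i.e. 1000 + log(2)
```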
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Hi Christian,

Some time ago I had a similar problem: I wanted to add extra features to my lexical features, and simple concatenation didn't work that well for me, even though both feature sets performed quite well on their own. You can follow the discussion about my problem here [1] if you scroll down (ignore the opening discussion). The best solution I ended up with was the one suggested by Olivier: you train a linear classifier on your lexical features and then use the predict_proba output, together with your additional categorical features, to train a second classifier - for example, random forests. It was also helpful to produce the probabilities via leave-one-out (if you have few samples).

[1] http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K5F2BJ_ms51a-61HwmNrAyRTb1W0KK7ziBPzGAcdiBRqQ%40mail.gmail.comforum_name=scikit-learn-general

If you find out anything else, let us know ;)

Regards, Philipp

On 01.06.2013 at 20:30, Christian Jauvin wrote:

> Hi,
>
> I asked a (perhaps too vague?) question about the use of Random Forests with a mix of categorical and lexical features on two ML forums (stats.SE and MetaOp), but since it received no attention, I figured it might work better on this list (I'm using sklearn's RF, of course). I'm working on a binary classification problem whose dataset is mostly composed of categorical features, but also a few lexical ones (i.e. article titles and abstracts). I'm experimenting with Random Forests, and my current strategy is to build the training set by appending the k best lexical features (chosen with univariate feature selection, and weighted with tf-idf) to the full set of categorical features. This works reasonably well, but as I cannot find explicit references to such a strategy of using hybrid features for RF, I have doubts about my approach: does it make sense? Am I diluting the power of the RF by doing so, and should I rather try to combine two classifiers specializing in the two types of features?
>
> http://stats.stackexchange.com/questions/60162/random-forest-with-a-mix-of-categorical-and-lexical-features
>
> Thanks, Christian
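The recipe Philipp describes is essentially two-stage stacking. A rough sketch with the modern scikit-learn API and synthetic stand-in data (all shapes, hyperparameters, and the 5-fold choice are placeholders; the thread suggests leave-one-out when samples are few):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
n = 200
X_lex = rng.rand(n, 300)             # stand-in for tf-idf lexical features
X_cat = rng.randint(0, 3, (n, 4))    # stand-in for encoded categorical features
y = rng.randint(0, 2, n)

# Stage 1: linear classifier on the lexical features; out-of-fold
# probabilities avoid leaking training labels into stage 2.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X_lex, y,
                          cv=5, method="predict_proba")

# Stage 2: random forest on [probabilities | categorical features].
X_stacked = np.hstack([proba, X_cat])
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_stacked, y)
print(forest.predict(X_stacked[:5]))
```

The point of the out-of-fold predictions is that the second-stage classifier never sees probabilities that were produced by a model trained on the same rows.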
Re: [Scikit-learn-general] Fit functions
Dictionaries cannot have duplicate keys (labels); you could only map each label to the list of samples carrying that label. But what is the benefit of this?

Philipp

On 05.04.2013 at 11:37, Bill Power wrote:

> I know this is going to sound a little silly, but I was thinking that it might be nice to be able to do this with scikit-learn:
>
>     clf = sklearn.anyClassifier()
>     clf.fit({0: dataWithLabel0, 1: dataWithLabel1})
>
> instead of having to separate the data/labels manually. I guess fit would do that internally, but it might be nice to have this.
>
> bill
Re: [Scikit-learn-general] Multiple training instances in the HMM library
Well, you can quite easily append multiple sequences to each other by introducing a RESET state: you append it to the first sequence, then add the next sequence, and so on. As the HMM implementation AFAIK only supports first-order models, this should work quite well.

Regards, Philipp

On 18.03.2013 at 21:42, Leon Palafox wrote:

> Yes, I meant that. I think this is very important functionality, since it is what would allow us to provide nice speech recognition examples, as well as other niceties.
>
> On Mon, Mar 18, 2013 at 1:34 PM, Lars Buitinck <l.j.buiti...@uva.nl> wrote:
>
> > 2013/3/18 Leon Palafox <leonoe...@gmail.com>:
> > > I know the HMM library is in a so-so state, but I was wondering whether it has the capability of learning from multiple training examples, since the examples on the site all focus on single-trial cases.
> >
> > You mean multiple sequences? Last time I checked it couldn't.
> >
> > -- Lars Buitinck, Scientific programmer, ILPS, University of Amsterdam
>
> -- Leon Palafox, M.Sc, PhD Candidate, Iba Laboratory, University of Tokyo
Re: [Scikit-learn-general] Multiple training instances in the HMM library
To be honest, I am not that familiar with hidden Markov models yet, but I apply Markov chain models very frequently. There, it is a standard technique to train the model on several independent sequences at once. Let's assume we work with first-order Markov chains and are given two independent sequences:

    a - b - c
    d - b - a

I would then introduce a generic reset state, noted R, and concatenate the paths the following way (the leading R may or may not make sense, depending on what you want to achieve, but generally I would include it):

    (R) - a - b - c - R - d - b - a - R

You train your MM (HMM) with this single sequence. With the first-order property this is no problem, because the memorylessness assumption implies that we forget everything before a reset state. A test sequence might then look like:

    R - b - b - d - R

As mentioned, I have not tested this with HMMs, but for Markov chains it makes sense and works fine.

Regards, Philipp

On 18.03.2013 at 21:59, Didier Vila <dv...@capquestco.com> (Risk, CapQuest Group Ltd) wrote:

> Any code of your example is more than welcome.

On 18 March 2013 at 20:57, Leon Palafox wrote:

> But I agree, this is a hack that can be done outside of the library.
>
> Yeah, but wouldn't that defeat the whole point of training an HMM on batches of data of length N? If I'm following you, you would append K sequences of length N, ending up with one sequence of size K*N, and when you have a new observation of length N, in order to predict you would have to tile it so it fits the shape of the whole model, and each of the training examples can evaluate the new observation? Sounds even nastier.
>
> [earlier messages in this thread quoted in full]
>
> -- Leon Palafox, M.Sc, PhD Candidate, Iba Laboratory, +81-3-5841-8436, University of Tokyo, Tokyo, Japan
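For a plain first-order Markov chain, the reset construction Philipp describes can be sketched in a few lines: join the sequences with a reset token, count bigram transitions, and normalise. Nothing leaks across sequence boundaries except through the reset state itself:

```python
from collections import Counter, defaultdict

def fit_markov_chain(sequences, reset="R"):
    """First-order transition probabilities over sequences joined by a reset state."""
    # e.g. (R) - a - b - c - R - d - b - a - R
    joined = [reset]
    for seq in sequences:
        joined.extend(seq)
        joined.append(reset)

    counts = defaultdict(Counter)
    for prev, nxt in zip(joined, joined[1:]):
        counts[prev][nxt] += 1

    return {s: {t: c / sum(cs.values()) for t, c in cs.items()}
            for s, cs in counts.items()}

probs = fit_markov_chain([["a", "b", "c"], ["d", "b", "a"]])
print(probs["b"])  # {'c': 0.5, 'a': 0.5}: two observed transitions out of 'b'
print(probs["R"])  # {'a': 0.5, 'd': 0.5}: the reset state encodes start probabilities
```

Whether the equivalent works for the HMM module's EM training is, as the thread notes, untested; the sketch only illustrates the Markov-chain case.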
Re: [Scikit-learn-general] Data format
Why do you want to convert libsvm files to another structure? I don't quite get it. If you want to use the examples: scikit-learn ships with datasets that can be loaded directly. This section should help: http://scikit-learn.org/stable/datasets/index.html

On 08.03.2013 at 18:44, Mohamed Radhouane Aniba wrote:

> Hello!
>
> I am wondering if someone has developed a snippet or script that converts the libsvm format into a format directly usable by scikit-learn, without needing load_svmlight_file. The reason is that I am trying to use the examples provided on the website, but all of them are written for a format that is not libsvm.
>
> Thanks, Rad
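For reference, the svmlight/libsvm format is just one label followed by index:value pairs per line, so it is easy to see what load_svmlight_file reconstructs at scale. A minimal hand parser (no qid or comment handling; a sketch, not a replacement for the real loader):

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for tok in parts[1:]:
        idx, val = tok.split(":")
        features[int(idx)] = float(val)
    return label, features

print(parse_libsvm_line("1 3:0.5 7:1.25"))  # (1.0, {3: 0.5, 7: 1.25})
```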
Re: [Scikit-learn-general] Get every package once and for all
Well, the reason may be that EPD does not include the newest scikit-learn distribution. Afaik AdaBoost is only included in 0.14, the current development version, which you have to install by hand. Regards, Philipp On 07.03.2013 19:55, Mohamed Radhouane Aniba wrote: Hello I am just starting to use scikit-learn, as you might guess by now. Something is really frustrating about it. I am trying to run examples from the website to get used to the kit; some work fine, others fail because of a missing library. For example, I am trying to get plot_classifier_comparison.py to work, but I get an error message saying: ImportError: cannot import name AdaBoostClassifier Other classifiers work fine; why are some not recognized? Can someone point me to a way to get everything working once and for all, even the packages we will not necessarily use? I am using a MacBook Pro with the EPD kit (Python). Thanks Rad
Re: [Scikit-learn-general] Imbalance in scikit-learn
Hey! One simple solution that often works wonders is to set the class_weight parameter of a classifier (if available) to 'auto' [1]. If you have enough data, it often also makes sense to balance the data beforehand. [1] http://scikit-learn.org/dev/modules/svm.html#unbalanced-problems On 25.02.2013 14:02, Maor Hornstein wrote: I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data set has severe imbalance issues. Does anyone know a solution for imbalance in scikit-learn, or in Python in general? Thanks :)
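The class_weight trick above can be sketched in a few lines; note that in newer scikit-learn releases the 'auto' value was renamed to 'balanced', which is what this sketch uses (the toy data is made up, not the poster's):

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# A 9:1 imbalanced toy problem.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# 'balanced' reweights classes inversely to their frequency
# (older releases, like the one in this thread, called this 'auto').
clf = LinearSVC(class_weight="balanced", max_iter=10000)
clf.fit(X, y)
acc = clf.score(X, y)
```

The reweighting makes misclassifying the rare class as costly as misclassifying the frequent one, which is usually what an imbalanced problem needs.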
[Scikit-learn-general] named entity extraction
Hey guys! I currently have the problem of doing named entity extraction on relatively short, sparse textual input. I have a predefined set of concepts as well as training and test data. As I have no real experience with such a task, I wanted to ask if you can recommend any technique, preferably one that works via scikit-learn. Thanks and many regards, Philipp
Re: [Scikit-learn-general] Multilabel questions
Yep, I know that. The PR looks promising; I will look into it. Just another question: if the OVR predicts multiple labels for a sample, are they somehow ranked? I know it is just the one-vs-rest approach, but maybe there is some kind of confidence involved, because then the evaluation would be interesting by looking at rankings. Regards, Philipp On 24.01.2013 09:56, Arnaud Joly wrote: You should also be aware that the current metrics module doesn't handle multilabels correctly. The following PR https://github.com/scikit-learn/scikit-learn/pull/1606 might interest you. It adds multilabel support for some metrics. Best regards, Arnaud Joly On 23/01/2013 18:44, Andreas Mueller wrote: On 23.01.2013 18:39, Lars Buitinck wrote: if you want more predictions or something... More in detail: OneVsRestClassifier exports an object called label_binarizer_, which is used to transform decision function values D back to class labels. By default, it picks all the classes for which D > 0, but its threshold argument can be used to change that. So, if clf is an OvR classifier and D = clf.decision_function(x) for a *single sample* x contains no positive values, then # untested, may contain mistakes clf.label_binarizer_.inverse_transform(D, threshold=(D.max() - epsilon)) will predict at least one class label for x, namely the one with the highest value according to the decision_function. The epsilon is needed because inverse_transform compares values using >, not >=; set it to a small value. Doing this for batches of samples is a bit more involved. Of course, you can set the threshold to any value. Whether any of this makes sense depends on your problem. [I used to be opposed to exporting the LabelBinarizer object on OvR estimators, but I guess I should give up the struggle now -- this is actually useful.] I didn't even realize this possibility existed. I would have done it by hand. Thanks for the instructions.
-- Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS, MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft MVPs and experts. ON SALE this month only -- learn more at: http://p.sf.net/sfu/learnnow-d2d ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
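The fallback discussed above (predict the single best class when no OvR decision value is positive) can also be written in plain NumPy, without touching the label_binarizer_ internals. This is a hedged sketch of the same idea, not the thread's exact code:

```python
import numpy as np

def ovr_labels_with_fallback(D):
    """Return predicted label indices for one sample's OvR decision
    values D: all classes with D > 0, or the single best class when
    none is positive."""
    D = np.asarray(D)
    pos = np.flatnonzero(D > 0)
    return pos if pos.size else np.array([np.argmax(D)])
```

This guarantees at least one predicted label per sample, which matters for ranking-based evaluation.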
[Scikit-learn-general] Multilabel questions
Hey guys! I am currently trying to do multilabel prediction using textual features (e.g., tfidf). My data has a varying number of labels per sample: one sample can have just one label, another can have 10. I simply built a list of tuples for my y vector, for example: (19, 8, 7, 5) (8, 22, 23, 6, 18, 3) (22,) ... As a first step I decided to use LinearSVC. When I train the classifier with about 10,000 samples everything works fine and the prediction output looks fine as well. But as soon as I use all my samples (~300,000), my python.exe crashes on Windows. So I tried it on my Linux server, and I get a segfault there. Does anyone know how this can happen? Am I probably doing something wrong? I have some more questions regarding multilabel classification, but let's stick to this first ;) Many Regards, Philipp
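For context, the list-of-tuples target above is what current scikit-learn expresses as a binary indicator matrix. A sketch with made-up features (MultiLabelBinarizer is the modern API; releases from the thread's era accepted the tuple sequences directly):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# The label tuples from the question above.
y_tuples = [(19, 8, 7, 5), (8, 22, 23, 6, 18, 3), (22,)]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_tuples)  # one indicator column per distinct label

X = np.random.RandomState(0).rand(3, 4)  # made-up feature matrix
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
pred = clf.predict(X)  # indicator matrix, same shape as Y
```

One binary LinearSVC is fit per label column, which is why OneVsRestClassifier is the supported route for multilabel problems.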
Re: [Scikit-learn-general] Multilabel questions
Hey, That's what I originally thought, but then I tried it just using LinearSVC and it magically worked for my sample dataset, which is really interesting. I think it is working properly now. What I am asking myself is how exactly the decision is made for the multilabel prediction. Is there some way of influencing it? For example, sometimes it predicts zero classes and sometimes several. Is it also possible to pass a MultinomialNB to the OVR classifier? Or would I just use the predict_proba output and then decide myself how many and which labels to predict? Regards, Philipp On 23.01.2013 16:33, Andreas Mueller wrote: Hi Philipp. LinearSVC cannot cope with multilabel problems. It seems it is not doing enough input validation. You have to use OneVsRestClassifier together with LinearSVC to do that afaik. Cheers, Andy On 23.01.2013 16:27, Philipp Singer wrote: Hey guys! I am currently trying to do multilabel prediction using textual features (e.g., tfidf). My data has a varying number of labels per sample: one sample can have just one label, another can have 10. I simply built a list of tuples for my y vector, for example: (19, 8, 7, 5) (8, 22, 23, 6, 18, 3) (22,) ... As a first step I decided to use LinearSVC. When I train the classifier with about 10,000 samples everything works fine and the prediction output looks fine as well. But as soon as I use all my samples (~300,000), my python.exe crashes on Windows. So I tried it on my Linux server, and I get a segfault there. Does anyone know how this can happen? Am I probably doing something wrong? I have some more questions regarding multilabel classification, but let's stick to this first ;) Many Regards, Philipp
Re: [Scikit-learn-general] ANN: scikit-learn 0.13 released!
Great work as always, guys! Eager to try out the new features, especially the feature hashing. On 22.01.2013 00:02, Andreas Mueller wrote: Hi all. I am very happy to announce the release of scikit-learn 0.13. New features in this release include feature hashing for text processing, passive-aggressive classifiers, faster random forests and many more. There have also been countless improvements in stability, consistency and usability. Details can be found on the what's new page: http://scikit-learn.org/stable/whats_new.html Sources and Windows binaries are available on SourceForge, through PyPI (http://pypi.python.org/pypi/scikit-learn/0.13), or can be installed directly using pip: pip install -U scikit-learn A big thank you to all the contributors who made this release possible! In parallel to the release, we started a small survey https://docs.google.com/spreadsheet/viewform?formkey=dFdyeGNhMzlCRWZUdldpMEZlZ1B1YkE6MQ#gid=0 to get to know our user base a bit more. If you are using scikit-learn, it would be great if you could give us your input. Best, Andy
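The feature hashing mentioned in the announcement can be tried in a few lines. A sketch with toy token lists (the small n_features value is chosen arbitrarily for illustration):

```python
from sklearn.feature_extraction import FeatureHasher

# Hash raw string tokens into a fixed-width sparse matrix; no vocabulary
# is stored, which keeps memory bounded on large text corpora.
hasher = FeatureHasher(n_features=16, input_type="string")
X = hasher.transform([["cat", "dog", "cat"], ["fish"]])
```

Unlike a vectorizer, the hasher is stateless, so transform works without a prior fit over the corpus.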
Re: [Scikit-learn-general] does anyone do dot( sparse vec, sparse vec ) ?
Am 27.12.2012 18:32, schrieb Olivier Grisel: 2012/12/27 denis denis-bz...@t-online.de: Olivier Grisel olivier.grisel@... writes: 2012/12/27 denis denis-bz-gg@...: Folks, does any module in scikit-learn do dot( sparse vec, sparse vec ) a lot ? I wanted to try out a fast dot_sparse_vec (time ~ nnz, space ~ n) but so far I see only safe_sparse_dot( big sparse array, numpy array ) e.g. for RandomPCA. The speed of the sparse matrix dot sparse matrix depends on the actual implementation of the scipy.sparse matrices. Olivier, sorry, I wasn't clear: I want to try out my fast NEW implementation of dot( sparse vec, sparse vec ) and am looking for a testcase in scikit-learn that does a lot of those to measure the speedup cheers Alright. AFAIK we don't have a use case in scikit-learn for that kind of operation yet. Computing k-nn queries using cosine similarity on a pre-normalized sparse vector corpus + query might be a valid use case though. I agree. You could do something like all pairs cosine similarity using a large sparse matrix. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS, MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft MVPs and experts. ON SALE this month only -- learn more at: http://p.sf.net/sfu/learnmore_122712 ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general -- Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS, MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft MVPs and experts. 
ON SALE this month only -- learn more at: http://p.sf.net/sfu/learnmore_122712
[Scikit-learn-general] Get classification report inside grid search or cv
Hey! Is it possible to somehow get detailed prediction information inside grid search or cross validation for individual folds or grid points? That is, I want to know how my classes perform in each of the folds I am doing in GridSearchCV. I can read the average scores using grid_scores_, and this is fine, but I want to see information one step deeper. It would be enough to get y_true and y_predicted for each fold. Regards, Philipp -- LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial Remotely access PCs and mobile devices and provide instant support Improve your efficiency, and focus on delivering more value-add services Discover what IT Professionals Know. Rescue delivers http://p.sf.net/sfu/logmein_12329d2d
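One way to get y_true and y_predicted per fold is to drive the splitter by hand instead of going through GridSearchCV. A sketch on the iris data using today's model_selection API (the module in the thread's era was sklearn.cross_validation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
folds = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    clf = LinearSVC(max_iter=10000).fit(X[train_idx], y[train_idx])
    # keep the per-fold ground truth and predictions for deeper inspection
    folds.append((y[test_idx], clf.predict(X[test_idx])))
```

Each (y_true, y_pred) pair can then be fed to classification_report or any per-class metric, fold by fold.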
Re: [Scikit-learn-general] Append additional data in pipeline
It's probably better to train a linear classifier on the text features alone and a second (potentially non-linear) classifier, such as GBRT or ExtraTrees, on the predict_proba outcome of the text classifier + your additional low-dim features. This is a kind of stacking method (a sort of ensemble method). It should keep the text features from overwhelming the final classifier if the other features are informative. Hey Olivier! Thanks for the hints. I just tried it, but unfortunately the results are much worse than just using my textual features alone. Just to be sure I am doing it right: at first I create my textual features using a vectorizer. Then I fit a linear SVC on these features (training data, of course) and use predict_proba on my training samples again, resulting in a probability distribution of dimension 7 (I have 7 classes). Then I append my additional features (there are 15) and fit another classifier on the new data. (I tried several scaling/normalizing ideas without improvement.) I do the same procedure for the test data. (Btw, I do cross validation.) While I get a 0.85 f1 score just using textual data, the combined approach results in only 0.4. Regards, Philipp
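The stacking scheme Olivier sketches can be written down roughly as follows, with synthetic stand-ins for the tfidf matrix and the 15 extra features (in a real setup the level-1 probabilities should come from out-of-fold predictions to avoid leakage, which may explain part of the score drop reported above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_text, y = make_classification(n_samples=300, n_features=50, random_state=0)
X_extra = rng.rand(300, 15)  # stand-in for the 15 additional features

# Level 1: probabilistic classifier on the "text" features alone.
level1 = LogisticRegression(max_iter=1000).fit(X_text, y)
P = level1.predict_proba(X_text)  # use out-of-fold predictions in practice

# Level 2: low-dimensional stack of probabilities + extra features.
X_stack = np.hstack([P, X_extra])
level2 = ExtraTreesClassifier(random_state=0).fit(X_stack, y)
```

The level-2 input has only n_classes + 15 columns, so the sparse text features can no longer drown out the extra features.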
Re: [Scikit-learn-general] Append additional data in pipeline
On 04.12.2012 12:26, Andreas Mueller wrote: On 04.12.2012 12:20, Olivier Grisel wrote: Have you scaled your additional features to the [0-1] range like the probability features from the text classifier? If you do a full grid search of the SVC hyperparameters (e.g. a linear or RBF kernel, and C, plus gamma for RBF only), there is no reason the stacked model should be worse than the original text classifier (unless you have very few samples and the additional features are pure noise). Can't the stacked model be worse because of overfitting issues? I guess if you include a linear SVM, it might be able to learn the identity and be as good as the original classifier. With only an RBF-SVM, I'm not sure this is possible. But testing just a linear SVM should definitely not make things worse if the grid search is done correctly. I use a linear SVM for learning the probabilities for my samples (I used grid search to determine the optimal parameters). Then I append the additional features and, as suggested, use gradient boosting or an extra-trees classifier. What do you mean by testing just a linear SVM? On my new feature space? Btw, I only have 64 samples. I will try to append the probability features using leave-one-out now.
Re: [Scikit-learn-general] Append additional data in pipeline
Have you scaled your additional features to the [0-1] range like the probability features from the text classifier? Until now I applied Scaler() (I'm on 0.12 atm) to the new feature space. Should I do this on my appended features only? But then they are not exactly between 0 and 1. I would probably need MinMaxScaler from 0.13, which I can't access atm.
Re: [Scikit-learn-general] Append additional data in pipeline
On 04.12.2012 15:15, Olivier Grisel wrote: 2012/12/4 Philipp Singer kill...@gmail.com: Have you scaled your additional features to the [0-1] range like the probability features from the text classifier? Until now I applied Scaler() (I'm on 0.12 atm) to the new feature space. Should I do this on my appended features only? But then they are not exactly between 0 and 1. I would probably need MinMaxScaler from 0.13, which I can't access atm. Variance-based scaling should be good enough. Interestingly, with an ExtraTreesClassifier I get worse results when I scale than when I just leave the features as they are (i.e., probability features between 0 and 1 and the rest something else). Normalizing along axis 0 doesn't help either. Regarding the low number of samples: I agree, but I can't change that atm :(
Re: [Scikit-learn-general] Append additional data in pipeline
Thanks to Andreas I got it working now using a custom estimator for the pipeline. I am still struggling a bit to combine textual features (e.g., tfidf) with other features that work well on their own. At the moment I am just concatenating them, enlarging the vector. The problem is that the few added features do not seem to have any impact on the classifier, as the accuracy is exactly the same as if I used only textual features. They just seem to be overwhelmed by the huge number of textual features. Is there some clever way of combining both feature types, perhaps using composite/multiple kernels? Maybe someone has an idea about that. This is something I have been struggling with for a while now and still haven't found a clever way of solving. Regards, Philipp
[Scikit-learn-general] Potential problem with Leave-one-out and f1_score
Hey! First of all: thanks for the hints on my last post. I decided to stick with Leave-One-Out for now, and I'm doing grid search with cross validation using Leave-One-Out. As I am interested in the F1 score, I am using it as score_func. The problem is that the following error message comes up: ValueError: pos_label=1 is not a valid label: array([ 0., 3.]) The score_func seems to assume a binary classification and needs a pos_label that matches the labels, in this case 0 or 3. Nevertheless, it is a multiclass classification. Passing pos_label=None doesn't work in this case either. Does anyone have a hint about what I am doing wrong? Thanks Philipp -- Keep yourself connected to Go Parallel: TUNE You got it built. Now make it sing. Tune shows you how. http://goparallel.sourceforge.net
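For what it's worth, in current scikit-learn the multiclass case is handled by passing an explicit average argument to f1_score rather than pos_label. A sketch with the two labels from the error message:

```python
from sklearn.metrics import f1_score

y_true = [0, 3, 3, 0, 3]
y_pred = [0, 3, 0, 0, 3]

# 'macro' averages the per-class F1 scores, so no pos_label is needed.
score = f1_score(y_true, y_pred, average="macro")
```

Other averaging modes ('micro', 'weighted') exist for the multiclass case; which one is appropriate depends on whether rare classes should count as much as frequent ones.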
[Scikit-learn-general] Append additional data in pipeline
Hey again! Today is my posting day, I hope you don't mind, but I just stumbled upon a further problem. I currently use a grid search + StratifiedKFold approach that works on textual data, with a pipeline that also does the tfidf vectorization. The thing is that I want to append additional, non-textual features to the feature data. Is there some way of doing so in the pipeline? Of course I could do the tfidf transformations beforehand and append the additional features there, but then the whole idea of fitting on training data only is lost. Regards, Philipp
Re: [Scikit-learn-general] Append additional data in pipeline
On 30.11.2012 17:31, Andreas Mueller wrote: This kind of (but not really) sounds like a job for FeatureUnion: http://scikit-learn.sourceforge.net/dev/modules/pipeline.html#featureunion-combining-feature-extractors Feature union applies different transformers to the same input data. But you already start with two kinds of data, right? Yep, exactly: one with textual data and the other with other kinds of features. I guess you could make your data be a list of tuples (text, non-text). Then you would still need a transformer that projects onto the components, though. This might not be ideal. I thought about building a custom transformer that I can pass to the pipeline and that somehow appends the features to train and test data. But the problem is that I don't know exactly which data is used for the splits (i.e., which samples). How would you do it with a list of tuples? Though I have no better idea. Cheers, Andy Thanks, Philipp
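Andreas' list-of-tuples idea combined with FeatureUnion can be sketched like this; the ItemSelector and DenseRows transformers are hypothetical helpers written for illustration, not scikit-learn classes:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

class ItemSelector(BaseEstimator, TransformerMixin):
    """Project (text, extra) record tuples onto one component."""
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [record[self.key] for record in X]

class DenseRows(BaseEstimator, TransformerMixin):
    """Stack per-sample numeric vectors into a 2-D array."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.vstack(X)

data = [("the cat sat", [0.1, 0.2]),
        ("the dog ran", [0.3, 0.4]),
        ("a cat ran", [0.5, 0.6]),
        ("a dog sat", [0.7, 0.8])]

union = FeatureUnion([
    ("text", Pipeline([("pick", ItemSelector(0)),
                       ("tfidf", TfidfVectorizer())])),
    ("extra", Pipeline([("pick", ItemSelector(1)),
                        ("rows", DenseRows())])),
])
X = union.fit_transform(data)  # tfidf columns followed by 2 extra columns
```

Because the tuples travel through the splitter as single samples, train/test alignment is preserved and the tfidf vocabulary is still fit on training data only.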
[Scikit-learn-general] Cross validation iterator - leave one out per class
Hey! I have the following scenario: I have, e.g., three different classes. For class 0 I may have 6 samples, for class 1 ten, and for class 2 four. I now want to do cross validation ten times, but I want to train on all samples of each class except one, which I use as test data. I know that there is a Leave-One-Out mechanism in scikit-learn, but that leaves out just one sample in total; I want to leave one out for each class. Does this even make sense? ;) If so, is there some easy way of doing this in scikit-learn? Regards, Philipp -- Keep yourself connected to Go Parallel: VERIFY Test and improve your parallel project with help from experts and peers. http://goparallel.sourceforge.net
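A custom split generator is probably the easiest route. A hedged sketch (leave_one_per_class is a made-up helper, not a scikit-learn API):

```python
import numpy as np

def leave_one_per_class(y, n_rounds, seed=0):
    """Yield (train, test) index arrays; each test set contains one
    randomly drawn sample from every class."""
    y = np.asarray(y)
    rng = np.random.RandomState(seed)
    classes = np.unique(y)
    for _ in range(n_rounds):
        test = np.array([rng.choice(np.flatnonzero(y == c)) for c in classes])
        train = np.setdiff1d(np.arange(len(y)), test)
        yield train, test

# 6 / 10 / 4 samples per class, as in the question above.
y = [0] * 6 + [1] * 10 + [2] * 4
splits = list(leave_one_per_class(y, n_rounds=10))
```

Anything that yields (train, test) index pairs can be passed as the cv argument to cross-validation helpers, so a generator like this slots straight in.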
Re: [Scikit-learn-general] All-pairs-similarity calculation
Am 27.10.2012 23:43, schrieb Joseph Turian: If you only care about near matches and not the full n^2 matrix: +1 to OG's suggestion to use pylucene. You can use pylucene to generate candidates, and then compute the exact tf*idf cosine distance on the shortlist. Yes exactly. I would only need the most similar matches. The problem with the lucene solution is that I do not need tfidf. I really have to do simple cosine similarity on my available vectors. So e.g., my matrix (vectors) look the following way: [[1 2 5] [3 1 0]] Now get the cosine similarity between row one and two or in this case get the most similar row given row one using cosine similarity without any further variations. As already mentioned I have the data in sparse form. I assume this will be n log n. Another option for fast all-pairs is to use locality sensitive hashing. (I didn't read the papers or see if that's what they do.) It is not clear what the accuracy will be, but it will probably be the fastest. ] Yeah, some kind of dimension reduction is another option, but actually this would be very hard for me because I have already done all my previous experiments on the complete representations, so if I could find any faster solution for my problem this would be awesome. Regards, Philipp On Fri, Oct 26, 2012 at 3:31 PM, Philipp Singer kill...@gmail.com wrote: Am 26.10.2012 15:35, schrieb Olivier Grisel: BTW, in the mean time you could encode your coocurrences as text identifiers use either Lucene/Solr in Java using the sunburnt python client or woosh [1] in python as a way to do efficient sparse lookups in such a sparse matrix to be able to quickly compute the non zero cosine similarities between all pairs. Solr also as MoreLikeThis queries that can be used to truncate the search to the top most similar samples in the set of samples in the case you have some very frequent non zero features that would mostly break the sparsity of the cosine similarity matrix. 
As Trey Grainger says in his talk Building a real-time, Solr-powered recommendation engine: A Lucene index is a multi-dimensional sparse matrix... with very fast and powerful lookup capabilities. [1] http://packages.python.org/Whoosh/quickstart.html [2] http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine Thanks, this looks promising. What exactly do you mean by encoding co-occurrences as text identifiers? How would I handle my sparse vectors then? I know the MoreLikeThis functionality, but does it do exact cosine similarity? The thing is that I need this relatedness measure for my studies. Philipp
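The cosine-similarity step discussed above can be done directly on a sparse matrix: L2-normalize the rows, then a single sparse matrix product gives all pairwise cosines. A small sketch using the thread's example vectors plus a third row added for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

X = csr_matrix(np.array([[1.0, 2.0, 5.0],
                         [3.0, 1.0, 0.0],
                         [2.0, 4.0, 10.0]]))  # row 2 is parallel to row 0

Xn = normalize(X)            # L2-normalize each row
S = (Xn @ Xn.T).toarray()    # S[i, j] = cosine similarity of rows i and j

# Most similar row to row 0, excluding row 0 itself:
sims = S[0].copy()
sims[0] = -np.inf
best = int(np.argmax(sims))
```

For the full 3.7M x 3.7M case the product itself stays sparse and only needs to be materialized blockwise, but the principle is the same.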
[Scikit-learn-general] All-pairs-similarity calculation
Hey there! Currently I am working with very large sparse vectors and have to calculate the similarity between all pairs of them. I have now looked into the available code in scikit-learn and also at the corresponding literature, and stumbled upon this paper [1] and the corresponding implementation [2]. I was wondering whether this could be a potential improvement for scikit-learn when working with very large feature files where it is still necessary to calculate the pairwise similarity of vectors, for different classifiers or other tasks. The goal would be to speed this whole thing up. I am by far no expert in this, but just wanted to ask you guys for your opinion ;) Regards, Philipp [1] http://www.bayardo.org/ps/www2007.pdf [2] http://code.google.com/p/google-all-pairs-similarity-search/
Re: [Scikit-learn-general] All-pairs-similarity calculation
On 26.10.2012 14:27, Olivier Grisel wrote: 2012/10/26 Philipp Singer kill...@gmail.com: Hey there! Currently I am working with very large sparse vectors and have to calculate the similarity between all pairs of them. How many features? Are they sparse? If so, which sparsity level? In detail: I have a large co-occurrence matrix with a shape of around 3.7M x 3.7M. Yes, they are sparse, but I can't tell you the exact sparsity level right now; they should be very sparse, because in my case a single element does not have a co-occurrence count with a large number of other elements. The problem is that I need cosine similarity, so I also can't use the suitable implementations of distances available in numpy, scipy or scikit-learn, but instead pass a callable function that does the job. (Currently I am using my own implementation for this, because it is just impossible to calculate the all-pairs similarity for my large data at the moment.) I have now looked into the available code in scikit-learn and also at the corresponding literature, and stumbled upon this paper [1] and the corresponding implementation [2]. I was wondering whether this could be a potential improvement for scikit-learn when working with very large feature files where it is still necessary to calculate the pairwise similarity of vectors, for different classifiers or other tasks. The goal would be to speed this whole thing up. I am by far no expert in this, but just wanted to ask you guys for your opinion ;) Computing the sparse cosine similarity matrix of a large matrix (in both n_samples and n_features) is really lacking in scikit-learn. I wanted to implement some tools to do this efficiently when working on my power iteration clustering pull request some time ago but never found the time to do it.
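The sparsity-level question above is easy to answer for a scipy sparse matrix: the density is just the number of stored non-zeros divided by the total number of cells. A quick illustration (the matrix here is a random stand-in, not the real co-occurrence data):

```python
from scipy.sparse import random as sparse_random

# Hypothetical stand-in for a sparse co-occurrence matrix:
# 1000 x 1000 with 0.1% non-zero entries.
X = sparse_random(1000, 1000, density=0.001, format="csr", random_state=0)

# Density = stored non-zeros over total cells; sparsity is its complement.
density = X.nnz / float(X.shape[0] * X.shape[1])
sparsity = 1.0 - density
print(density)   # → 0.001
```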
My idea was to use an in-memory inverted index structure, similar to a full-text indexer such as Lucene but using integer feature indices rather than string feature names / tokens. Such a data structure would also be interesting for sklearn.neighbors, to do efficient k-nearest-neighbors multiclass or multilabel classification on high-dimensional sparse data (which we don't address efficiently with the current BallTree data structure, which is optimal for fewer than 100 dense features). That would be awesome, as I already had the impression that k-nearest neighbors is very slow for large data in scikit-learn; that was also the link to classification I made above, for which this would be helpful too. [1] http://www.bayardo.org/ps/www2007.pdf [2] http://code.google.com/p/google-all-pairs-similarity-search/ Thanks for the links, added them to my reading list. Perfect ;) Regards, Philipp
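As a stopgap until such an inverted index exists, nearest neighbors with a cosine metric can at least be run directly on sparse input using brute-force search, which sidesteps the BallTree limitation mentioned above. A sketch with random stand-in data (in current scikit-learn versions NearestNeighbors accepts sparse matrices when algorithm='brute'):

```python
from scipy.sparse import random as sparse_random
from sklearn.neighbors import NearestNeighbors

# Hypothetical sparse data: 200 samples, 5000 features, 1% non-zeros.
X = sparse_random(200, 5000, density=0.01, format="csr", random_state=0)

# Brute force supports sparse input and the cosine metric,
# unlike BallTree, which needs dense vectors.
nn = NearestNeighbors(n_neighbors=5, metric="cosine", algorithm="brute")
nn.fit(X)
dist, idx = nn.kneighbors(X[:1])   # 5 nearest rows to the first row
```

For all-pairs queries this is still O(n^2), so it only helps up to moderate sample counts; the inverted-index idea above is what would make it scale.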
Re: [Scikit-learn-general] How to save an array of models
On 17.10.2012 20:57, Kenneth C. Arnold wrote: import cPickle as pickle # faster on Py2.x, default on Py3. with open(filename, 'wb') as f: pickle.dump(obj, f, -1) The -1 at the end chooses the latest file format version, which is more compact. What exactly does -1 do? I guess that's the protocol. I have always used 2 in this case; didn't know about -1. Regards, Philipp -Ken On Wed, Oct 17, 2012 at 1:31 PM, Niall Twomey twom...@gmail.com wrote: Hi all. I want to save an array of models trained on lots of data to file. I have tried the following code (roughly speaking anyway): models = [] # Populate the list of models with dict items containing one number and PCA and GMM models import pickle pickle.dump( models.pickle, models ) but I get errors saying: AttributeError: 'list' object has no attribute 'write', which presumably refers to the models list. Saving them to file is crucial for me, but I have no idea how to proceed from here. Any advice will be appreciated. Thanks.
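The error in the quoted snippet comes from swapped arguments: pickle.dump takes the object first and an open file object second, while the snippet passed the list where the file should be. A minimal corrected sketch (the file name and dummy models are illustrative):

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted PCA/GMM models from the question.
models = [{"id": 0, "pca": "fitted-pca", "gmm": "fitted-gmm"}]

path = os.path.join(tempfile.gettempdir(), "models.pickle")
with open(path, "wb") as f:
    pickle.dump(models, f, -1)   # -1 = highest pickle protocol available

with open(path, "rb") as f:
    restored = pickle.load(f)
```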
Re: [Scikit-learn-general] Combining TFIDF and LDA features
On 14.09.2012 14:53, Andreas Müller wrote: Hi Philipp. Hey Andreas! First, you should ensure that the features all have approximately the same scale. For example, they should all be between zero and one - if the LDA features are much smaller than the other ones, then they will probably not be weighted much. LDA features sum up to 1 for one sample, because they describe the probability that a sample belongs to the different topics (in this case 500). So basically, they are between 0 and 1. Which LDA package did you use? We used Mallet's LDA implementation, because from experience it has the most established smoothing processes. http://mallet.cs.umass.edu/ If we just train on the LDA features we btw get reasonable results, a bit worse than pure TFIDF. I am not very experienced with this kind of model, but maybe it would be helpful to look at some univariate statistics, like ``feature_selection.chi2``, to see if the LDA features are actually helpful. Yeah, this is something I could look into. I have already tried to do feature selection with chi2, but not actually looked at the specific statistics. Cheers, Andy Regards, Philipp ----- Original message ----- From: Philipp Singer kill...@gmail.com To: scikit-learn-general@lists.sourceforge.net Sent: Friday, 14 September 2012, 13:47:30 Subject: [Scikit-learn-general] Combining TFIDF and LDA features Hey there! I have seen in the past some research papers that combined tfidf-based features with LDA topic model features and could increase their accuracy by a useful margin. I now wanted to do the same. As a simple first step I just appended the topic features for each train and test sample to the existing tfidf features and ran my standard LinearSVC - oh, btw, thanks that the confusion with dense and sparse is now resolved in 0.12 ;) - on it. The problem now is that the results are overall essentially the same. Some classes perform better and some worse.
I am not exactly sure if this is a data problem, or comes from my lack of understanding of such feature extension techniques. Is it possible that the huge number of tfidf features somehow overrules the rather small number of topic features? Do I maybe have to do some feature modification - because tfidf and LDA features are of a different nature? Maybe it is also due to the classifier and I need something else? Would be happy if someone could shed a little light on my problems ;) Regards, Philipp
Re: [Scikit-learn-general] Combining TFIDF and LDA features
On 14.09.2012 15:10, amir rahimi wrote: Have you done tests using some other classifiers, such as gradient boosting, which has a kind of internal feature selection? Actually not, but I wanted to try that out, if the runtime allows it. On Fri, Sep 14, 2012 at 5:36 PM, Andreas Müller amuel...@ais.uni-bonn.de wrote: I'd be interested in the outcome. Let us know when you get it to work :) ----- Original message ----- From: Philipp Singer kill...@gmail.com To: scikit-learn-general@lists.sourceforge.net Sent: Friday, 14 September 2012, 14:00:48 Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features On 14.09.2012 14:53, Andreas Müller wrote: Hi Philipp. Hey Andreas! First, you should ensure that the features all have approximately the same scale. For example, they should all be between zero and one - if the LDA features are much smaller than the other ones, then they will probably not be weighted much. LDA features sum up to 1 for one sample, because they describe the probability that a sample belongs to the different topics (in this case 500). So basically, they are between 0 and 1. Which LDA package did you use? We used Mallet's LDA implementation, because from experience it has the most established smoothing processes. http://mallet.cs.umass.edu/ If we just train on the LDA features we btw get reasonable results, a bit worse than pure TFIDF. I am not very experienced with this kind of model, but maybe it would be helpful to look at some univariate statistics, like ``feature_selection.chi2``, to see if the LDA features are actually helpful. Yeah, this is something I could look into. I have already tried to do feature selection with chi2, but not actually looked at the specific statistics.
Cheers, Andy Regards, Philipp ----- Original message ----- From: Philipp Singer kill...@gmail.com To: scikit-learn-general@lists.sourceforge.net Sent: Friday, 14 September 2012, 13:47:30 Subject: [Scikit-learn-general] Combining TFIDF and LDA features Hey there! I have seen in the past some research papers that combined tfidf-based features with LDA topic model features and could increase their accuracy by a useful margin. I now wanted to do the same. As a simple first step I just appended the topic features for each train and test sample to the existing tfidf features and ran my standard LinearSVC - oh, btw, thanks that the confusion with dense and sparse is now resolved in 0.12 ;) - on it. The problem now is that the results are overall essentially the same. Some classes perform better and some worse. I am not exactly sure if this is a data problem, or comes from my lack of understanding of such feature extension techniques. Is it possible that the huge number of tfidf features somehow overrules the rather small number of topic features? Do I maybe have to do some feature modification - because tfidf and LDA features are of a different nature? Maybe it is also due to the classifier and I need something else? Would be happy if someone could shed a little light on my problems ;) Regards, Philipp
Re: [Scikit-learn-general] Combining TFIDF and LDA features
Okay, so I did a quick chi2 check, and it seems some LDA features have high chi2 scores (i.e., low p-values), so they should be helpful at least. On 14.09.2012 15:06, Andreas Müller wrote: I'd be interested in the outcome. Let us know when you get it to work :) ----- Original message ----- From: Philipp Singer kill...@gmail.com To: scikit-learn-general@lists.sourceforge.net Sent: Friday, 14 September 2012, 14:00:48 Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features On 14.09.2012 14:53, Andreas Müller wrote: Hi Philipp. Hey Andreas! First, you should ensure that the features all have approximately the same scale. For example, they should all be between zero and one - if the LDA features are much smaller than the other ones, then they will probably not be weighted much. LDA features sum up to 1 for one sample, because they describe the probability that a sample belongs to the different topics (in this case 500). So basically, they are between 0 and 1. Which LDA package did you use? We used Mallet's LDA implementation, because from experience it has the most established smoothing processes. http://mallet.cs.umass.edu/ If we just train on the LDA features we btw get reasonable results, a bit worse than pure TFIDF. I am not very experienced with this kind of model, but maybe it would be helpful to look at some univariate statistics, like ``feature_selection.chi2``, to see if the LDA features are actually helpful. Yeah, this is something I could look into. I have already tried to do feature selection with chi2, but not actually looked at the specific statistics. Cheers, Andy Regards, Philipp ----- Original message ----- From: Philipp Singer kill...@gmail.com To: scikit-learn-general@lists.sourceforge.net Sent: Friday, 14 September 2012, 13:47:30 Subject: [Scikit-learn-general] Combining TFIDF and LDA features Hey there!
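Looking at the actual chi2 statistics, as discussed above, is straightforward; informative features show up with high chi2 scores and correspondingly low p-values. A sketch on random stand-in data (chi2 requires non-negative features, which both tfidf values and LDA topic proportions satisfy):

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.RandomState(0)
X = rng.rand(100, 10)              # hypothetical non-negative feature matrix
y = rng.randint(0, 3, size=100)    # hypothetical class labels

scores, pvalues = chi2(X, y)
# Rank features by informativeness: high score / low p-value first.
ranked = np.argsort(scores)[::-1]
```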
I have seen in the past some research papers that combined tfidf-based features with LDA topic model features and could increase their accuracy by a useful margin. I now wanted to do the same. As a simple first step I just appended the topic features for each train and test sample to the existing tfidf features and ran my standard LinearSVC - oh, btw, thanks that the confusion with dense and sparse is now resolved in 0.12 ;) - on it. The problem now is that the results are overall essentially the same. Some classes perform better and some worse. I am not exactly sure if this is a data problem, or comes from my lack of understanding of such feature extension techniques. Is it possible that the huge number of tfidf features somehow overrules the rather small number of topic features? Do I maybe have to do some feature modification - because tfidf and LDA features are of a different nature? Maybe it is also due to the classifier and I need something else? Would be happy if someone could shed a little light on my problems ;) Regards, Philipp
Re: [Scikit-learn-general] Combining TFIDF and LDA features
Hey! On 14.09.2012 15:10, Peter Prettenhofer wrote: I totally agree - I had such an issue in my research as well (combining word presence features with SVD embeddings). I followed Blitzer et al. 2006 and normalized** both feature groups separately - e.g. you could normalize the word presence features such that the L1 norm equals 1 and do the same for the SVD embeddings. Isn't the normalization already part of the tfidf transformation? So basically the word presence tfidf features are already L2-normalized - but maybe I misunderstand this completely. In my work I had the impression, though, that L1/L2 normalization was inferior to simply scaling the embeddings by a constant alpha such that the average L2 norm is 1. [1] Ah, I see. How would I do that exactly? Isn't that the same thing the normalization technique in scikit-learn is doing? ** normalization here means row-level normalization - similar to document length normalization in TF/IDF. HTH, Peter Regards, Philipp Blitzer et al. 2006, Domain Adaptation using Structural Correspondence Learning, http://john.blitzer.com/papers/emnlp06.pdf [1] This is also described here: http://scikit-learn.org/dev/modules/sgd.html#tips-on-practical-use
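The constant-alpha scaling Peter describes (scale a feature block by one constant so its average row L2 norm is 1, rather than normalizing each row individually) can be sketched as follows, with random stand-ins for the tfidf and LDA blocks; this is not the per-row normalize() in scikit-learn, which is exactly the difference he is pointing at:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
X_tfidf = csr_matrix(rng.rand(20, 50))       # stand-in for the tfidf block
X_lda = rng.dirichlet(np.ones(5), size=20)   # stand-in topic proportions; rows sum to 1

# One constant alpha for the whole LDA block, chosen so that the
# *average* row L2 norm becomes 1 (per-row norms still vary).
alpha = 1.0 / np.linalg.norm(X_lda, axis=1).mean()

# Per-row L2 normalization for the tfidf block, constant scaling for LDA.
X_combined = hstack([normalize(X_tfidf), csr_matrix(alpha * X_lda)])
```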
Re: [Scikit-learn-general] how to pickle CountVectorizer
On 08.08.2012 14:53, David Montgomery wrote: So... does it make sense to pickle CountVectorizer? I just did not want to fit the CountVectorizer every time I wanted to score an svm model. It makes sense to pickle the fitted vectorizer. In this case you are just trying to pickle the plain (unfitted) object. Regards, Philipp
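To illustrate the point: fit the vectorizer first, then pickle the fitted object so the learned vocabulary is preserved (a sketch with toy documents):

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog barked"]

vec = CountVectorizer()
vec.fit(docs)                    # learn the vocabulary first

blob = pickle.dumps(vec, -1)     # pickle the *fitted* object
vec2 = pickle.loads(blob)

# The restored vectorizer transforms new text without refitting.
X = vec2.transform(["the cat barked"])
```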
Re: [Scikit-learn-general] Incorporation of extra training examples
On 18.07.2012 15:32, Peter Prettenhofer wrote: In this case I would fit one MultinomialNB for the foreground model and one for the background model. But how would I do the feature extraction (I have text documents) in this case? Would I fit (e.g., tfidf) on the whole corpus (foreground + background) and then transform both datasets, and the test dataset as well, with the fitted vectorizer? Personally, I'd start without using IDF; otherwise, wrap both estimators using a Pipeline and add a TfidfTransformer (see [1]). best, Peter [1] http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html Everything works fine now. The sad thing, though, is that I still can't really improve the classification results. The only thing I can achieve is a higher recall for the classes that work well in the background model, but the precision drops at the same time. Overall I stay at about the same average score when incorporating the background model. If anyone has any further ideas, please let me know ;) Regards, Philipp
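Peter's suggestion (wrap each estimator in a Pipeline so the feature extraction is fitted per model) might look like the sketch below; the corpus and labels are made-up stand-ins, and use_idf=False follows the "start without IDF" advice:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical foreground corpus; an identical pipeline would be fit
# separately on the background corpus.
docs = ["spam spam offer", "meeting agenda notes",
        "cheap offer now", "project meeting today"]
labels = [1, 0, 1, 0]

pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer(use_idf=False)),  # start without IDF
    ("nb", MultinomialNB()),
])
pipe.fit(docs, labels)
pred = pipe.predict(["cheap spam offer"])   # → array([1])
```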
Re: [Scikit-learn-general] Incorporation of extra training examples
On 20.07.2012 11:47, Lars Buitinck wrote: 2012/7/20 Philipp Singer kill...@gmail.com: Everything works fine now. The sad thing, though, is that I still can't really improve the classification results. The only thing I can achieve is a higher recall for the classes that work well in the background model, but the precision drops at the same time. Overall I stay at about the same average score when incorporating the background model. If anyone has any further ideas, please let me know ;) Well, since Gael already mentioned semi-supervised training using label propagation: I have an old PR which has still not been merged, mostly for API reasons, that implements semi-supervised training of Naive Bayes using an EM algorithm: https://github.com/scikit-learn/scikit-learn/pull/430 I've seen improvements in F1 score when doing text classification with this algorithm. It may take some work to get this up to speed with the latest scikit-learn, though. Hey Lars, Thanks, this looks awesome. I will try it out. The reason why I haven't used label propagation techniques yet is that I could not achieve a fast runtime, because I have huge amounts of unlabeled/background data available. (Just out of curiosity, which topic models did you try? I'm looking into these for my own projects.) We have been using Mallet's LDA-based Parallel Topic Model. Philipp
Re: [Scikit-learn-general] Incorporation of extra training examples
On 20.07.2012 11:47, Lars Buitinck wrote: Well, since Gael already mentioned semi-supervised training using label propagation: I have an old PR which has still not been merged, mostly for API reasons, that implements semi-supervised training of Naive Bayes using an EM algorithm: https://github.com/scikit-learn/scikit-learn/pull/430 I've seen improvements in F1 score when doing text classification with this algorithm. It may take some work to get this up to speed with the latest scikit-learn, though. Hey again! I have just tried out your implementation of semi-supervised MultinomialNB. The code works flawlessly, but unfortunately the performance drops sharply when I try to incorporate my additional data. I am starting to think that my additional data is useless :/ Just for the record: training on my 96,000 labeled examples with MultinomialNB gets me an F1 score of 0.47; using around 2,000,000 additional unlabeled examples with your semi-supervised code achieves an F1 score of 0.39. Regards, Philipp
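For anyone who wants to try this idea without the unmerged PR, a rough hard-EM (self-training) loop around a plain MultinomialNB looks like the sketch below. Note that PR #430 uses soft (probabilistic) EM, so this is only an approximation of it, and the count data here is a random stand-in:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X_lab = rng.poisson(1.0, size=(60, 30))      # labeled count features
y_lab = rng.randint(0, 3, size=60)
X_unlab = rng.poisson(1.0, size=(200, 30))   # unlabeled count features

clf = MultinomialNB()
clf.fit(X_lab, y_lab)
for _ in range(5):
    # E-step (hard): pseudo-label the unlabeled data;
    # M-step: refit on labeled + pseudo-labeled data together.
    y_pseudo = clf.predict(X_unlab)
    clf.fit(np.vstack([X_lab, X_unlab]),
            np.concatenate([y_lab, y_pseudo]))
```

With real data one would also down-weight or threshold the pseudo-labeled examples, which is exactly where noisy background data can otherwise drag the model down, as reported above.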
Re: [Scikit-learn-general] Incorporation of extra training examples
On 20.07.2012 15:34, Lars Buitinck wrote: 2012/7/20 Philipp Singer kill...@gmail.com: I have just tried out your implementation of semi-supervised MultinomialNB. The code works flawlessly, but unfortunately the performance drops sharply when I try to incorporate my additional data. I am starting to think that my additional data is useless :/ Just for the record: training on my 96,000 labeled examples with MultinomialNB gets me an F1 score of 0.47; using around 2,000,000 additional unlabeled examples with your semi-supervised code achieves an F1 score of 0.39. Hmm, too bad. Is the extra data from a very different source? Not very different - documents produced by another kind of user. I really thought this data could somehow improve the whole classification process, because fitting a model on the extra data alone leads to an F1 score of 0.27, which is pretty good for that data.
Re: [Scikit-learn-general] Incorporation of extra training examples
On 11.07.2012 10:11, Olivier Grisel wrote: LinearSVC is based on the liblinear C++ library, which AFAIK does not support sample weights. Well, that's true. You should rather have a look at SGDClassifier: http://scikit-learn.org/stable/modules/sgd.html I have already tried approaches like SGDClassifier or multinomial Naive Bayes. I can improve these two classifiers with sample weighting, but the thing is that LinearSVC without the incorporated data still outperforms the other algorithms. But I guess I will play around a bit more ;)
[Scikit-learn-general] Incorporation of extra training examples
Hey! I am currently doing text classification. I have the following setup: 78 classes, at most 1,500 training examples per class, around 90,000 training examples overall, and the same number of test examples. I am pretty happy with the classification results (~52% F1 score), which is fine for my task. But now I have another scenario. I have around 2,000,000 extra training examples available, which are produced by a certain group of users and do not _directly_ correspond to the classes, but I still know the labels of this data. If I train the classifier simply on this extra data (without the correct data) I can achieve an F1 score of ~25%. So this tells me that there is information available that I now want to incorporate into my existing data. For some few classes this data even works slightly better, or at least similarly. I have simply tried combining both datasets (90,000 + 2,000,000), but this makes the results worse (the amount of test data always stays the same). This is not surprising, because a lot of noise is added and I think the huge amount of extra data somehow overrules the existing data. My question now is how I can best incorporate this data in order to achieve better classification results than with my first dataset alone. Maybe someone has an idea, or there are some techniques for that. Just for the record: I use tf-idf with an SVC, which works best. I have also tried a different approach using topic models. Thanks and many regards, Philipp
Re: [Scikit-learn-general] Incorporation of extra training examples
On 09.07.2012 13:59, Vlad Niculae wrote: Another (hackish) idea to try would be to keep the labels of the extra data but give it a sample_weight low enough not to override your good training data. That's actually a great and simple idea. Would I do that similarly to this example: http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html So, for example, using a 10-times-higher weight for the corresponding samples as a starting point? I see that the fit method of LinearSVC doesn't have a sample_weight parameter, so I guess I would have to switch to another method. SVC unfortunately has a very long runtime compared to LinearSVC, but maybe an SGDClassifier would work. Regards, Philipp On 09.07.2012, at 12:43, Philipp Singer kill...@gmail.com wrote: Hey! I am currently doing text classification. I have the following setup: 78 classes, at most 1,500 training examples per class, around 90,000 training examples overall, and the same number of test examples. I am pretty happy with the classification results (~52% F1 score), which is fine for my task. But now I have another scenario. I have around 2,000,000 extra training examples available, which are produced by a certain group of users and do not _directly_ correspond to the classes, but I still know the labels of this data. If I train the classifier simply on this extra data (without the correct data) I can achieve an F1 score of ~25%. So this tells me that there is information available that I now want to incorporate into my existing data. For some few classes this data even works slightly better, or at least similarly. I have simply tried combining both datasets (90,000 + 2,000,000), but this makes the results worse (the amount of test data always stays the same). This is not surprising, because a lot of noise is added and I think the huge amount of extra data somehow overrules the existing data.
My question now is how I can best incorporate this data in order to achieve better classification results than with my first dataset alone. Maybe someone has an idea, or there are some techniques for that. Just for the record: I use Tf-Idf with an SVC, which works best. I have also tried a different approach using topic models. Thanks and many regards, Philipp
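Following up on the sample_weight idea: a minimal sketch of down-weighting the noisy extra examples with SGDClassifier, which (unlike LinearSVC) accepts sample_weight in fit. The 10:1 weight ratio and the toy data are assumptions, just a starting point to tune on held-out data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
# toy stand-ins: a small clean set and a large set with noisy labels
X_clean = rng.randn(100, 20)
y_clean = (X_clean[:, 0] > 0).astype(int)
X_noisy = rng.randn(1000, 20)
y_noisy = (X_noisy[:, 0] + rng.randn(1000) > 0).astype(int)

X = np.vstack([X_clean, X_noisy])
y = np.concatenate([y_clean, y_noisy])
# 10x weight on the clean samples, as suggested as a starting point
w = np.concatenate([np.full(len(y_clean), 10.0), np.full(len(y_noisy), 1.0)])

clf = SGDClassifier(loss="hinge", random_state=0)
clf.fit(X, y, sample_weight=w)
print(clf.score(X_clean, y_clean))
```

The weight ratio is a hyperparameter like any other; cross-validate it on the clean data only.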
Re: [Scikit-learn-general] Incorporation of extra training examples
Am 09.07.2012 13:47, schrieb Peter Prettenhofer: Hi, Hey! some quick thoughts: - if you use a multinomial Naive Bayes classifier (aka a language model) you can fit a background model on the large dataset and use that to smooth the model fitted on the smaller dataset. That's a nice idea. Is there a simple way to try this out quickly in scikit-learn? - you should look at the domain adaptation / multi-task learning literature - this might fit your setting better than traditional semi-supervised learning. Thanks, I will look into that. best, Peter Regards, Philipp 2012/7/9 Gael Varoquaux gael.varoqu...@normalesup.org: Hi, You can try setting this up as a semi-supervised learning problem and using label propagation: http://scikit-learn.org/stable/modules/label_propagation.html#label-propagation HTH, G
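There is no built-in background-model smoothing, but a rough sketch of Peter's idea — interpolating the per-class feature distributions of two MultinomialNB models — could look like this. The interpolation weight lam, the toy count data, and the direct surgery on feature_log_prob_ are all my assumptions to illustrate the idea, not an official API:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
# toy count data standing in for the small clean set and the large noisy set
X_small = rng.poisson(2.0, size=(40, 20))
y_small = rng.randint(0, 2, size=40)
X_big = rng.poisson(2.0, size=(400, 20))
y_big = rng.randint(0, 2, size=400)

fg = MultinomialNB().fit(X_small, y_small)   # foreground model (clean data)
bg = MultinomialNB().fit(X_big, y_big)       # background model (extra data)

# Jelinek-Mercer style interpolation of the per-class word distributions;
# both models must see the same classes in the same order for this to line up
lam = 0.8  # weight on the clean model; tune on held-out data
smoothed = MultinomialNB().fit(X_small, y_small)
smoothed.feature_log_prob_ = np.log(
    lam * np.exp(fg.feature_log_prob_)
    + (1 - lam) * np.exp(bg.feature_log_prob_)
)

pred = smoothed.predict(X_small)
print(pred.shape)
```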
Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation
In terms of accuracy. Runtime is not the problem. Philipp Am 01.06.2012 18:58, schrieb Andreas Mueller: Hi Philipp. Do you mean it performs worse in terms of accuracy or in terms of runtime? Cheers, Andy Am 01.06.2012 18:57, schrieb Philipp Singer: Hey! So I have tried adding epsilon to my entries. My first intuition was that it performs pretty similarly to my old dense version. But apparently I just ran into cases where this method performs much worse :( Any hints on that? Regards, Philipp Am 30.05.2012 15:52, schrieb Andreas Mueller: Hi Philipp. The problem with using sparse matrices is that adding an epsilon would make them dense. I haven't really looked at it, but I think it should somehow be possible to use this approximation on sparse matrices as well. Cheers, Andy Am 30.05.2012 15:45, schrieb Philipp Singer: Hey Andy! Yep, I am using it successfully ;) The idea of adding epsilon sounds legit. I will definitely try it out. I think it would be nice if you could add it to your code. It would also make it easier to work with sparse matrices. Regards, Philipp Hi Philipp. Great to hear that someone is using that :) The problem is that the approximation uses a log. Afaik even the exact kernel is not defined if two features are compared that are both exactly zero. Usually I just work around that by adding an epsilon. I was considering adding that to the code. What do you think? Cheers, Andy
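A small sketch of the epsilon workaround discussed in this thread, using AdditiveChi2Sampler. The epsilon value is an assumption; note that adding it to a sparse matrix densifies it, which is exactly the trade-off Andy mentions:

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler

rng = np.random.RandomState(0)
X = rng.rand(10, 5)
X[X < 0.5] = 0.0           # introduce exact zeros, as in sparse text features

eps = 1e-6                 # hypothetical epsilon; keeps the log well-defined
X_eps = X + eps            # on a scipy.sparse matrix this would densify it

chi2 = AdditiveChi2Sampler(sample_steps=2)
X_feat = chi2.fit_transform(X_eps)
# output has n_features * (2 * sample_steps - 1) columns
print(X_feat.shape)        # → (10, 15)
```

The transformed features can then be fed to a LinearSVC, giving an approximation of the additive chi-squared kernel SVM.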
Re: [Scikit-learn-general] Additive Chi2 Kernel Approximation
Hey Andy! Yep, I am using it successfully ;) The idea of adding epsilon sounds legit. I will definitely try it out. I think it would be nice if you could add it to your code. It would also make it easier to work with sparse matrices. Regards, Philipp Hi Philipp. Great to hear that someone is using that :) The problem is that the approximation uses a log. Afaik even the exact kernel is not defined if two features are compared that are both exactly zero. Usually I just work around that by adding an epsilon. I was considering adding that to the code. What do you think? Cheers, Andy
[Scikit-learn-general] Porter Stemmer
Hey! Is it possible to easily include stemming in scikit-learn's text feature extraction? I know that nltk has an implementation of the Porter stemmer, but I do not want to change my whole text feature extraction process over to nltk if possible. It would be nice if I could include that somehow easily. Thanks, Philipp
[Scikit-learn-general] Classificator for probability features
Hey there! I am currently trying to classify a dataset which has the following format: Class1 0.3 0.5 0.2 Class2 0.9 0.1 0.0 ... So the features are probabilities that always sum to exactly 1. I have tried several linear classifiers, but I am now wondering if there is maybe a better way to classify such data and achieve better results. Maybe someone has some ideas. Thanks and regards, Philipp
Re: [Scikit-learn-general] Classificator for probability features
Thanks a lot for the explanation. So do I see this right, that I would need to calculate the KL divergence for each pair of feature vectors? I have already tried to use a pipeline calculating an additive chi squared transform followed by a linear SVC. This boosts my results a bit. But I am still stuck at an f1 score of 0.25 and I want to improve this if possible. Is this the right way to do it? Maybe there are some tweaks intended, like changing the parameters etc. Sorry for the dumb questions, but I haven't used one of these methods until now. Still excited to learn more about that ;) Regards, Philipp Am 14.05.2012 21:18, schrieb David Warde-Farley: On Mon, May 14, 2012 at 05:00:54PM +0200, Philipp Singer wrote: Thanks, that sounds really promising. Is there an implementation of KL divergence in scikit-learn? If so, how can I directly use it? I don't believe there is, but it's quite simple to do yourself. Many algorithms in scikit-learn can take a precomputed distance matrix. Given two points, p and q, on the simplex, the KL divergence between the two discrete distributions represented is simply (p * np.log(p / q)).sum(). Note that this is in general not defined if they do not share the same support (i.e. if there is a zero at one spot in one but not in the other). In practice, if there are any zeros at all, you will need to deal with them carefully, as the logarithm and/or the division will misbehave. Note that the grandparent's note that the KL divergence is not a metric is not a minor concern: the KL divergence, for example, is _not_ symmetric (KL(p, q) != KL(q, p)). You can of course take the average of KL(p, q) and KL(q, p) to symmetrize it, but you still may run into problems with algorithms that assume that distances obey the triangle inequality (KL divergences do not). Personally I would recommend trying Andy's suggestion re: an SVM with a chi-squared kernel. For small instances you can precompute the kernel matrix and pass it to SVC yourself.
If you have a lot of data (or if you want to try it out quickly), the kernel approximations module plus a linear SVM is a good bet. David
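To make David's recipe concrete, a sketch of a symmetrized-KL precomputed distance matrix. The eps guard and the averaging are the workarounds discussed above; the helper names are my own:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL(p || q) for discrete distributions; eps guards zeros in p or q
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float((p * np.log(p / q)).sum())

def sym_kl_matrix(X):
    # symmetrized KL (average of both directions), usable where a
    # precomputed "distance" matrix is accepted -- but recall it still
    # violates the triangle inequality
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * (kl(X[i], X[j]) + kl(X[j], X[i]))
            D[i, j] = D[j, i] = d
    return D

X = np.array([[0.3, 0.5, 0.2], [0.9, 0.1, 0.0]])  # rows sum to 1
D = sym_kl_matrix(X)
print(D)
```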
[Scikit-learn-general] Text Documents - Vectorizer
Hey! I am currently using sklearn.feature_extraction.text.Vectorizer (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.Vectorizer.html) for feature extraction of the text documents I have. I am now curious and don't quite understand how the TFIDF calculation is done. Is it done separately for each document, or based on all documents? It can't be done for each class of documents, because information about the labels is not available. Hope you can give me some explanation regarding this. Thanks! Philipp -- This SF email is sponsosred by: Try Windows Azure free for 90 days Click Here http://p.sf.net/sfu/sfd2d-msazure ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Re: [Scikit-learn-general] Text Documents - Vectorizer
The IDF statistics are computed once on the whole training corpus as passed to the `fit` method, and then reused on each call to the `transform` method. For a train / test split, one typically calls fit_transform on the train split (to compute the IDF vector on the train split only) and reuses those IDF values on the test split by calling transform only: vec = TfidfVectorizer() tfidf_train = vec.fit_transform(documents_train) tfidf_test = vec.transform(documents_test) The TF-IDF feature extraction per se is unsupervised (it does not need the labels). You can then train a supervised classifier on the output to use the class of the document, and pipeline both to get a document classifier. The new documentation is here: http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction Here is a sample pipeline: http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html Alright, thanks for the heads up. That's exactly the way I am using it. Okay, so the tfidf values are computed over the whole corpus. Wouldn't it make sense to just treat the documents belonging to one class as the corpus for the calculation? Regards, Philipp
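Putting the fit/transform discipline above together with a classifier, a hedged sketch (the toy documents and labels are made up): wrapping both steps in a Pipeline guarantees the IDF vector is fitted on training data only, even inside cross-validation:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# toy corpus standing in for real labeled documents
docs_train = ["good movie", "bad movie", "great film", "awful film"]
y_train = [1, 0, 1, 0]

# the pipeline calls fit_transform on the vectorizer during fit,
# and plain transform during predict -- no IDF leakage from test data
clf = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
clf.fit(docs_train, y_train)
print(clf.predict(["good film"]))
```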
Re: [Scikit-learn-general] Best classification for very sparse and skewed feature matrix
Am 15.01.2012 19:45, schrieb Gael Varoquaux: On Sun, Jan 15, 2012 at 07:39:00PM +0100, Philipp Singer wrote: The problem is that my representation is very sparse, so I have a huge amount of zeros. That's actually good: some of our estimators are able to use a sparse representation to speed up computation. Furthermore, the dataset is skewed, so one class takes a huge share of the labels and another one is also pretty large. I have successfully used logistic regression and I could achieve a recall of about 65% (on the best-case dataset). I am pretty happy with that result. But when looking at the confusion matrix, the problem is that many examples get mapped to the large class. Use class_weight='auto' in the logistic regression to counter the effect of unbalanced classes. For SVMs, the following example shows the trick: http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html HTH, Gael -- RSA(R) Conference 2012 Mar 27 - Feb 2 Save $400 by Jan. 27 Register now! http://p.sf.net/sfu/rsa-sfdev2dev2 ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general Thanks a lot for the help! This helped out quite a bit. But I am still not entirely happy with the results. Maybe some further ideas? Thanks a lot Philipp -- Keep Your Developer Skills Current with LearnDevNow! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-d2d ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
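A sketch of Gael's class_weight suggestion. The thread says class_weight='auto'; in current scikit-learn the equivalent spelling is 'balanced' (weights inversely proportional to class frequencies). The imbalanced toy data is made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# imbalanced toy data: roughly 90% class 0, 10% class 1
X = rng.randn(200, 5)
y = (rng.rand(200) < 0.1).astype(int)
X[y == 1] += 1.0   # shift the minority class so it is learnable

# 'balanced' reweights the loss so the minority class is not swamped,
# countering the "everything maps to the large class" effect
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(clf.score(X, y))
```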