Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?
Although I note that I've got LaTeX compilation errors, so I'm not sure how Andy compiles this.

On 16 April 2015 at 20:25, Joel Nothman <joel.noth...@gmail.com> wrote:
> I've proposed a better chapter ordering at https://github.com/scikit-learn/scikit-learn/pull/4602...

On 16 April 2015 at 03:48, Andreas Mueller <t3k...@gmail.com> wrote:
> Hi. Yes, run `make latexpdf` in the doc folder. Best, Andy

On 04/15/2015 01:11 PM, Tim wrote:
> Thanks, Andy! How do you generate the pdf file? Can I also do that?

On Wednesday, April 15, 2015, 12:55 PM, Andreas Mueller <t3k...@gmail.com> wrote:
> Hi Tim. There are pdfs for 0.16.0 and 0.16.1 up now at http://sourceforge.net/projects/scikit-learn/files/documentation/ Let us know if there are issues with it. Cheers, Andy

On 04/15/2015 12:08 PM, Tim wrote:
> Hello, I am looking for a pdf file of the documentation for the latest stable scikit-learn, i.e. 0.16.1. I followed http://scikit-learn.org/stable/support.html#documentation-resources, which leads me to http://sourceforge.net/projects/scikit-learn/files/documentation/, but the pdf files there are for versions <= 0.12 and no later than 2012. Can the official team make the pdf files available? Thanks!
--
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-event?utm_source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?
I've proposed a better chapter ordering at https://github.com/scikit-learn/scikit-learn/pull/4602...

On 16 April 2015 at 03:48, Andreas Mueller <t3k...@gmail.com> wrote:
> Hi. Yes, run `make latexpdf` in the doc folder. Best, Andy
Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?
Thanks again! Can your scripts also create pdf bookmarks of third or lower levels? E.g.

    ...
    4.1.1 Ordinary Least Squares
    4.1.2 Ridge Regression
          Ridge Complexity
          Setting the regularization parameter: generalized Cross-Validation
    4.1.3 Lasso
          Setting regularization parameter
          Using cross-validation
          Information-criteria based model selection
    4.1.4 Elastic Net
    ...

Can we also show the numerical numberings in pdf bookmarks? E.g. "4.1 Generalized Linear Models" versus "Generalized Linear Models".

On Wed, 4/15/15, Andreas Mueller <t3k...@gmail.com> wrote:
> Hi. Yes, run `make latexpdf` in the doc folder. Best, Andy
Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?
This is the sphinx latex build, not a script of ours. I'm not sure; you can consult the sphinx documentation: http://sphinx-doc.org/

On 04/16/2015 07:48 AM, Tim wrote:
> Thanks again! Can your scripts also create pdf bookmarks of third or lower levels? Can we also show the numerical numberings in pdf bookmarks?
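For what it's worth, one hedged option for Tim's bookmark questions (an assumption on my part, not something tested in this thread): Sphinx lets you pass hyperref options into the LaTeX build through `latex_elements` in `doc/conf.py`, and `bookmarksdepth` / `bookmarksnumbered` are the hyperref keys that control bookmark depth and numbering:

```python
# Hypothetical sketch for doc/conf.py -- not from this thread.
# `latex_elements` is Sphinx's hook for customizing its LaTeX output;
# `bookmarksdepth` and `bookmarksnumbered` are hyperref package options.
latex_elements = {
    "preamble": r"\hypersetup{bookmarksdepth=3,bookmarksnumbered=true}",
}
```

After adding this, rebuilding with `make latexpdf` should regenerate the PDF with the deeper, numbered bookmarks.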
Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?
Interestingly, this time I didn't get any errors (I got them before). But you get a pdf even with the errors.

On 04/16/2015 06:26 AM, Joel Nothman wrote:
> Although I note that I've got LaTeX compilation errors, so I'm not sure how Andy compiles this.

On 16 April 2015 at 20:25, Joel Nothman <joel.noth...@gmail.com> wrote:
> I've proposed a better chapter ordering at https://github.com/scikit-learn/scikit-learn/pull/4602...
Re: [Scikit-learn-general] Robust PCA
How about something like this:

1. Basic implementation of ALM uses arpack (not ideal, but it means sklearn can have RPCA available)
2. Option to use randomized SVD if desired
3. Option to use propack if desired and it's available (or if/when scipy begins to use it)
4. GoDec implementation for low rank + sparse + noise

On Wed, Apr 15, 2015 at 4:06 PM, scikit-learn-general-requ...@lists.sourceforge.net wrote:

Today's Topics:
1. Re: Scikit-learn-general Digest, Vol 63, Issue 34 (Alex Papanicolaou)
2. Re: Robust PCA (Olivier Grisel)
3. Re: Robust PCA (Kyle Kastner)
4. Re: Robust PCA (Yogesh Karpate)
5. Re: Performance of LSHForest (Joel Nothman)

Message: 1
Date: Wed, 15 Apr 2015 11:22:17 -0700
From: Alex Papanicolaou <alex.papa...@gmail.com>
Subject: Re: [Scikit-learn-general] Scikit-learn-general Digest, Vol 63, Issue 34

Kyle, Andreas,

Here is my github repo: https://github.com/apapanico/RPCA

Responses:

1. I didn't make the GSoC suggestion a few years ago (also not a student anymore :-(, just using RPCA for work); I just came across it in a google search when trying to find python implementations.

2. As for GoDec, I have not poked around with it but I would like to.
I had intended to use this as a starting point: https://sites.google.com/site/godecomposition/home But yeah, it sounds like it can go much bigger. If I'm not mistaken, it's technically a different problem (low rank + sparse + noise).

3. Regarding PROPACK, the main routine needed is lansvd, which implements Lanczos bidiagonalization with partial reorthogonalization. I do not know what else it depends on. I also do not know if there's an implementation in C, which would be preferred, obviously. A routine for computing only the top-k singular triplets is pretty key for making Candes' ALM method as efficient as possible. Along these lines, I started out using the randomized SVD from sklearn, but I was failing my tests generated with the original Matlab code, so I switched to numpy svd and then finally svdp in pypropack.

Cheers, Alex

Message: 2
Date: Wed, 15 Apr 2015 15:40:33 -0400
From: Olivier Grisel <olivier.gri...@ensta.org>
Subject: Re: [Scikit-learn-general] Robust PCA

We could use PyPROPACK if it was contributed upstream in scipy ;) I know that some scipy maintainers don't appreciate arpack much and would like to see it replaced (or at least complemented with propack).

-- Olivier

Message: 3
Date: Wed, 15 Apr 2015 15:51:01 -0400
From: Kyle Kastner <kastnerk...@gmail.com>
Subject: Re: [Scikit-learn-general] Robust PCA

If it was in scipy, would it be backported to the older versions? How would we handle that?
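The ALM pipeline being discussed (an SVD step plus shrinkage) can be sketched in a few lines of NumPy. This is a minimal inexact-ALM sketch, not Alex's repo and not a scikit-learn API; the function name `rpca_ialm` and the defaults (e.g. lam = 1/sqrt(max(m, n)), from Candes et al.'s Robust PCA paper) are illustrative, and the full `np.linalg.svd` stands in where a top-k routine like PROPACK's lansvd or a randomized SVD would go:

```python
import numpy as np

def shrink(X, tau):
    # Elementwise soft-thresholding: proximal operator of the l1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca_ialm(M, lam=None, tol=1e-7, max_iter=500):
    # Decompose M into low-rank L plus sparse S via inexact ALM
    # (minimize ||L||_* + lam*||S||_1 subject to M = L + S).
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))       # default from Candes et al.
    norm2 = np.linalg.norm(M, 2)
    Y = M / max(norm2, np.max(np.abs(M)) / lam)  # dual variable init
    mu, rho = 1.25 / norm2, 1.5
    S = np.zeros_like(M)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)    # low-rank update
        S = shrink(M - L + Y / mu, lam / mu) # sparse update
        R = M - L - S                        # constraint residual
        Y = Y + mu * R
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S
```

The full SVD in every iteration is exactly why a routine computing only the top-k singular triplets matters so much for making this efficient on large matrices.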
Re: [Scikit-learn-general] Performance of LSHForest
Hi Joel,

To extend your analysis:
- when n_samples*n_indices is large enough, the bottleneck is the use of the index, as you say.
- when n_dimensions*n_candidates is large enough, the bottleneck is computation of true distances between DB points and the query.

Serving both kinds of use cases well is perfectly possible, but requires an index that is both: A) fast, and B) used optimally to reduce the number of candidates for which we compare distances. Here is a variant of your proposal (better keep track of context) that also requires a little Cython but improves both aspects A and B and reduces code complexity.

Observation I: Only a single binary search per index is necessary, the first. After we find the correct location for the query binary code, we can restrict ourselves to the n_candidates (or even fewer) entries before and after that location. So no further binary searches are necessary at all, and the restriction to a small linear part of the array should be much more cache friendly. This makes full use of our array implementation of an ordered collection, instead of acting as if we were still on a binary tree implementation as in the original LSH-Forest paper. There is a price to pay for this simplification: we are now looking at (computing full distance from query for) 2*n_candidates*n_indices points, which can be expensive (we improved A at a cost to B). But here is where some Cython can be really useful.

Observation II: The best information we can extract from the binary representation is not the distances in the tree structure, but hamming distances to the query. So after the restriction of I, compute the *hamming distances* of the 2*n_candidates*n_indices points, each from the binary representation of the query (corresponding to the appropriate index). Then compute the full metric only for the n_candidates with the lowest hamming distances. This should achieve a pretty good sweet spot of performance, with just a bit of Cython.
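The two observations above can be sketched in a few lines of NumPy (a hypothetical illustration, not scikit-learn's LSHForest code; the function name `candidates_by_hamming` and the use of 32-bit codes are assumptions for the example): one searchsorted per index, a window of 2*n_candidates codes around the hit, then hamming filtering via XOR + popcount.

```python
import numpy as np

def candidates_by_hamming(sorted_codes, query_code, n_candidates):
    # sorted_codes: 1-D uint32 array of hash codes, sorted ascending.
    # One binary search locates the query among the codes ...
    pos = np.searchsorted(sorted_codes, query_code)
    lo = max(pos - n_candidates, 0)
    hi = min(pos + n_candidates, len(sorted_codes))
    window = sorted_codes[lo:hi]
    # ... then hamming distances to the query are popcounts of the XOR.
    bits = np.unpackbits((window ^ np.uint32(query_code)).view(np.uint8))
    dists = bits.reshape(len(window), -1).sum(axis=1)
    # Keep the n_candidates window entries closest in hamming distance.
    keep = np.argsort(dists, kind="stable")[:n_candidates]
    return lo + keep, dists[keep]
```

A Cython version would mainly replace the `unpackbits` popcount, avoiding its memory overhead.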
Daniel

On 04/16/2015 12:18 AM, Joel Nothman wrote:
> Once we're dealing with a large enough index and n_candidates, most time is spent in searchsorted in the synchronous ascending phase, while any overhead around it is marginal. Currently we are searching over the whole array in each searchsorted, while it could be rewritten to keep better track of context to cut down the overall array when searching. While possible, I suspect this will look confusing in Python/Numpy, and Cython will be a clearer and faster way to present this logic. On the other hand, time spent in _compute_distances is substantial, and yet most of its consumption is /outside/ of pairwise_distances. This commit https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a cuts a basic benchmark from 85 to 70 seconds. Vote here for merge: https://github.com/scikit-learn/scikit-learn/pull/4603!

On 16 April 2015 at 12:32, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:
> Moreover, this drawback occurs because LSHForest does not vectorize multiple queries as 'ball_tree' or any other method does. This slows the exact neighbor distance calculation down significantly after approximation. This will not be a problem if queries are for individual points. Unfortunately, the former is the more useful usage of LSHForest. Are you trying individual queries or multiple queries (for n_samples)?

On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher <daniel.vainsenc...@gmail.com> wrote:
> LSHForest is not intended for dimensions at which exact methods work well, nor for tiny datasets. Try d500, n_points10; I don't remember the switchover point. The documentation should make this clear, but unfortunately I don't see that it does.

On Apr 15, 2015 7:08 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
> I agree this is disappointing, and we need to work on making LSHForest faster.
> Portions should probably be coded in Cython, for instance, as the current implementation is a bit circuitous in order to work in numpy. PRs are welcome. LSHForest could use parallelism to be faster, but so can (and will) the exact neighbors methods. In theory in LSHForest, each tree could be stored on an entirely different machine, providing memory benefits, but scikit-learn can't really take advantage of this. Having said that, I would also try with higher n_features and n_queries. We have to limit the scale of our examples in order to limit the overall document compilation time.

On 16 April 2015 at 01:12, Miroslav Batchkarov <mbatchka...@gmail.com> wrote:
Re: [Scikit-learn-general] gradient boost classifier - feature_importances_
Never mind my question. I forgot gridsearch was the actual object. Thanks,

From: Pagliari, Roberto [rpagli...@appcomsci.com]
Sent: Thursday, April 16, 2015 12:50 PM
To: scikit-learn-general@lists.sourceforge.net
Subject: [Scikit-learn-general] gradient boost classifier - feature_importances_

Is feature_importances_ available from gradient boosting? It is mentioned in the documentation, but it doesn't exist when I try to access it (after fitting via grid search). I printed 'dir' of the object and can't see it. Thanks,
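The gotcha here: after a grid search, `feature_importances_` lives on the fitted estimator, not on the grid-search wrapper itself. A minimal sketch (the dataset and parameter grid are made up for illustration, and it uses the current `sklearn.model_selection` import path, which postdates this 2015 thread):

```python
# Sketch: feature_importances_ is on best_estimator_, not on GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [1, 2]},
    cv=3,
)
search.fit(X, y)

# Accessing search.feature_importances_ would fail; the fitted
# GradientBoostingClassifier carries the attribute instead.
importances = search.best_estimator_.feature_importances_
```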
Re: [Scikit-learn-general] Performance of LSHForest
I more or less agree. Certainly we only need to do one searchsorted per query per tree, and then do linear scans. There is a question of how close we stay to the original LSHForest algorithm, which relies on matching prefixes rather than hamming distance. Hamming distance is easier to calculate in NumPy and is probably faster to calculate in C too (with or without using POPCNT). Perhaps the only advantage of using Cython in your solution is to avoid the memory overhead of unpackbits. However, n_candidates before and after is arguably not sufficient if one side has more than n_candidates with a high prefix overlap. But until we look at the suffixes we can't know if it is closer or farther in hamming distance. I also think the use of n_candidates in the current code is somewhat broken, as suggested by my XXX comment in _get_candidates, which we discussed but did not resolve clearly. I think it will be hard to make improvements of this sort without breaking the current results and parameter sensitivities of the implementation. On 17 April 2015 at 00:16, Daniel Vainsencher daniel.vainsenc...@gmail.com wrote: Hi Joel, To extend your analysis: - when n_samples*n_indices is large enough, the bottleneck is the use of the index, as you say. - when n_dimensions*n_candidates is large enough, the bottleneck is computation of true distances between DB points and the query. To serve well both kinds of use cases is perfectly possible, but requires use of the index that is both: A) Fast B) Uses the index optimally to reduce the number of candidates for which we compare distances. Here is a variant of your proposal (better keep track of context) that also requires a little Cython but improves both aspects A and B and reduces code complexity. Observation I: Only a single binary search per index is necessary, the first. After we find the correct location for the query binary code, we can restrict ourselves to the n_candidates (or even fewer) before and after that location. 
So no further binary searches are necessary at all, and the restriction to a small linear part of the array should be much more cache friendly. This makes full use of our array implementation of orderedcollection, instead of acting as if we were still on a binary tree implementation as in the original LSH-Forest paper. There is a price to pay for this simplification: we are now looking at (computing full distance from query for) 2*n_candidates*n_indices points, which can be expensive (we improved A at a cost to B). But here is where some Cython can be really useful. Observation II: The best information we can extract from the binary representation is not the distances in the tree structure, but hamming distances to the query. So after the restriction of I, compute the *hamming distances* of the 2*n_candidate*n_indices points each from the binary representation of the query (corresponding to the appropriate index). Then compute full metric only for the n_candidates with the lowest hamming distances. This should achieve a pretty good sweet spot of performance, with just a bit of Cython. Daniel On 04/16/2015 12:18 AM, Joel Nothman wrote: Once we're dealing with large enough index and n_candidates, most time is spent in searchsorted in the synchronous ascending phase, while any overhead around it is marginal. Currently we are searching over the whole array in each searchsorted, while it could be rewritten to keep better track of context to cut down the overall array when searching. While possible, I suspect this will look confusing in Python/Numpy, and Cython will be a clearer and faster way to present this logic. On the other hand, time spent in _compute_distances is substantial, and yet most of its consumption is /outside/ of pairwise_distances. This commit https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a cuts a basic benchmark from 85 to 70 seconds. 
Vote here for merge: https://github.com/scikit-learn/scikit-learn/pull/4603!

On 16 April 2015 at 12:32, Maheshakya Wijewardena pmaheshak...@gmail.com wrote:

Moreover, this drawback occurs because LSHForest does not vectorize multiple queries as 'ball_tree' or the other methods do. This slows the exact neighbor distance calculation down significantly after approximation. It will not be a problem if queries are for individual points; unfortunately, the former is the more useful usage of LSHForest. Are you trying individual queries or multiple queries (for n_samples)?

On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher daniel.vainsenc...@gmail.com wrote:

LSHForest is not intended for dimensions at which exact methods work well, nor for tiny datasets. Try d500, n_points10; I don't remember the switchover point.
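[Editor's sketch] Daniel's two observations can be sketched in NumPy. The helper below is illustrative only (its name and signature are invented, not LSHForest internals), assuming each tree stores its hashes as a sorted uint64 array: one searchsorted locates the query, a small window around it is sliced out, and the window is ranked by Hamming distance via XOR plus unpackbits.

```python
import numpy as np

def query_candidates(sorted_hashes, query_hash, n_candidates):
    # Observation I: a single binary search locates the query; the best
    # prefix matches must lie in a small window around that position,
    # so no further binary searches are needed.
    pos = int(np.searchsorted(sorted_hashes, query_hash))
    lo = max(pos - n_candidates, 0)
    hi = min(pos + n_candidates, len(sorted_hashes))
    window = sorted_hashes[lo:hi]

    # Observation II: rank the window by Hamming distance to the query
    # rather than by prefix length: XOR, then popcount via unpackbits
    # (this temporary bit array is the memory overhead Joel mentions).
    xored = (window ^ query_hash).view(np.uint8).reshape(len(window), -1)
    hamming = np.unpackbits(xored, axis=1).sum(axis=1)
    order = np.argsort(hamming, kind="stable")[:n_candidates]
    return lo + order  # positions in sorted_hashes of the best candidates
```

The full metric would then be computed only for the returned positions, merged across trees.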
Re: [Scikit-learn-general] Performance of LSHForest
On 04/16/2015 05:49 PM, Joel Nothman wrote:

> I more or less agree. Certainly we only need to do one searchsorted per query per tree, and then do linear scans. There is a question of how close we stay to the original LSHForest algorithm, which relies on matching prefixes rather than Hamming distance. Hamming distance is easier to calculate in NumPy and is probably faster to calculate in C too (with or without POPCNT). Perhaps the only advantage of using Cython in your solution is to avoid the memory overhead of unpackbits.

You obviously know more than I do about Cython vs. NumPy options.

> However, n_candidates before and after is arguably not sufficient if one side has more than n_candidates with a high prefix overlap.

I disagree. Being able to look at 2*n_candidates that must contain n_candidates of the closest ones, rather than however many happen to agree on some number of bits, is a feature, not a bug; especially if we then filter them by Hamming distance.

> But until we look at the suffixes we can't know if it is closer or farther in Hamming distance. I also think the use of n_candidates in the current code is somewhat broken, as suggested by my XXX comment in _get_candidates, which we discussed but did not resolve clearly.

I think n_candidates embodies a reasonable desire to set the investment per query, but there are many ways to do that. Since the research does not provide very good advice (I think the guarantees require looking at sqrt(db size) candidates for every query), it leaves the field wide open for different schemes. Should the number of distances calculated be per number of indices, or fixed? Etc. I am proposing a particular definition:
- read 2*n_candidates*n_indices hashes;
- calculate n_candidates full distances.
The motivation for the first is that it allows us to get all n_candidates from the same side of the same index if that is where the good stuff (Hamming distance-wise) is, and calculating Hamming distances seems not too expensive. But if someone argues we should calculate 10x as many Hamming distances to decide which full distances to compute, I don't have a very good argument one way or the other.

> I think it will be hard to make improvements of this sort without breaking the current results and parameter sensitivities of the implementation.

You seem to be assuming it is tuned; I am not even sure there exists a precise sense in which it is tunable, except for a particular dataset (and that is not very useful) :)

Daniel
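[Editor's sketch] Daniel's proposed budget (read 2*n_candidates*n_indices hashes, calculate n_candidates full distances) could look like the following. The helper and its inputs are hypothetical, assuming the per-tree candidate windows and their Hamming distances to the query have already been computed:

```python
import numpy as np

def budget_candidates(pools, pool_hamming, n_candidates):
    # pools: candidate row indices read from each of the n_indices trees
    # (2*n_candidates per tree); pool_hamming: each candidate's Hamming
    # distance to the query under that tree's hash.
    cand = np.concatenate(pools)
    ham = np.concatenate(pool_hamming)
    # a point found in several trees is kept once, with one of its scores
    cand, first = np.unique(cand, return_index=True)
    ham = ham[first]
    # the expensive full metric is then computed for only n_candidates
    keep = np.argsort(ham, kind="stable")[:n_candidates]
    return cand[keep]
```

This realizes the asymmetry in the proposal: many cheap Hamming comparisons buy a fixed, small budget of exact distance computations.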
Re: [Scikit-learn-general] Robust PCA
GoDec might not have the citations (yet) to be added to scikit-learn, but I think a basic ALM-based RPCA would be a great addition, along with a cool demo. Smart background subtraction would be my vote, though it might be too heavyweight; I could see a cool example of something like colored bouncing balls overlaid on the china picture that is built into sklearn.

On Thu, Apr 16, 2015 at 1:18 PM, Alex Papanicolaou alex.papa...@gmail.com wrote:

How about something like this:
1. Basic implementation of ALM uses arpack (not ideal, but it means sklearn can have RPCA available).
2. Option to use randomized SVD if desired.
3. Option to use PROPACK if desired and it's available (or if/when scipy begins to use it).
4. GoDec implementation for low rank + sparse + noise.

On Wed, Apr 15, 2015 at 4:06 PM, scikit-learn-general-requ...@lists.sourceforge.net wrote:

Today's Topics:
1. Re: Scikit-learn-general Digest, Vol 63, Issue 34 (Alex Papanicolaou)
2. Re: Robust PCA (Olivier Grisel)
3. Re: Robust PCA (Kyle Kastner)
4. Re: Robust PCA (Yogesh Karpate)
5.
Re: Performance of LSHForest (Joel Nothman)

Message: 1
Date: Wed, 15 Apr 2015 11:22:17 -0700
From: Alex Papanicolaou alex.papa...@gmail.com
Subject: Re: [Scikit-learn-general] Scikit-learn-general Digest, Vol 63, Issue 34

Kyle, Andreas,

Here is my github repo: https://github.com/apapanico/RPCA

Responses:
1. I didn't make the GSoC suggestion a few years ago (I'm also not a student anymore :-(, just using RPCA for work); I just came across it in a Google search when trying to find Python implementations.
2. As for GoDec, I have not poked around with it, but I would like to. I had intended to use this as a starting point: https://sites.google.com/site/godecomposition/home. It sounds like it can go much bigger, but if I'm not mistaken, it's technically a different problem (low rank + sparse + noise).
3. Regarding PROPACK, the main routine needed is lansvd, which implements Lanczos bidiagonalization with partial reorthogonalization. I do not know what else it depends on, nor whether there's an implementation in C, which would obviously be preferred. A routine for computing only the top-k singular triplets is pretty key to making Candès' ALM method as efficient as possible. Along these lines, I started out using the randomized SVD from sklearn, but I was failing my tests (generated with the original Matlab code), so I switched to numpy svd and then finally svdp in pypropack.

Cheers, Alex
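[Editor's note] For computing only the top-k singular triplets, which is lansvd's job, scipy's ARPACK-backed svds is the in-tree stand-in the thread alludes to. A small sketch (not an endorsement over PROPACK); note that svds returns singular values in ascending order, so they need flipping:

```python
import numpy as np
from scipy.sparse.linalg import svds

def topk_svd(A, k):
    # Compute only the k largest singular triplets via ARPACK,
    # then reorder to the conventional descending order.
    U, s, Vt = svds(A, k=k)
    order = np.argsort(s)[::-1]
    return U[:, order], s[order], Vt[order]
```

For RPCA's singular value thresholding, this avoids the full O(mn·min(m,n)) SVD when only a few values exceed the threshold.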
Message: 2
Date: Wed, 15 Apr 2015 15:40:33 -0400
From: Olivier Grisel olivier.gri...@ensta.org
Subject: Re: [Scikit-learn-general] Robust PCA

We could use PyPROPACK if it were contributed upstream to scipy ;) I know that some scipy maintainers don't appreciate arpack much and would like to see it replaced (or at least complemented with propack).

-- Olivier

Message: 3
Date: Wed, 15 Apr 2015 15:51:01 -0400
From: Kyle Kastner kastnerk...@gmail.com
Subject: Re: [Scikit-learn-general] Robust PCA

If it were in scipy, would it be backported to the older versions? How would we handle that?

On Wed, Apr 15, 2015 at 3:40 PM, Olivier Grisel olivier.gri...@ensta.org wrote:

> We could use PyPROPACK if it were contributed upstream to scipy ;) I know that some scipy maintainers don't appreciate arpack much and would like to see it replaced (or at least complemented with propack).
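[Editor's sketch] The basic inexact-ALM RPCA the thread discusses fits in a few dozen lines of NumPy. This is an illustrative sketch following the usual Candès-style formulation (function name and parameter defaults are the standard ones from the literature, not an sklearn API); the full SVD here is exactly what a truncated lansvd or randomized SVD would replace at scale:

```python
import numpy as np

def rpca_ialm(M, lam=None, tol=1e-7, max_iter=500):
    """Inexact ALM for robust PCA: decompose M ~ L (low rank) + S (sparse)."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm_M = np.linalg.norm(M)
    mu = 1.25 / np.linalg.norm(M, 2)  # penalty, scaled by spectral norm
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(max_iter):
        # L-step: singular value shrinkage of M - S + Y/mu
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        s = np.maximum(s - 1.0 / mu, 0.0)
        L = (U * s) @ Vt
        # S-step: entrywise soft-thresholding of M - L + Y/mu
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # dual ascent on the constraint M = L + S, then grow the penalty
        Y += mu * (M - L - S)
        if np.linalg.norm(M - L - S) <= tol * norm_M:
            break
        mu *= 1.5
    return L, S
```

Only the truncated singular triplets above the 1/mu threshold matter in the L-step, which is why a top-k routine like lansvd makes the method efficient.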