[Scikit-learn-general] Cython profiling question
Hey all. A slightly off-topic question about Cython profiling. I'm pretty sure I could use yep for profiling, as mentioned in the docs (http://scikit-learn.org/dev/developers/performance.html#profiling-compiled-extensions), and get line-by-line counts. However, I did not manage to do that recently. I usually took the gcc commands Cython emits, removed -DNDEBUG, and added -pg. The docs also say I need a debug version of Python and numpy, but I'm pretty sure I never did that before. Can anyone confirm or deny how to do this? If I figure out a good way, I'm totally volunteering to add it to the docs ;)

Cheers,
Andy

--
Want fast and easy access to all the code in your enterprise? Index and search up to 200,000 lines of code with a free copy of Black Duck Code Sight - the same software that powers the world's largest code search on Ohloh, the Black Duck Open Hub! Try it now. http://p.sf.net/sfu/bds
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
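For reference, a minimal sketch of how yep is typically driven from Python, assuming both yep and google-perftools are installed (the function name and output path here are illustrative, not from the thread):

```python
# Hedged sketch: sample a hot function with yep, which wraps the
# Google Perftools CPU profiler and writes a pprof-compatible file.
def hot_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

try:
    import yep
    yep.start("hot_loop.prof")   # begin CPU sampling, profile goes to this file
    hot_loop(10**6)
    yep.stop()                   # stop sampling and flush the profile to disk
except ImportError:
    # yep (or perftools) not available; just run the workload unprofiled
    hot_loop(10**6)
```

The resulting `hot_loop.prof` can then be inspected with `google-pprof` (or `pprof`) against the Python binary; per-function cost is available without a debug build, while line-level attribution generally needs debug symbols in the extension.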
Re: [Scikit-learn-general] Cython profiling question
2014-07-16 16:43 GMT+02:00 Andy t3k...@gmail.com:
> The docs also say I need a debug version of python and numpy, but I'm
> pretty sure I never did that before.

sudo apt-get install python-{numpy,scipy}-dbg

I must admit I never did this; I use Google Perftools.
Re: [Scikit-learn-general] Cython profiling question
On 07/16/2014 05:17 PM, Lars Buitinck wrote:
> sudo apt-get install python-{numpy,scipy}-dbg
>
> I must admit I never did this; I use Google Perftools.

That is using Google Perftools. So you get line-by-line with Google Perftools without using debugging versions? How?
Re: [Scikit-learn-general] Cython profiling question
2014-07-16 17:29 GMT+02:00 Andy t3k...@gmail.com:
> That is using Google Perftools.

I thought you were referring to the bit about gprof.

> So you get line-by-line with Google Perftools without using debugging
> versions? How?

I don't, I look at per-function cost.
Re: [Scikit-learn-general] Cython profiling question
No way to really get line-by-line without some sort of debug info, AFAIK; it throws that info away at compile time.

On Wed, Jul 16, 2014 at 11:44 AM, Lars Buitinck larsm...@gmail.com wrote:
> I don't, I look at per-function cost.
[Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15
Hi,

I have noticed a change with the LabelBinarizer between version 0.15 and those before. Prior to 0.15, this worked:

>>> lb = LabelBinarizer()
>>> lb.fit_transform(['a', 'b', 'c'])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
>>> lb.transform(['a', 'd', 'e'])
array([[1, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

Note that both values 'd' and 'e', having never been seen while the LabelBinarizer was being trained, were simply mapped to [0, 0, 0] by transform, which I interpreted and used as an "unknown" class. This is useful in cases where your test data can contain values which could not be known in advance, i.e. at the time of training (and also to avoid data leakage while doing cross-validation).

With 0.15, the same code now gives the error:

[...]
ValueError: classes ['a' 'b' 'c'] missmatch with the labels ['a' 'd' 'e']found in the data

I wrote about this question a couple of months ago, in regard to a similar issue with the LabelEncoder: http://sourceforge.net/p/scikit-learn/mailman/message/31827616/

So if my understanding of this mechanism is correct (as well as my assumptions about the way it is/should be used), would it make sense to add something like a map_unknowns_to_single_class extra parameter to all the preprocessing encoders, so that this behaviour can at least be implemented optionally?

Thanks,
Christian
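The pre-0.15 behaviour being described can be sketched in a few lines of plain Python; this is only an illustrative re-implementation of the old transform semantics, not the actual scikit-learn code:

```python
# Minimal sketch of pre-0.15 LabelBinarizer.transform semantics:
# labels seen during fit get a one-hot row; unseen labels map to an
# all-zero row instead of raising an error.
def binarize(classes, labels):
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        if label in index:          # unseen labels simply stay all-zero
            row[index[label]] = 1
        rows.append(row)
    return rows

print(binarize(['a', 'b', 'c'], ['a', 'd', 'e']))
# [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```

The all-zero row is what the poster was reading as an implicit "unknown" class.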
Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15
Hi,

This looks like a regression. Can you open an issue on GitHub?

I am not sure that it would make sense to add an unknown-class column with an optional parameter. But you could easily add one with some numpy operations:

np.hstack([y, y.sum(axis=1, keepdims=True) == 0])

Best regards,
Arnaud

On 16 Jul 2014, at 19:24, Christian Jauvin cjau...@gmail.com wrote:
> [...]
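Arnaud's numpy one-liner can be seen end-to-end like this (the example matrix is the binarized output from the original post; the column-appending trick is exactly as suggested):

```python
import numpy as np

# Append an explicit "unknown" column that is 1 exactly where a
# binarized row is all zeros (i.e. the label was never seen in fit).
y = np.array([[1, 0, 0],
              [0, 0, 0],
              [0, 0, 0]])
y_ext = np.hstack([y, y.sum(axis=1, keepdims=True) == 0]).astype(int)
print(y_ext)
# [[1 0 0 0]
#  [0 0 0 1]
#  [0 0 0 1]]
```

The boolean column is upcast to 0/1 on hstack, so the result is a valid one-hot matrix with one extra class for unknowns.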
Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15
I can open an issue, but on the other hand, you could argue that the new behaviour is now at least consistent with the other encoder types, e.g.:

>>> le = LabelEncoder()
>>> le.fit_transform(['a', 'b', 'c'])
array([0, 1, 2])
>>> le.transform(['a', 'd', 'e'])
[...]
ValueError: y contains new labels: ['d' 'e']

I'm curious to know why you don't think an option to map unknown values to a single extra class would make sense? It makes me think that I'm perhaps not using the encoders in the intended way, but it would seem that this is a pretty common use case, no? And if so, how do you typically deal with the issue of labels that are unknown (or never seen in training)?

On 16 July 2014 17:13, Arnaud Joly a.j...@ulg.ac.be wrote:
> [...]
Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15
cf. https://github.com/scikit-learn/scikit-learn/pull/3243

On 17 July 2014 08:59, Christian Jauvin cjau...@gmail.com wrote:
> [...]
Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15
Relevant to this: https://github.com/scikit-learn/scikit-learn/pull/3243

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
Web: http://www.bommaritollc.com
Mobile: +1 (646) 450-3387

On Wed, Jul 16, 2014 at 6:59 PM, Christian Jauvin cjau...@gmail.com wrote:
> [...]