[Scikit-learn-general] Cython profiling question

2014-07-16 Thread Andy
Hey all.
A slightly off-topic question about cython profiling.
I'm pretty sure I could use yep for profiling, as mentioned in the docs: 
http://scikit-learn.org/dev/developers/performance.html#profiling-compiled-extensions
and get line-by-line counts.
However I did not manage to do that recently. I usually grabbed the gcc 
commands Cython generates, removed -DNDEBUG, and added -pg.
The docs also say I need debug builds of Python and NumPy, but I'm 
pretty sure I never needed that before.
Can anyone confirm or deny that the debug builds are actually needed? If I 
figure out a good way, I'm totally volunteering to add it to the docs ;)

Cheers,
Andy



Re: [Scikit-learn-general] Cython profiling question

2014-07-16 Thread Lars Buitinck
2014-07-16 16:43 GMT+02:00 Andy t3k...@gmail.com:
 I'm pretty sure I could use yep for profiling, as mentioned in the docs:
 http://scikit-learn.org/dev/developers/performance.html#profiling-compiled-extensions
 and get line-by-line counts.
 However I did not manage to do that recently. I usually grabbed the gcc
 commands Cython generates, removed -DNDEBUG, and added -pg.
 The docs also say I need debug builds of Python and NumPy, but I'm
 pretty sure I never needed that before.

sudo apt-get install python-{numpy,scipy}-dbg

I must admit I never did this; I use Google Perftools.
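
(For reference, the yep route from the docs boils down to something like the
sketch below; it is untested here, it assumes yep and google-perftools are
installed, and the KMeans call is just an arbitrary example of compiled
scikit-learn code.)

import numpy as np
import yep                         # pip install yep; wraps gperftools' libprofiler
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(10000, 50)

yep.start('kmeans.prof')           # turn on gperftools' CPU profiler
KMeans(n_clusters=10).fit(X)       # the compiled code we want to profile
yep.stop()

# Inspect the result from a shell, e.g. with the gperftools viewer:
#   google-pprof --text `which python` kmeans.prof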



Re: [Scikit-learn-general] Cython profiling question

2014-07-16 Thread Andy
On 07/16/2014 05:17 PM, Lars Buitinck wrote:
 sudo apt-get install python-{numpy,scipy}-dbg

 I must admit I never did this; I use Google Perftools.
That (yep) is using Google Perftools under the hood.
So you get line-by-line with Google Perftools without using debug 
versions? How?




Re: [Scikit-learn-general] Cython profiling question

2014-07-16 Thread Lars Buitinck
2014-07-16 17:29 GMT+02:00 Andy t3k...@gmail.com:
 That (yep) is using Google Perftools under the hood.

I thought you were referring to the bit about gprof.

 So you get line-by-line with Google Perftools without using debug
 versions? How?

I don't, I look at per-function cost.
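
(As an aside, per-function numbers for the Cython layers can also come from
plain cProfile, without a debug Python, provided the relevant .pyx files are
rebuilt with the Cython profile directive. A rough sketch; Lasso just stands
in for whatever Cython-backed code you care about:)

# In the .pyx file(s) of interest, before rebuilding:
#   # cython: profile=True

import cProfile
import pstats
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X, y = rng.rand(5000, 100), rng.rand(5000)

cProfile.run('Lasso(alpha=0.1).fit(X, y)', 'lasso.prof')
pstats.Stats('lasso.prof').sort_stats('cumulative').print_stats(20)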



Re: [Scikit-learn-general] Cython profiling question

2014-07-16 Thread Ronnie Ghose
No way to really get line-by-line without some sort of debug info, AFAIK; it
gets thrown away at compile time.
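
(That said, newer Cython versions have a linetrace directive that exposes
line-by-line timings to line_profiler without a debug Python, at the cost of
recompiling with tracing enabled and a big slowdown. A sketch, untested here;
_hot.pyx, _hot and fast_loop are made-up names:)

# _hot.pyx needs these directives at the top:
#   # cython: linetrace=True
#   # cython: binding=True
# and the extension must be built with the tracing macro, e.g. in setup.py:
#   Extension('_hot', ['_hot.pyx'], define_macros=[('CYTHON_TRACE', '1')])

import numpy as np
import line_profiler
import _hot                                 # the rebuilt extension module

profiler = line_profiler.LineProfiler(_hot.fast_loop)
profiler.runcall(_hot.fast_loop, np.arange(1e6))
profiler.print_stats()                      # per-line hits and timings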





[Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Christian Jauvin
Hi,

I have noticed a change in the behaviour of LabelBinarizer between version
0.15 and earlier releases.

Prior to 0.15, this worked:

>>> lb = LabelBinarizer()
>>> lb.fit_transform(['a', 'b', 'c'])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
>>> lb.transform(['a', 'd', 'e'])
array([[1, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

Note that the values 'd' and 'e', never seen while the LabelBinarizer was
being fitted, were simply mapped to [0, 0, 0] by transform. I interpreted
and used this as an "unknown" class, which is useful when your test data
can contain values that could not be known in advance, i.e. at training
time (and also to avoid data leakage while doing cross-validation).

With 0.15, the same code now gives the error:

[...]
ValueError: classes ['a' 'b' 'c'] missmatch with the labels ['a' 'd'
'e']found in the data

I wrote about this a couple of months ago, regarding a similar issue with
the LabelEncoder:

http://sourceforge.net/p/scikit-learn/mailman/message/31827616/

So, if my understanding of this mechanism is correct (as well as my
assumptions about how it is/should be used), would it make sense
to add something like a map_unknowns_to_single_class parameter
to all the preprocessing encoders, so that this behaviour is at least
available optionally?

Thanks,

Christian



Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Arnaud Joly
Hi

This looks like a regression. Can you open an issue on github?

I am not sure that it would make sense to add an "unknown" column /
label via an optional parameter. But you could easily add one with 
some numpy operations:

np.hstack([y, y.sum(axis=1,keepdims=True) == 0])
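
(Spelled out, and assuming the pre-0.15 behaviour where unseen labels come
back as all-zero rows, that would look roughly like this:)

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(['a', 'b', 'c'])
y = lb.transform(['a', 'd', 'e'])   # 'd' and 'e' -> all-zero rows in 0.14

# Append an "unknown" indicator column: 1 exactly where a row is all zeros.
y_ext = np.hstack([y, y.sum(axis=1, keepdims=True) == 0])
# array([[1, 0, 0, 0],
#        [0, 0, 0, 1],
#        [0, 0, 0, 1]])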

Best regards,
Arnaud




Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Christian Jauvin
I can open an issue, but on the other hand, you could argue that the
new behaviour is now at least consistent with the other encoder types,
e.g.:

>>> le = LabelEncoder()
>>> le.fit_transform(['a', 'b', 'c'])
array([0, 1, 2])
>>> le.transform(['a', 'd', 'e'])
[...]
ValueError: y contains new labels: ['d' 'e']

I'm curious why you don't think an option to map unknown values to a
single extra class would make sense. It makes me wonder whether I'm
perhaps not using the encoders in the intended way, but this seems like
a pretty common use case, no? And if so, how do you typically deal with
labels that are unknown (i.e. never seen during training)?
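
(For context, one common workaround is to reserve an explicit placeholder
label at fit time and map unseen values onto it before calling transform; a
sketch, with '<unknown>' as an arbitrary placeholder:)

import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['a', 'b', 'c', '<unknown>'])       # reserve a placeholder class

test = np.array(['a', 'd', 'e'])
test = np.where(np.in1d(test, le.classes_), test, '<unknown>')
le.transform(test)                         # array([1, 0, 0])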





Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Joel Nothman
cf. https://github.com/scikit-learn/scikit-learn/pull/3243




Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Michael Bommarito
Relevant to this:
https://github.com/scikit-learn/scikit-learn/pull/3243

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
Web: http://www.bommaritollc.com
Mobile: +1 (646) 450-3387

