Re: [Numpy-discussion] Matrix Class

2015-02-14 Thread cjw

On 14-Feb-15 11:35 AM, josef.p...@gmail.com wrote:
 On Wed, Feb 11, 2015 at 4:18 PM, Ryan Nelson rnelsonc...@gmail.com wrote:
 Colin,

 I currently use Py3.4 and Numpy 1.9.1. However, I built a quick test conda
 environment with Python2.7 and Numpy 1.7.0, and I get the same:

 
 Python 2.7.9 |Continuum Analytics, Inc.| (default, Dec 18 2014, 16:57:52) [MSC v.1500 64 bit (AMD64)]
 Type "copyright", "credits" or "license" for more information.

 IPython 2.3.1 -- An enhanced Interactive Python.
 Anaconda is brought to you by Continuum Analytics.
 Please check out: http://continuum.io/thanks and https://binstar.org
 ? - Introduction and overview of IPython's features.
 %quickref - Quick reference.
 help  - Python's own help system.
 object?   - Details about 'object', use 'object??' for extra details.

 In [1]: import numpy as np

 In [2]: np.__version__
 Out[2]: '1.7.0'

 In [3]: np.mat([4,'5',6])
 Out[3]:
 matrix([['4', '5', '6']],
 dtype='|S1')

 In [4]: np.mat([4,'5',6], dtype=int)
 Out[4]: matrix([[4, 5, 6]])
 ###
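
A minimal sketch of the coercion shown in the session above. It uses plain ndarrays instead of np.mat, which is an assumption made so that the snippet also runs on recent NumPy releases (where, as far as I know, the np.mat alias is no longer available). The variable names are illustrative only.

```python
import numpy as np

# Mixed int/str input: NumPy picks one common dtype, which here is a string type.
mixed = np.array([4, '5', 6])
print(mixed.dtype.kind)        # 'U' on Python 3, 'S' on Python 2

# An explicit dtype makes NumPy convert every element, including the string '5'.
as_int = np.array([4, '5', 6], dtype=int)
print(as_int)
```

The point is that the string result is ordinary dtype promotion rather than something specific to the matrix class.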

 As to your comment about coordinating with Statsmodels, you should see the
 links in the thread that Alan posted:
 http://permalink.gmane.org/gmane.comp.python.numeric.general/56516
 http://permalink.gmane.org/gmane.comp.python.numeric.general/56517
 Josef's comments at the time seem to echo the issues the devs (and others)
 have with the matrix class. Maybe things have changed with Statsmodels.
 Not changed, we have a strict policy against using np.matrix.

 Generic efficient versions for linear operators, Kronecker or sparse
 block matrix style operations would be useful, but I would use array
 semantics, similar to using dot or linalg functions on ndarrays.
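
As a rough illustration of what "array semantics" could look like in practice, here is a small sketch that uses only functions on plain ndarrays. The arrays A and b are made-up example data, not taken from statsmodels.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

Ab = np.dot(A, b)            # matrix-vector product via a function, not `*`
x = np.linalg.solve(A, b)    # solves A x = b without forming an inverse
K = np.kron(np.eye(2), A)    # Kronecker product: block-diagonal, two copies of A

print(Ab)
print(x)
print(K.shape)
```

Every result stays an ordinary ndarray, so `*` keeps its elementwise meaning throughout.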

 Josef
 (long reply canceled because I'm writing too much that might only be
 of tangential interest or has been in some of the matrix discussion
 before.)
Josef,

Many thanks.  I have gained the impression that there is some antipathy 
to np.matrix; perhaps this is because, as others have suggested, the 
array doesn't provide an appropriate framework.

Where are such policy decisions documented?  Numpy doesn't appear to 
have a BDFL.

I had read Alan's links back in February and now have a note of them.

Colin W.



 I know I mentioned Sage and SageMathCloud before. I'll just point out that
 there are folks that use this for real research problems, not just as a
 pedagogical tool. They have Matrix/vector/column_matrix classes that do what
 you were expecting from your problems posted above. Indeed below is a
 (truncated) cut and paste from a Sage Worksheet. (See
 http://www.sagemath.org/doc/tutorial/tour_linalg.html)
 ##
 In : Matrix([1,'2',3])
 Error in lines 1-1
 Traceback (most recent call last):
 TypeError: unable to find a common ring for all elements

 In : Matrix([[1,2,3],[4,5]])
 ValueError: List of rows is not valid (rows are wrong types or lengths)

 In : vector([1,2,3])
 (1, 2, 3)

 In : column_matrix([1,2,3])
 [1]
 [2]
 [3]
 ##

 Large portions of the custom code and wrappers in Sage are written in
 Python. I don't think their Matrix object is a subclass of ndarray, so
 perhaps you could strip out the Matrix stuff from here to make a separate
 project with just the Matrix stuff, if you don't want to go through the Sage
 interface.


 On Wed, Feb 11, 2015 at 11:54 AM, cjw c...@ncf.ca wrote:

 On 11-Feb-15 10:21 AM, Ryan Nelson wrote:

 So:

 In [2]: np.mat([4,'5',6])
 Out[2]:
 matrix([['4', '5', '6']], dtype='<U11')

 In [3]: np.mat([4,'5',6], dtype=int)
 Out[3]: matrix([[4, 5, 6]])

 Thanks Ryan,

 We are not singing from the same hymn book.

 Using PyScripter, I get:

 *** Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit
 (AMD64)] on win32. ***
 import numpy as np
 print('Numpy version: ', np.__version__)
 ('Numpy version: ', '1.9.0')
 Could you say which version you are using please?

 Colin W

 On Tue, Feb 10, 2015 at 5:07 PM, cjw c...@ncf.ca wrote:

 It seems to be agreed that there are weaknesses in the existing Numpy Matrix
 Class.

 Some problems are illustrated below.

 I'll try to put forward some suggestions over the coming weeks and would
 appreciate comments.

 Colin W.

 Test Script:

 from numpy import mat

 if __name__ == '__main__':
     a = mat([4, 5, 6])                 # Good
     print('a: ', a)
     b = mat([4, '5', 6])               # Not the expected result
     print('b: ', b)
     c = mat([[4, 5, 6], [7, 8]])       # Wrongly accepted as rectangular
     print('c: ', c)
     d = mat([[1, 2, 3]])
     try:
         d[0, 1] = 'b'                  # Correctly flagged, not numeric
     except ValueError:
         print("d[0, 1]= 'b' # Correctly flagged, not numeric", ' ValueError')
     print('d: ', d)

 Result:

 *** Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit
 (AMD64)] on win32. ***

 a:  [[4 5 6]]
 b:  [['4' '5' '6']]
 c:  [[[4, 5, 6] [7, 8]]]
 d[0, 1]= 'b' # Correctly flagged, not numeric  ValueError
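
For comparison, here is a sketch of the stricter behaviour the test script seems to be asking for, similar in spirit to the Sage errors quoted earlier. The helper name strict_matrix is hypothetical and is not part of NumPy; it assumes the desired policy is to reject non-numeric elements and ragged rows up front.

```python
import numbers
import numpy as np

def strict_matrix(rows):
    """Hypothetical helper: build a 2-D float array, rejecting bad input."""
    rows = [list(r) for r in rows]
    # Reject empty input and rows of unequal length (ragged input).
    if not rows or len({len(r) for r in rows}) != 1:
        raise ValueError("need at least one row, and all rows must have equal length")
    for r in rows:
        for v in r:
            # Strings such as '5' are refused instead of being silently coerced.
            if not isinstance(v, numbers.Number):
                raise TypeError("non-numeric element: %r" % (v,))
    return np.array(rows, dtype=float)

print(strict_matrix([[4, 5, 6]]))
```

With this policy, strict_matrix([[4, '5', 6]]) raises TypeError and strict_matrix([[4, 5, 6], [7, 8]]) raises ValueError.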
 

Re: [Numpy-discussion] Matrix Class

2015-02-14 Thread josef.pktd
On Sat, Feb 14, 2015 at 12:05 PM, cjw c...@ncf.ca wrote:

 Josef,

 Many thanks.  I have gained the impression that there is some antipathy to
 np.matrix; perhaps this is because, as others have suggested, the array
 doesn't provide an appropriate framework.

It's not directly antipathy, it's cost-benefit analysis.

np.matrix has few advantages, but makes reading and maintaining code
much more difficult.
Having to watch out for multiplication `*` is a lot of extra work.
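
A small sketch of why `*` needs watching: the same operator means different things depending on whether an object is an ndarray or an np.matrix. The example data is made up, and the warnings filter is only there because newer NumPy versions may warn that np.matrix is discouraged.

```python
import warnings
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # np.matrix may emit a deprecation-style warning
    m = np.matrix(a)
    matprod = m * m                   # np.matrix: `*` is the matrix product

elementwise = a * a                   # ndarray: `*` is the elementwise product

print(elementwise)
print(matprod)
```

A function that receives either type therefore cannot tell what `x * y` does without first checking the types of its arguments.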

Checking shapes and fixing bugs with unexpected dtypes is also a lot
of work, but we have large benefits.
For a long time the policy in statsmodels was to keep pandas out of
the core of functions (i.e. out of the actual calculations) and
restrict it to inputs and returns. However, pandas is becoming more
popular and can do some things much better than plain numpy, so it is
slowly moving inside some of our core calculations.
It's still an easy source of bugs, but we do gain something.

Benefits like these don't exist for np.matrix.


 Where are such policy decisions documented?  Numpy doesn't appear to have a
 BDFL.

In general it's a mix of mailing list discussions and discussion in
issues and PRs.
I'm not directly involved in numpy and don't subscribe to numpy's
GitHub notifications.

For scipy (and partially for statsmodels): I think large parts of the
policies for code and workflow are not explicitly specified, but are
more an understanding among maintainers and developers that can slowly
change over time and that builds up through spread-out discussion as
temporary consensus (or without strong objections).
scipy has a hacking text file to describe some of it, but I haven't
read it in ages.

(long term changes compared to 6 years ago: required code review and
required test coverage.)

Josef




Re: [Numpy-discussion] Matrix Class

2015-02-14 Thread Charles R Harris
On Sat, Feb 14, 2015 at 12:36 PM, josef.p...@gmail.com wrote:


 It's not directly antipathy, it's cost-benefit analysis.

 np.matrix has few advantages, but makes reading and maintaining code
 much more difficult.
 Having to watch out for multiplication `*` is a lot of extra work.

 Checking shapes and fixing bugs with unexpected dtypes is also a lot
 of work, but we have large benefits.
 For a long time the policy in statsmodels was to keep pandas out of
 the core of functions (i.e. out of the actual calculations) and
 restrict it to inputs and returns. However, pandas is becoming more
 popular and can do some things much better than plain numpy, so it is
 slowly moving inside some of our core calculations.
 It's still an easy source of bugs, but we do gain something.


Any bits of Pandas that might be good for numpy/scipy to steal?

snip

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Matrix Class

2015-02-14 Thread josef.pktd
On Sat, Feb 14, 2015 at 4:27 PM, Charles R Harris
charlesr.har...@gmail.com wrote:



 Any bits of Pandas that might be good for numpy/scipy to steal?

I'm not a Pandas expert.
Some of it comes into statsmodels because we need the data handling
also inside a function, e.g. keeping track of labels, indices, and so
on. Another reason is that contributors are more familiar with
pandas' way of solving a problem, even if I suspect numpy would be
more efficient.

However, a recent change replaces what I would have written with np.unique
by pandas.factorize, which is supposed to be faster.
https://github.com/statsmodels/statsmodels/pull/2213

Two or three years ago my numpy way of group handling (using
np.unique, bincount and similar) was still faster than the pandas
`apply` version; I'm not sure that's still true.
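
For readers unfamiliar with that idiom, here is a minimal sketch of group handling with np.unique and np.bincount. The labels and values are made-up data, and this is only one plausible reading of "np.unique, bincount and similar", not the statsmodels code itself.

```python
import numpy as np

labels = np.array(['b', 'a', 'b', 'c', 'a'])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# groups: sorted unique labels; codes: integer group index of each observation
groups, codes = np.unique(labels, return_inverse=True)

sums = np.bincount(codes, weights=values)   # per-group sum of values
counts = np.bincount(codes)                 # per-group number of observations
means = sums / counts

print(groups)
print(sums)
print(means)
```

Once the integer codes exist, each per-group statistic is a single vectorized call with no Python-level loop over groups.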


And to emphasize: all our heavy stuff, especially the big models, still
has only numpy and scipy inside (with the exception of one model
waiting in a PR).

Josef





Re: [Numpy-discussion] Matrix Class

2015-02-14 Thread Jaime Fernández del Río
On Sat, Feb 14, 2015 at 5:21 PM, josef.p...@gmail.com wrote:

 On Sat, Feb 14, 2015 at 4:27 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
 
 
 
  Any bits of Pandas that might be good for numpy/scipy to steal?

 I'm not a Pandas expert.
 Some of it comes into statsmodels because we need the data handling
 also inside a function, e.g. keeping track of labels, indices, and so
 on. Another reason is that contributors are more familiar with
 pandas' way of solving a problem, even if I suspect numpy would be
 more efficient.

 However, a recent change replaces what I would have written with np.unique
 by pandas.factorize, which is supposed to be faster.
 https://github.com/statsmodels/statsmodels/pull/2213


Numpy could use some form of hash table for its arraysetops, which is where
pandas is getting its advantage from. It is a tricky thing though, see e.g.
these timings:

import numpy as np
import pandas as pd

a = np.random.randint(10, size=1000)
srs = pd.Series(a)

%timeit np.unique(a)

10 loops, best of 3: 13.2 µs per loop

%timeit srs.unique()

10 loops, best of 3: 15.6 µs per loop


%timeit pd.factorize(a)

1 loops, best of 3: 25.6 µs per loop

%timeit np.unique(a, return_inverse=True)

1 loops, best of 3: 82.5 µs per loop

These last timings are with numpy 1.9.0 and pandas 0.14.0, so numpy doesn't
yet have https://github.com/numpy/numpy/pull/5012, which makes the operation
in which numpy is slower about 2x faster. And if you need your unique values
sorted, then things are more even, especially if numpy runs 2x faster:

%timeit pd.factorize(a, sort=True)

1 loops, best of 3: 36.4 µs per loop

The algorithms scale differently though, so for sufficiently large data
Pandas is almost certainly going to win. I'm not sure if they support all
dtypes, nor how efficient their use of memory is.
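
To make the scaling difference concrete, here is a toy sketch of the two approaches. np.unique is sort-based, which costs O(n log n); a hash table needs one pass with constant expected time per lookup. The function factorize_hash is hypothetical and uses a plain Python dict as a stand-in for a real hash table, so it shows the idea rather than the speed. It is neither pandas' nor numpy's implementation.

```python
import numpy as np

def factorize_hash(values):
    """Toy hash-based factorize; a dict stands in for a real hash table."""
    table = {}
    codes = np.empty(len(values), dtype=np.intp)
    for i, v in enumerate(values.tolist()):
        # A new value gets the next free code; a seen value reuses its code.
        codes[i] = table.setdefault(v, len(table))
    uniques = np.array(list(table))   # insertion order (dicts are ordered in Python 3.7+)
    return codes, uniques

a = np.array([3, 1, 3, 2, 1])

codes, uniques = factorize_hash(a)                        # order of first appearance
sorted_uniques, inverse = np.unique(a, return_inverse=True)  # sorted order

print(codes, uniques)
print(inverse, sorted_uniques)
```

Both results reconstruct the input by indexing the unique values with the codes. They differ only in ordering: the hash version lists values in order of first appearance, as pandas.factorize does by default, while np.unique returns them sorted.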

I did a toy implementation of a hash table, mimicking Python's dictionary,
for numpy some time