[Numpy-discussion] ANN: pandas 0.11.0 released!

2013-04-23 Thread Wes McKinney
hi all,

We've released pandas 0.11.0, a big release that spans 3 months of
continuous development, led primarily by the intrepid Jeff Reback
and y-p. The release brings many new features, performance and
API improvements, bug fixes, and other goodies.

Some highlights:

- New precision indexing fields loc, iloc, at, and iat, to reduce
  occasional ambiguity in the hitherto catch-all ix method (see the
  short example below)
- Expanded support for NumPy data types in DataFrame
- NumExpr integration to accelerate evaluation of various operators
- New Cookbook and 10 minutes to pandas pages in the documentation by
Jeff Reback
- Improved DataFrame to CSV exporting performance
- Experimental rplot branch for faceted plots with matplotlib,
  merged and open for community hacking
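
A quick, illustrative sketch of the new indexers on a toy frame (the
frame and labels here are made up, not taken from the release notes):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2),
                  index=['a', 'b', 'c', 'd'], columns=['x', 'y'])

df.loc['b', 'x']   # purely label-based selection
df.iloc[1, 0]      # purely position-based selection
df.at['b', 'x']    # fast scalar access by label
df.iat[1, 0]       # fast scalar access by position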

Source archives and Windows installers are on PyPI. Thanks to all
who contributed to this release, especially Jeff and y-p.

What's new: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html
Installers: http://pypi.python.org/pypi/pandas

$ git log v0.10.1..v0.11.0 --pretty=format:%aN | sort | uniq -c | sort -rn
308 y-p
279 jreback
 85 Vytautas Jancauskas
 74 Wes McKinney
 25 Stephen Lin
 22 Andy Hayden
 19 Chang She
 13 Wouter Overmeire
  8 Spencer Lyon
  6 Phillip Cloud
  6 Nicholaus E. Halecky
  5 Thierry Moisan
  5 Skipper Seabold
  4 waitingkuo
  4 Loïc Estève
  4 Jeff Reback
  4 Garrett Drapala
  4 Alvaro Tejero-Cantero
  3 lexual
  3 Dražen Lučanin
  3 dieterv77
  3 dengemann
  3 Dan Birken
  3 Adam Greenhall
  2 Will Furnass
  2 Vytautas Jančauskas
  2 Robert Gieseke
  2 Peter Prettenhofer
  2 Jonathan Chambers
  2 Dieter Vandenbussche
  2 Damien Garaud
  2 Christopher Whelan
  2 Chapman Siu
  2 Brad Buran
  1 vytas
  1 Tim Akinbo
  1 Thomas Kluyver
  1 thauck
  1 stephenwlin
  1 K.-Michael Aye
  1 Karmel Allison
  1 Jeremy Wagner
  1 James Casbon
  1 Illia Polosukhin
  1 Dražen Lučanin
  1 davidjameshumphreys
  1 Dan Davison
  1 Chris Withers
  1 Christian Geier
  1 anomrake

Happy data hacking!

- Wes

What is it
==========
pandas is a Python package providing fast, flexible, and
expressive data structures designed to make working with
relational, time series, or any other kind of labeled data both
easy and intuitive. It aims to be the fundamental high-level
building block for doing practical, real world data analysis in
Python.

Links
=====
Release Notes: http://github.com/pydata/pandas/blob/master/RELEASE.rst
Documentation: http://pandas.pydata.org
Installers: http://pypi.python.org/pypi/pandas
Code Repository: http://github.com/pydata/pandas
Mailing List: http://groups.google.com/group/pydata
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: pandas 0.10.1 is released

2013-01-22 Thread Wes McKinney
hi all,

We've released pandas 0.10.1 which includes many bug fixes from
0.10.0 (including a number of issues with the new file parser,
e.g. reading multiple files in separate threads), various
performance improvements, and major new PyTables/HDF5-based
functionality contributed by Jeff Reback. I strongly recommend
that all users upgrade.

Thanks to all who contributed to this release, especially Chang
She, Jeff Reback, and Yoval P.

As always source archives and Windows installers are on PyPI.

What's new: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html
Installers: http://pypi.python.org/pypi/pandas

$ git log v0.10.0..v0.10.1 --pretty=format:%aN | sort | uniq -c | sort -rn
 66 jreback
 59 Wes McKinney
 43 Chang She
 12 y-p
  5 Vincent Arel-Bundock
  4 Damien Garaud
  3 Christopher Whelan
  3 Andy Hayden
  2 Jay Parlar
  2 Dan Allan
  1 Thouis (Ray) Jones
  1 svaksha
  1 herrfz
  1 Garrett Drapala
  1 elpres
  1 Dieter Vandenbussche
  1 Anton I. Sipos

Happy data hacking!

- Wes

What is it
==========
pandas is a Python package providing fast, flexible, and
expressive data structures designed to make working with
relational, time series, or any other kind of labeled data both
easy and intuitive. It aims to be the fundamental high-level
building block for doing practical, real world data analysis in
Python.

Links
=====
Release Notes: http://github.com/pydata/pandas/blob/master/RELEASE.rst
Documentation: http://pandas.pydata.org
Installers: http://pypi.python.org/pypi/pandas
Code Repository: http://github.com/pydata/pandas
Mailing List: http://groups.google.com/group/pydata
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: pandas 0.10.0 released

2012-12-17 Thread Wes McKinney
hi all,

I'm super excited to announce the pandas 0.10.0 release. This is
a major release including a new high performance file reading
engine with tons of new user-facing functionality as well, a
bunch of work on the HDF5/PyTables integration layer,
much-expanded Unicode support, a new option/configuration
interface, integration with the Google Analytics API, and a wide
array of other new features, bug fixes, and performance
improvements. I strongly recommend that all users get upgraded as
soon as feasible. Many of the performance improvements are quite
substantial relative to 0.9.x; see the vbenchmarks at the end of this e-mail.

As of this release, we are no longer supporting Python 2.5. Also,
this is the first release to officially support Python 3.3.

Note: there are a number of minor, but necessary API changes that
long-time pandas users should pay attention to in the What's New.

Thanks to all who contributed to this release, especially Chang
She, Yoval P, and Jeff Reback (and everyone else listed in the
commit log!).

As always source archives and Windows installers are on PyPI.

What's new: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html
Installers: http://pypi.python.org/pypi/pandas

$ git log v0.9.1..v0.10.0 --pretty=format:%aN | sort | uniq -c | sort -rn
246 Wes McKinney
140 y-p
 99 Chang She
 45 jreback
 18 Abraham Flaxman
 17 Jeff Reback
 14 locojaydev
 11 Keith Hughitt
  5 Adam Obeng
  2 Dieter Vandenbussche
  1 zach powers
  1 Luke Lee
  1 Laurent Gautier
  1 Ken Van Haren
  1 Jay Bourque
  1 Donald Curtis
  1 Chris Mulligan
  1 alex arsenovic
  1 A. Flaxman

Happy data hacking!

- Wes

What is it
==========
pandas is a Python package providing fast, flexible, and
expressive data structures designed to make working with
relational, time series, or any other kind of labeled data both
easy and intuitive. It aims to be the fundamental high-level
building block for doing practical, real world data analysis in
Python.

Links
=====
Release Notes: http://github.com/pydata/pandas/blob/master/RELEASE.rst
Documentation: http://pandas.pydata.org
Installers: http://pypi.python.org/pypi/pandas
Code Repository: http://github.com/pydata/pandas
Mailing List: http://groups.google.com/group/pydata

Performance vs. v0.9.0
======================

Benchmarks from https://github.com/pydata/pandas/tree/master/vb_suite
Ratio < 1 means that v0.10.0 is faster

                                         v0.10.0     v0.9.0   ratio
name
unstack_sparse_keyspace                   1.2813   144.1262  0.0089
groupby_frame_apply_overhead             20.1520   337.3330  0.0597
read_csv_comment2                        25.3097   363.2860  0.0697
groupby_frame_apply                      75.1554   504.1661  0.1491
frame_iteritems_cached                    0.0711     0.3919  0.1815
read_csv_thou_vb                         35.2690   191.9360  0.1838
concat_small_frames                      12.9019    55.3561  0.2331
join_dataframe_integer_2key               5.8184    21.5823  0.2696
series_value_counts_strings               5.3824    19.1262  0.2814
append_frame_single_homogenous            0.3413     0.9319  0.3662
read_csv_vb                              18.4084    46.9500  0.3921
read_csv_standard                        12.0651    29.9940  0.4023
panel_from_dict_all_different_indexes    73.6860   158.2949  0.4655
frame_constructor_ndarray                 0.0471     0.0958  0.4918
groupby_first                             3.8502     7.1988  0.5348
groupby_last                              3.6962     6.7792  0.5452
panel_from_dict_two_different_indexes    50.7428    86.4980  0.5866
append_frame_single_mixed                 1.2950     2.1930  0.5905
frame_get_numeric_data                    0.0695     0.1119  0.6212
replace_fillna                            4.6349     7.0540  0.6571
frame_to_csv                            281.9340   427.7921  0.6590
replace_replacena                         4.7154     7.1207  0.6622
frame_iteritems                           2.5862     3.7463  0.6903
series_align_int64_index                 29.7370    41.2791  0.7204
join_dataframe_integer_key                1.7980     2.4303  0.7398
groupby_multi_size                       31.0066    41.7001  0.7436
groupby_frame_singlekey_integer           2.3579     3.1649  0.7450
write_csv_standard                      326.8259   427.3241  0.7648
groupby_simple_compress_timing           41.2113    52.3993  0.7865
frame_fillna_inplace                     16.2843    20.0491  0.8122
reindex_fillna_backfill                   0.1364     0.1667  0.8181
groupby_multi_series_op                  15.2914    18.6651  0.8193
groupby_multi_cython                     17.2169    20.4420  0.8422
frame_fillna_many_columns_pad

Re: [Numpy-discussion] memory-efficient loadtxt

2012-10-03 Thread Wes McKinney
On Monday, October 1, 2012, Chris Barker wrote:

 Paul,

 Nice to see someone working on these issues, but:

 I'm not sure what problem you are trying to solve -- accumulating in a
 list is pretty efficient anyway -- not a whole lot of overhead.

 But if you do want to improve that, it may be better to change the
 accumulating method, rather than doing the double-read thing. I've
 written, and posted here, code that provides an Accumulator that uses
 numpy internally, so not much memory overhead. In the end, it's not
 any faster than accumulating in a list and then converting to an
 array, but it does use less memory.

 I also have a Cython version that is not quite done (darn regular job
 getting in the way) that is both faster and more memory efficient.

 Also, frankly, just writing the array pre-allocation and re-sizing
 code into loadtxt would not be a whole lot of code either, and would
 be both fast and memory efficient.

 Let me know if you want any of my code to play with.

   However, I got the impression that someone was
  working on a More Advanced (TM) C-based file reader, which will
  replace loadtxt;

 yes -- I wonder what happened with that? Anyone?

 -CHB



 this patch is intended as a useful thing to have
  while we're waiting for that to appear.
 
  The patch passes all tests in the test suite, and documentation for
  the kwarg has been added. I've modified all tests to include the
  seekable kwarg, but that was mostly to check that all tests still pass
  with this kwarg. I guess it's a bit too late for 1.7.0 though?
 
  Should I make a pull request? I'm happy to take any and all
  suggestions before I do.
 
  Cheers
  Paul
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion



 --

 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR(206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115   (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I've finally built a new, very fast C-based tokenizer/parser with type
inference, NA-handling, etc. for pandas sporadically over the last month--
it's almost ready to ship. It's roughly an order of magnitude faster than
loadtxt and uses very little temporary space. Should be easy to push
upstream into NumPy to replace the innards of np.loadtxt if I can get a bit
of help with the plumbing (it already yields structured arrays in addition
to pandas DataFrames so there isn't a great deal that needs doing).

Blog post with CPU and memory benchmarks to follow-- will post a link here.
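
The pre-allocation and resizing approach Chris describes above can be
sketched roughly as follows (an illustrative sketch, not the code he
posted; the class name is made up):

import numpy as np

class Accumulator(object):
    """Accumulate scalars into a numpy array that grows geometrically."""

    def __init__(self, dtype=np.float64, capacity=1024):
        self._data = np.empty(capacity, dtype=dtype)
        self._size = 0

    def append(self, value):
        if self._size == len(self._data):
            # double the buffer when full; appends stay amortized O(1)
            bigger = np.empty(2 * len(self._data), dtype=self._data.dtype)
            bigger[:self._size] = self._data[:self._size]
            self._data = bigger
        self._data[self._size] = value
        self._size += 1

    def result(self):
        # view of the filled portion only
        return self._data[:self._size]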

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy 1.7 release delays

2012-06-26 Thread Wes McKinney
On Tue, Jun 26, 2012 at 8:15 PM, Skipper Seabold jsseab...@gmail.com wrote:
 On Tue, Jun 26, 2012 at 7:59 PM, Fernando Perez fperez@gmail.com wrote:
 On Tue, Jun 26, 2012 at 1:10 PM, Travis Oliphant tra...@continuum.io wrote:
 One issues is the one that Sage identified about the array interface
 regression as noted by Jason.    Any other regressions from 1.5.x need to be
 addressed as well.    We'll have to decide on a case-by-case basis if there
 are issues that conflict with 1.6.x behavior.


 One thing this discussion made me think about, is that it would be
 great to identify a few key projects that:

 - use numpy heavily
 - have reasonably solid test suites

 and create a special build job that runs *those* test suites
 periodically.  Not necessarily on every last numpy commit, but at
 least on a reasonable schedule.

 I think having that kind of information readily available, and with
 the ability to switch which numpy branch/commit those tests do get run
 against, could be very valuable as an early warning system for numpy
 to know if an apparently inconsequential change has unexpected side
 effects downstream.

 In IPython we've really benefited greatly from our improved CI
 infrastructure, but that only goes as far as catching *our own*
 problems.  This kind of downstream integration testing could be very
 useful.


 +1. Was thinking the same thing.

 My uninformed opinion from the sidelines: For me, this begged the
 question of why projects would wait so long and be upgrading 1.5.x ->
 1.7.x. It sounded to me like an outreach problem. The whole point of
 having release candidates is so that downstream users (and especially
 big public downstream libraries) can test the release candidate and
 give feedback on any changes that affect them. This feedback step is
 especially crucial for a project without 100% test coverage (for new
 code and old)... Putting more restrictions on changes that can be made
 in releases doesn't seem to me to be the right fix, though,
 admittedly, numpy is a bit of a different beast than other projects. I
 would think you would want downstream projects not to wait 2 years to
 upgrade and skip a couple of minor releases.

 Skipper
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

+1. We've begun running pandas's test suite internally against NumPy
git master on Jenkins. It has already turned up bugs and behavior
changes in a few short weeks. We should definitely do this on a more
grand scale (especially since pandas 0.8.0 is now littered with hacks
around NumPy 1.6 datetime bugs; fortunately nothing was fatally broken
but it came close).

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Created NumPy 1.7.x branch

2012-06-21 Thread Wes McKinney
On Thu, Jun 21, 2012 at 2:49 PM, Ralf Gommers
ralf.gomm...@googlemail.com wrote:


 On Thu, Jun 21, 2012 at 5:25 PM, Travis Oliphant tra...@continuum.io
 wrote:

 I thought it was clear we were doing a 1.7 release before SciPy.   It
 seems pretty urgent that we get something out sooner than later.      I know
 there is never enough time to do all the things we want to do.

 There is time before the first Release candidate to make changes on the
 1.7.x branch.   If you want to make the changes on master, and just indicate
 the Pull requests, Ondrej can make sure they are added to the 1.7.x. branch
 by Monday.    We can also delay the first Release Candidate by a few days to
 next Wednesday and then bump everything 3 days if that will help.     There
 will be a follow-on 1.8 release before the end of the year --- so there is
 time to make changes for that release as well.    The next release will not
 take a year to get out, so we shouldn't feel pressured to get *everything*
 in this release.


 What about http://projects.scipy.org/numpy/ticket/2108? Someone needs to at
 least answer the question of how much of datetime is unusable on Windows
 with the current code. If that's not a lot then perhaps this is not a
 blocker, but we did consider it one until now.

pandas has become a heavy consumer of datetime64 recently, and we
haven't had any issues using VS2003 and VS2008, but haven't tested
heavily against NumPy compiled with mingw outside of the version
shipped in Enthought Python Distribution (the test suite passes fine,
last time I checked).

 Of the other tickets (http://projects.scipy.org/numpy/report/3) it would
 also be good to get an assessment of which ones are critical. Perhaps none
 of them are and the branch is in good shape for a release, but some of those
 segfaults would be nice to have fixed. Debian multi-arch support too, as
 discussed on this list recently.

 Ralf


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-17 Thread Wes McKinney
On Sun, Jun 17, 2012 at 6:10 AM, Nathaniel Smith n...@pobox.com wrote:
 On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney wesmck...@gmail.com wrote:
 It looks like the levels can only be strings. This is too limited for
 my needs. Why not support all possible NumPy dtypes? In pandas world,
 the levels can be any unique Index object

 It seems like there are three obvious options, from most to least general:

 1) Allow levels to be an arbitrary collection of hashable Python objects
 2) Allow levels to be a homogenous collection of objects of any
 arbitrary numpy dtype
 3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)

 I agree that (3) is a bit limiting. (1) is probably easier to
 implement than (2). (2) is the most general, since of course
 "arbitrary Python object" is a dtype. Is it useful to be able to
 restrict levels to be of homogenous type? The main difference between
 dtypes and python types is that (most) dtype scalars can be unboxed --
 is that substantively useful for levels?

 What is the story for NA values (NaL?) in a factor array? I code them
 as -1 in the labels, though you could use INT32_MAX or something. This
 is very important in the context of groupby operations.

 If we have a type restriction on levels (options (2) or (3) above),
 then how to handle out-of-bounds values is quite a problem, yeah. Once
 we have NA dtypes then I suppose we could use those, but we don't yet.
 It's tempting to just error out of any operation that encounters such
 values.

 Nathaniel: my experience (see blog posting above for a bit more) is
 that khash really crushes PyDict for two reasons: you can use it with
 primitive types and avoid boxing, and secondly you can preallocate.
 Its memory footprint with large hashtables is also a fraction of
 PyDict. The Python memory allocator is not problematic-- if you create
 millions of Python objects expect the RAM usage of the Python process
 to balloon absurdly.

 Right, I saw that posting -- it's clear that khash has a lot of
 advantages as internal temporary storage for a specific operation like
 groupby on unboxed types. But I can't tell whether those arguments
 still apply now that we're talking about a long-term storage
 representation for data that has to support a variety of operations
 (many of which would require boxing/unboxing, since the API is in
 Python), might or might not use boxed types, etc. Obviously this also
 depends on which of the three options above we go with -- unboxing
 doesn't even make sense for option (1).

 -n
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

I'm in favor of option #2 (a lite version of what I'm doing
currently -- I handle a few dtypes: PyObject, int64, datetime64,
float64), though you'd have to go the code-generation route for all
the dtypes to keep yourself sane if you do that.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Wes McKinney
On Wed, Jun 13, 2012 at 2:12 PM, Nathaniel Smith n...@pobox.com wrote:
 On Wed, Jun 13, 2012 at 5:44 PM, Bryan Van de Ven bry...@continuum.io wrote:
 On 6/13/12 8:33 AM, Nathaniel Smith wrote:
 Hi Bryan,

 I skimmed over the diff:
     https://github.com/bryevdv/numpy/compare/master...enum
 It was a bit hard to read since it seems like about half the changes
 in that branch are datetime cleanups or something? I hope you'll
 separate those out -- it's much easier to review self-contained
 changes, and the more changes you roll together into a big lump, the
 more risk there is that they'll get lost all together.

 I'm not quite sure what happened there, my git skills are not advanced
 by any measure. I think the datetime changes are a much smaller fraction
 than fifty percent, but I will see what I can do to separate them out in
 the near future.

 Looking again, it looks like a lot of it is actually because when I
 asked github to show me the diff between your branch and master, it
 showed me the diff between your branch and *your repository's* version
 of master. And your branch is actually based off a newer version of
 'master' than you have in your repository. So, as far as git and
 github are concerned, all those changes that are included in
 your-branch's-base-master but not in your-repo's-master are new stuff
 that you did on your branch. Solution is just to do
  git push <your github remote name> master

  From the updated NEP I actually understand the use case for open
 types now, so that's good :-). But I don't think they're actually
 workable, so that's bad :-(. The use case, as I understand it, is for
 when you want to extend the levels set on the fly as you read through
 a file. The problem with this is that it produces a non-deterministic
 level ordering, where level 0 is whatever was seen first in the file,
 level 1 is whatever was seen second, etc. E.g., say I have a CSV file
 I read in:

      subject,initial_skill,skill_after_training
      1,LOW,HIGH
      2,LOW,LOW
      3,HIGH,HIGH
      ...

 With the scheme described in the NEP, my initial_skill dtype will have
 levels ["LOW", "HIGH"], and my skill_after_training dtype will have
 levels ["HIGH", "LOW"], which means that their storage will be
 incompatible, comparisons won't work (or will have to go through some

 I imagine users using the same open dtype object in both fields of the
 structure dtype used to read in the file, if both fields of the file
 contain the same categories. If they don't contain the same categories,
 they are incomparable in any case. I believe many users have this
 simpler use case where each field is a separate category, and they want
 to read them all individually, separately on the fly.  For these simple
 cases, it would just work. For your case example there would
 definitely be a documentation, examples, tutorials, education issue, to
 avoid the gotcha you describe.

 Yes, of course we *could* write the code to implement these open
 dtypes, and then write the documentation, examples, tutorials, etc. to
 help people work around their limitations. Or, we could just implement
 np.fromfile properly, which would require no workarounds and take less
 code to boot.

 nasty convert-to-string-and-back path), etc. Another situation where
 this will occur is if you have multiple data files in the same format;
 whether or not you're able to compare the data from them will depend
 on the order the data happens to occur in in each file. The solution
 is that whenever we automagically create a set of levels from some
 data, and the user hasn't specified any order, we should pick an order
 deterministically by sorting the levels. (This is also what R does.
 levels(factor(c("a", "b"))) -> "a", "b". levels(factor(c("b", "a")))
 -> "a", "b".)

 A solution is to create the dtype object when reading in the first file,
 and to reuse that same dtype object when reading in subsequent files.
 Perhaps it's not ideal, but it does enable the work to be done.

 So would a proper implementation of np.fromfile that normalized the
 level ordering.

 Can you explain why you're using khash instead of PyDict? It seems to
 add a *lot* of complexity -- like it seems like you're using about as
 many lines of code just marshalling data into and out of the khash as
 I used for my old npenum.pyx prototype (not even counting all the
 extra work required to , and AFAICT my prototype has about the same
 amount of functionality as this. (Of course that's not entirely fair,
 because I was working in Cython... but why not work in Cython?) And
 you'll need to expose a Python dict interface sooner or later anyway,
 I'd think?

 I suppose I agree with the sentiment that the core of NumPy really ought
 to be less dependent on the Python C API, not more. I also think the
 khash API is pretty dead simple and straightforward, and the fact that
 it is contained in a singe header is attractive.  It's also quite
 performant in time and space. But if others disagree strongly, all of
 it's 

Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Wes McKinney
On Wed, Jun 13, 2012 at 5:19 PM, Bryan Van de Ven bry...@continuum.io wrote:
 On 6/13/12 1:54 PM, Wes McKinney wrote:
 OK, I need to spend some time on this as it will directly impact me.
 Random thoughts here.

 It looks like the levels can only be strings. This is too limited for
 my needs. Why not support all possible NumPy dtypes? In pandas world,
 the levels can be any unique Index object (note, I'm going to change
 the name of the Factor class to Categorical before 0.8.0 final per
 discussion with Nathaniel):

 The current for-discussion prototype currently only supports strings. I
 had mentioned integral levels in the NEP but wanted to get more feedback
 first. It looks like you are using intervals as levels in things like
 qcut? This would add some complexity. I can think of a couple of
 possible approaches I will have to try a few of them out to see what
 would make the most sense.

 The API for constructing an enum/factor/categorical array from fixed
 levels and an array of labels seems somewhat weak to me. A very common
 scenario is to need to construct a factor from an array of integers
 with an associated array of levels:


 In [13]: labels
 Out[13]:
 array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
         1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
         0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
         0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
         7, 9, 7, 3, 3, 0, 4, 4])

 In [14]: levels
 Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

 In [15]: Factor(labels, levels)
 Out[15]:
 Factor:
 array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
         1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
         0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
         0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
         7, 9, 7, 3, 3, 0, 4, 4])
 Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

 I originally had a very similar interface in the NEP. I was persuaded by
 Mark that this would be redundant:

 In [10]: levels = np.factor(['a', 'b', 'c'])   # or levels =
 np.factor_array(['a', 'b', 'c', 'a', 'b']).dtype
 In [11]: np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], levels)
 Out[11]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
 dtype='factor({'c': 2, 'a': 0, 'b': 1})')

 This should also spell even more closely to your example as:

 labels.astype(levels)

 but I have not done much with casting yet, so this currently complains.
 However, would this satisfy your needs (modulo the separate question
 about more general integral or object levels)?

 What is the story for NA values (NaL?) in a factor array? I code them
 as -1 in the labels, though you could use INT32_MAX or something. This
 is very important in the context of groupby operations.
 I am just using INT32_MIN at the moment.
 Are the levels ordered (Nathaniel brought this up already looks like)?
 It doesn't look like it. That is also necessary. You also need to be

 They currently compare based on their value:

 In [20]: arr = np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
 np.factor({'c':0, 'b':1, 'a':2}))
 In [21]: arr
 Out[21]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
 dtype='factor({'c': 0, 'a': 2, 'b': 1})')
 In [22]: arr.sort()
 In [23]: arr
 Out[23]: array(['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
 dtype='factor({'c': 0, 'a': 2, 'b': 1})')


 able to sort the levels (which is a relabeling, I have lots of code in
 use for this). In the context of groupby in pandas, when processing a
 key (array of values) to a factor to be used for aggregating some
 data, you have the option of returning an object that has the levels
 as observed in the data or sorting. Sorting can obviously be very
 expensive depending on the number of groups in the data
 (http://wesmckinney.com/blog/?p=437). Example:

 from pandas import DataFrame
 from pandas.util.testing import rands
 import numpy as np

 df = DataFrame({'key' : [rands(10) for _ in xrange(10)] * 10,
             'data' : np.random.randn(100)})

 In [32]: timeit df.groupby('key').sum()
 1 loops, best of 3: 374 ms per loop

 In [33]: timeit df.groupby('key', sort=False).sum()
 10 loops, best of 3: 185 ms per loop

 The factorization time for the `key` column dominates the runtime;
 the factor is computed once then reused if you keep the GroupBy object
 around:

 In [36]: timeit grouped.sum()
 100 loops, best of 3: 6.05 ms per loop
 Just some numbers for comparison. Factorization times:

 In [41]: lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
 In [42]: levels = np.factor(lets)
 In [43]: data = [lets[int(x)] for x in np.random.randn(100)]
 In [44]: %timeit np.array(data, levels)
 10 loops, best of 3: 137 ms per loop

 And retrieving group indicies/summing:

 In [8]: %timeit arr=='a'
 1000 loops, best of 3: 1.52 ms per loop

Re: [Numpy-discussion] Status of np.bincount

2012-05-03 Thread Wes McKinney
On Thu, May 3, 2012 at 12:51 PM, Tony Yu tsy...@gmail.com wrote:


 On Thu, May 3, 2012 at 9:57 AM, Robert Kern robert.k...@gmail.com wrote:

 On Thu, May 3, 2012 at 2:50 PM, Robert Elsner ml...@re-factory.de wrote:
 
  Am 03.05.2012 15:45, schrieb Robert Kern:
  On Thu, May 3, 2012 at 2:24 PM, Robert Elsner ml...@re-factory.de
  wrote:
  Hello Everybody,
 
  is there any news on the status of np.bincount with respect to big
  numbers? It seems I have just been bitten by #225. Is there an
  efficient
  way around? I found the np.histogram function painfully slow.
 
  Below a simple script, that demonstrates bincount failing with a
  memory
  error on big numbers
 
  import numpy as np
 
  x = np.array((30e9,)).astype(int)
  np.bincount(x)
 
 
  Any good idea how to work around it. My arrays contain somewhat 50M
  entries in the range from 0 to 30e9. And I would like to have them
  bincounted...
 
  You need a sparse data structure, then. Are you sure you even have
  duplicates?
 
  Anyways, I won't work out all of the details, but let me sketch
  something that might get you your answers. First, sort your array.
  Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1,len(x))
  to find the indices where each sorted value changes over to the next.
  The np.diff() of that should give you the size of each. Use np.unique
  to get the sorted unique values to match up with those sizes.
 
  Fixing all of the off-by-one errors and dealing with the boundary
  conditions correctly is left as an exercise for the reader.
 
 
  ?? I suspect that this mail was meant to end up in the thread about
  sparse array data?

 No, I am responding to you.


 Hi Robert (Elsner),

 Just to expand a bit on Robert Kern's explanation: Your problem is only
 partly related to Ticket #225. Even if that is fixed, you won't be able to
 call `bincount` with an array containing `30e9` unless you implement
 something using sparse arrays because `bincount` wants return an array
 that's `30e9 + 1` in length, which isn't going to happen.

 -Tony

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


hi Robert,

I suggest you try the value_counts instance method on pandas.Series:


In [9]: ints = np.random.randint(0, 30e9, size=100000)

In [10]: all_ints = Series(ints.repeat(500))

In [11]: all_ints.value_counts()
Out[11]:
16420382874    500
7147863689     500
4019588415     500
17462388002    500
11956087699    500
1498988        500
3811318398     500
6333517765     500
16077665866    500
17559759721    500
5898309082     500
25213150655    500
17877388690    500
3122117900     500
6242860212     500
...
6344689036     500
16817048573    500
16361777055    500
4376828961     500
15910505187    500
12051499627    500
23857610954    500
24557975709    500
28135006018    500
1661624653     500
6747702840     500
24601775145    500
7290769930     500
9417075109     500
12071596222    500
Length: 100000

This method uses a C hash table and takes about 1 second to compute
the bin counts for 50mm entries and 100k unique values.
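
For reference, the pure-NumPy sort-and-diff approach Robert Kern
sketched above might look roughly like this (an illustrative sketch,
assuming a reasonably recent NumPy; not tuned or tested at the
50M-entry scale):

import numpy as np

def sparse_bincount(x):
    # Return (unique_values, counts) without allocating a dense bin array.
    x = np.sort(x)
    # positions where the sorted value changes, i.e. starts of new runs
    starts = np.nonzero(np.not_equal(x[:-1], x[1:]))[0] + 1
    edges = np.concatenate(([0], starts, [len(x)]))
    counts = np.diff(edges)
    uniques = x[edges[:-1]]
    return uniques, counts

ints = np.random.randint(0, int(30e9), size=1000, dtype=np.int64)
uniques, counts = sparse_bincount(ints.repeat(5))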

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Issue Tracking

2012-05-02 Thread Wes McKinney
On Wed, May 2, 2012 at 9:48 AM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Tue, May 1, 2012 at 11:47 PM, Ralf Gommers ralf.gomm...@googlemail.com
 wrote:



 On Wed, May 2, 2012 at 1:48 AM, Pauli Virtanen p...@iki.fi wrote:

 01.05.2012 21:34, Ralf Gommers wrote:
 [clip]
  At this point it's probably good to look again at the problems we want
  to solve:
  1. responsive user interface (must absolutely have)

 Now that it comes too late: with some luck, I've possibly hit on what
 was ailing the Tracs (max_diff_bytes configured too large). Let's see if
 things work better from now on...


 That's amazing - not only does it not give errors anymore, it's also an
 order of magnitude faster.


 So maybe we could just stick with trac. Performance was really the sticking
 point.

 Chuck


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


FWIW I'm pretty strongly in favor of GHI for NumPy/SciPy (I am going
to get involved in NumPy dev eventually, promise). While warty in some
of the places already mentioned, I have found it to be very
low-friction and low-annoyance in my own dev process (nearing 1000
issues closed in the last year in pandas). But there are fewer cooks
in the kitchen with pandas so perhaps this experience wouldn't be
identical with NumPy. The biggest benefit I've seen is community
involvement that you really wouldn't see if I were using a Trac or
something else hosted elsewhere. Users are on GitHub and for some
reason it gives people a feeling of engagement in the open source process
that I don't see anywhere else.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] datetime dtype possible regression

2012-04-29 Thread Wes McKinney
On Sat, Apr 28, 2012 at 11:18 AM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Sat, Apr 28, 2012 at 9:13 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Fri, Apr 27, 2012 at 4:57 PM, Robert Kern robert.k...@gmail.com
 wrote:
  On Fri, Apr 27, 2012 at 21:52, Travis Vaught tra...@vaught.net wrote:
  With NumPy 1.6.1 (from EPD 7.2-2) I get this behavior:
 
 
  ~
 
  In [1]: import numpy as np
 
  In [2]: schema = np.dtype({'names':['symbol', 'date', 'open', 'high',
  'low',
     ...:                        'close', 'volume', 'adjclose'],
     ...:                    'formats':['S8', 'M8', float, float, float,
  float,
     ...:                        float, float]})
 
  In [3]: data = [("AAPL", "2012-04-12", 600.0, 605.0, 598.0, 602.0,
  5000,
  602.0),]
 
  In [4]: recdata = np.array(data, dtype=schema)
 
  In [5]: recdata
  Out[5]:
  array([ ('AAPL', datetime.datetime(2012, 4, 12, 0, 0), 600.0, 605.0,
  598.0,
  602.0, 5000.0, 602.0)],
        dtype=[('symbol', '|S8'), ('date', ('M8[us]', {})), ('open',
  'f8'),
  ('high', 'f8'), ('low', 'f8'), ('close', 'f8'), ('volume', 'f8'),
  ('adjclose', 'f8')])
 
 
  
 
 
  With numpy-1.7.0.dev_3cb783e I get this:
 
  import numpy as np
 
  schema =
 
  np.dtype({'names':['symbol','data','open','high','low','close','volume','adjclose'],
  'formats':['S8','M8',float,float,float,float,float,float]})
 
  data =  [("AAPL", "2012-04-12", 600.0, 605.0, 598.0, 602.0,
  5000,
  602.0),]
 
  recdata = np.array(data, dtype=schema)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  ValueError: Cannot create a NumPy datetime other than NaT with generic
  units
 
  np.version.version
  '1.7.0.dev-3cb783e'
 
  
 
  Any hints about a regression I can check for? Or perhaps I missed an
  api
  change for specifying datetime dtypes?
 
  Judging from the error message, it looks like an intentional API change.
 
  --
  Robert Kern
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion

 Maybe this should be raised as a bug (where do we report NumPy bugs
 these days, still Trac?). As I'm moving to datetime64 in pandas if
 NumPy 1.6.1 data has unpickling issues on NumPy 1.7+ it's going to be
 very problematic.


 I was wondering what datetime you were using since the version in 1.6 had
 issues. Have you tested with both?

 Chuck


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Could you define "issues"? I haven't had a chance to make the library
compatible with both 1.6.1 and 1.7.0 yet (like Travis I'm using NumPy
1.6.1 from EPD); it's important though as pandas will be the first
widely used library I know of that will make heavy use of datetime64.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] datetime dtype possible regression

2012-04-28 Thread Wes McKinney
On Fri, Apr 27, 2012 at 4:57 PM, Robert Kern robert.k...@gmail.com wrote:
 On Fri, Apr 27, 2012 at 21:52, Travis Vaught tra...@vaught.net wrote:
 With NumPy 1.6.1 (from EPD 7.2-2) I get this behavior:


 ~

 In [1]: import numpy as np

 In [2]: schema = np.dtype({'names':['symbol', 'date', 'open', 'high', 'low',
    ...:                        'close', 'volume', 'adjclose'],
    ...:                    'formats':['S8', 'M8', float, float, float,
 float,
    ...:                        float, float]})

 In [3]: data = [("AAPL", "2012-04-12", 600.0, 605.0, 598.0, 602.0, 5000,
 602.0),]

 In [4]: recdata = np.array(data, dtype=schema)

 In [5]: recdata
 Out[5]:
 array([ ('AAPL', datetime.datetime(2012, 4, 12, 0, 0), 600.0, 605.0, 598.0,
 602.0, 5000.0, 602.0)],
       dtype=[('symbol', '|S8'), ('date', ('M8[us]', {})), ('open', 'f8'),
 ('high', 'f8'), ('low', 'f8'), ('close', 'f8'), ('volume', 'f8'),
 ('adjclose', 'f8')])


 


 With numpy-1.7.0.dev_3cb783e I get this:

 import numpy as np

 schema =
 np.dtype({'names':['symbol','data','open','high','low','close','volume','adjclose'],
 'formats':['S8','M8',float,float,float,float,float,float]})

 data =  [("AAPL", "2012-04-12", 600.0, 605.0, 598.0, 602.0, 5000,
 602.0),]

 recdata = np.array(data, dtype=schema)
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 ValueError: Cannot create a NumPy datetime other than NaT with generic units

 np.version.version
 '1.7.0.dev-3cb783e'

 

 Any hints about a regression I can check for? Or perhaps I missed an api
 change for specifying datetime dtypes?

 Judging from the error message, it looks like an intentional API change.

 --
 Robert Kern
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

Maybe this should be raised as a bug (where do we report NumPy bugs
these days, still Trac?). As I'm moving to datetime64 in pandas if
NumPy 1.6.1 data has unpickling issues on NumPy 1.7+ it's going to be
very problematic.
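
A likely workaround for the error quoted above, assuming the
generic-unit restriction in 1.7 is intentional, is to give the datetime
field explicit units, e.g. 'M8[D]' instead of bare 'M8'. A minimal
sketch (not from the original report):

import numpy as np

# same schema as above, but with an explicit day unit on the date field
schema = np.dtype({'names': ['symbol', 'date', 'open', 'high', 'low',
                             'close', 'volume', 'adjclose'],
                   'formats': ['S8', 'M8[D]', float, float, float,
                               float, float, float]})

data = [("AAPL", "2012-04-12", 600.0, 605.0, 598.0, 602.0, 5000, 602.0)]
recdata = np.array(data, dtype=schema)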
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Improving NumPy's indexing / subsetting / fancy indexing implementation

2012-04-05 Thread Wes McKinney
dear all,

I've routinely found that:

1) ndarray.take is up to 1 order of magnitude faster than fancy indexing
2) Hand-coded Cython boolean indexing is many times faster than ndarray indexing
3) putmask is significantly faster than ndarray indexing

For example, I  stumbled on this tonight:

straightforward cython code:

def row_bool_subset(ndarray[float64_t, ndim=2] values,
                    ndarray[uint8_t, cast=True] mask):
    cdef:
        Py_ssize_t i, j, n, k, pos = 0
        ndarray[float64_t, ndim=2] out

    n, k = (<object> values).shape
    assert(n == len(mask))

    out = np.empty((mask.sum(), k), dtype=np.float64)

    for i in range(n):
        if mask[i]:
            for j in range(k):
                out[pos, j] = values[i, j]
            pos += 1

    return out

In [1]: values = randn(100, 4)

In [2]: mask = np.ones(100, dtype=bool)

In [3]: import pandas._sandbox as sbx

In [4]: result = sbx.row_bool_subset(values, mask)

In [5]: timeit result = sbx.row_bool_subset(values, mask)
100 loops, best of 3: 9.58 ms per loop

pure NumPy:

In [6]: timeit values[mask]
10 loops, best of 3: 81.6 ms per loop

Here's the kind of take performance problems that I routinely experience:

In [12]: values = randn(100, 4)

In [13]: values.shape
Out[13]: (100, 4)

In [14]: indexer = np.random.permutation(100)[:50]

In [15]: timeit values[indexer]
10 loops, best of 3: 70.7 ms per loop

In [16]: timeit values.take(indexer, axis=0)
100 loops, best of 3: 13.3 ms per loop
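
A workaround consistent with the numbers above is to route boolean
selection through take by converting the mask to integer positions
first (an illustrative sketch using the arrays already defined):

positions = np.nonzero(mask)[0]
subset = values.take(positions, axis=0)   # same rows as values[mask]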

When I can spare the time in the future I will personally work on
these indexing routines in the C codebase, but I suspect that I'm not
the only one adversely affected by these kinds of performance
problems, and it might be worth thinking about a community effort to
split up the work of retooling these methods to be more performant. We
could use a tool like my vbench project (github.com/wesm/vbench) to
make a list of the performance benchmarks of interest (like the ones
above).

Unfortunately I am too time constrained at least for the next 6 months
to devote to a complete rewrite of the code in question. Possibly
sometime in 2013 if no one has gotten to it yet, but this seems like
someplace that we should be concerned about as the language
performance wars continue to rage.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] draft enum NEP

2012-03-11 Thread Wes McKinney
On Fri, Mar 9, 2012 at 5:48 PM, David Gowers (kampu) 00a...@gmail.com wrote:
 Hi,

 On Sat, Mar 10, 2012 at 3:25 AM, Bryan Van de Ven bry...@continuum.io wrote:
 Hi all,

 I have started working on a NEP for adding an enumerated type to NumPy.
 It is on my GitHub:

     https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst

 It is still very rough, and incomplete in places. But I would like to
 get feedback sooner rather than later in order to refine it. In
 particular there are a few questions inline in the document that I would
 like input on. Any comments, suggestions, questions, concerns, etc. are
 very welcome.

 t = np.dtype('enum', map=(n,v))

 ^ Is this supposed to be indicating 'this is an enum with values
 ranging between n and v'? It could be a bit more clear.

 Is it possible to partially define an enum? That is, give the maximum
 and minimum values, and only some of the enumeration value:name
 mappings?
 For example, an enum where 0 means 'n/a', +n means 'Type A Object
 #(n-1)' and -n means 'Type B Object #(abs(n) - 1)'. I just want to map
 the non-scalar values, while having a way to avoid treating valid
 scalar values (eg +64) as out-of-range.
 Example of what I mean:

 t = np.dtype('enum[N_A:0]', range = (-127, 127))
 (defined values being printed as a string, undefined being printed as a 
 number.)

 David
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

I'll have to think about this (a little brain dump here). I have many
use cases in pandas where this would be useful which are basically
direct translations of R's factor data type. Note that R always
coerces the levels (the unique values) AFAICT to string type. However,
mapping back to a well-dtyped array is important, too. So the
temptation might be to do something like this:

ndarray: dtype storage type (uint32 or something)
mapping : khash with type PyObject* -> uint32

Now, one problem with this is that you want the mapping + dtype to be
invertible (otherwise you're left doing some type inference). The way
that I implement the mapping is to restrict the labeling to be from 0
to N - 1 which makes things easier. If we decide that having an
explicit value mapping

The nice thing about this is that the same set of core algorithms can
be used to fix numpy.unique. For example you would like to be able to
do:

enum_arr = np.enum(arr)

(this seems like a reasonable API to me) and that is a direct
equivalent of R's factor function. You need to be able to pass an
explicit ordering when calling the enum/factor function. If not
specified, you should have an option to either sort or not-- for
example suppose you convert an array of 1 million integers to enum but
you don't particularly care about the uniques (which could be very
large, up to the size of the array) being ordered (no need to pay N
log N for large N).
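
A rough sketch of the factorize mapping described above (labels
restricted to 0..N-1 plus an array of uniques; a plain Python dict
stands in for khash here, purely for illustration):

import numpy as np

def factorize(values, sort=False):
    # Map values to integer codes 0..N-1 and return (codes, uniques).
    table = {}
    codes = np.empty(len(values), dtype=np.int64)
    uniques = []
    for i, val in enumerate(values):
        code = table.get(val)
        if code is None:
            code = len(uniques)
            table[val] = code
            uniques.append(val)
        codes[i] = code
    uniques = np.asarray(uniques)
    if sort:
        # relabel codes so they follow the sorted order of the uniques
        order = np.argsort(uniques)
        inverse = np.empty_like(order)
        inverse[order] = np.arange(len(order))
        codes = inverse[codes]
        uniques = uniques[order]
    return codes, uniques

codes, uniques = factorize(['b', 'b', 'a', 'c', 'a'])
# codes -> [0, 0, 1, 2, 1], uniques -> ['b', 'a', 'c']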

One nice thing about khash is that it can be serialized fairly easily.

Have you looked much at how I use enum-like ideas in pandas? It would
be great if I could offload some of this data algorithmic work to
NumPy.

We will want the enum data type to integrate with text file readers--
if you factorize as you go you can drastically reduce the memory
usage of structured array (or pandas DataFrame) columns with
long-ish strings and relatively few unique values.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Migrating issues to GitHub

2012-02-26 Thread Wes McKinney
On Thu, Feb 16, 2012 at 4:32 PM, Ralf Gommers
ralf.gomm...@googlemail.com wrote:


 On Thu, Feb 16, 2012 at 10:20 PM, Thouis (Ray) Jones tho...@gmail.com
 wrote:

 On Thu, Feb 16, 2012 at 19:25, Ralf Gommers ralf.gomm...@googlemail.com
 wrote:
  In another thread Jira was proposed as an alternative to Trac. Can you
  point
  out some of its strengths and weaknesses, and tell us why you decided to
  move away from it?

 




 Jira reminded me of Java.


 OK, you convinced me:)

 Ralf

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Having just created a NumPy ticket
(http://projects.scipy.org/numpy/ticket/2065) I feel pretty strongly
about moving the issue tracker to GitHub--the lack of attachments is
easy to work around. I think it will help a lot for building more
community engagement with the development process. My experience using
it with pandas has been very positive-- I have churned through around
700 issues over the last 12 months and I've never felt like it's
gotten in my way (except for the occasional CSS/JS bugs in the web
UI).

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-25 Thread Wes McKinney
On Fri, Feb 24, 2012 at 9:07 AM, Erin Sheldon erin.shel...@gmail.com wrote:
 Excerpts from Travis Oliphant's message of Thu Feb 23 15:08:52 -0500 2012:
 This is actually on my short-list as well --- it just didn't make it to the 
 list.

 In fact, we have someone starting work on it this week.  It is his
 first project so it will take him a little time to get up to speed on
 it, but he will contact Wes and work with him and report progress to
 this list.

 Integration with np.loadtxt is a high-priority.  I think loadtxt is
 now the 3rd or 4th text-reading interface I've seen in NumPy.  I
 have no interest in making a new one if we can avoid it.   But, we do
 need to make it faster with less memory overhead for simple cases like
 Wes describes.

 I'm willing to adapt my code if it is wanted, but at the same time I
 don't want to step on this person's toes.  Should I proceed?

 -e
 --
 Erin Scott Sheldon
 Brookhaven National Laboratory
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

That may work-- I haven't taken a look at the code but it is probably
a good starting point. We could create a new repo on the pydata GitHub
org (http://github.com/pydata) and use that as our point of
collaboration. I will hopefully be able to put some serious energy
into this this spring.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
dear all,

I haven't read all 180 e-mails, but I didn't see this on Travis's
initial list.

All of the existing flat file reading solutions I have seen are
not suitable for many applications, and they compare very unfavorably
to tools present in other languages, like R. Here are some of the
main issues I see:

- Memory usage: creating millions of Python objects when reading
  a large file results in horrendously bad memory utilization,
  which the Python interpreter is loathe to return to the
  operating system. Any solution using the CSV module (like
  pandas's parsers-- which are a lot faster than anything else I
  know of in Python) suffers from this problem because the data
  come out boxed in tuples of PyObjects. Try loading a 1,000,000
  x 20 CSV file into a structured array using np.genfromtxt or
  into a DataFrame using pandas.read_csv and you will immediately
  see the problem. R, by contrast, uses very little memory.

- Performance: post-processing of Python objects results in poor
  performance. Also, for the actual parsing, anything regular
  expression based (like the loadtable effort over the summer,
  all apologies to those who worked on it), is doomed to
  failure. I think having a tool with a high degree of
  compatibility and intelligence for parsing unruly small files
  does make sense though, but it's not appropriate for large,
  well-behaved files.

- Need to factorize: as soon as there is an enum dtype in
  NumPy, we will want to enable the file parsers for structured
  arrays and DataFrame to be able to factorize / convert to
  enum certain columns (for example, all string columns) during
  the parsing process, and not afterward. This is very important
  for enabling fast groupby on large datasets and reducing
  unnecessary memory usage up front (imagine a column with a
  million values, with only 10 unique values occurring). This
  would be trivial to implement using a C hash table
  implementation like khash.h

To be clear: I'm going to do this eventually whether or not it
happens in NumPy because it's an existing problem for heavy
pandas users. I see no reason why the code can't emit structured
arrays, too, so we might as well have a common library component
that I can use in pandas and specialize to the DataFrame internal
structure.

It seems clear to me that this work needs to be done at the
lowest level possible, probably all in C (or C++?) or maybe
Cython plus C utilities.
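
To make the factorize-during-parsing idea above concrete, here is a toy
pure-Python sketch (a real implementation would live in C or Cython
with a khash table, as described; the function name and layout are
illustrative only):

import numpy as np

def read_column_factorized(path, column, sep=','):
    # Parse one CSV column, encoding values as integer codes on the fly,
    # so memory holds one small int per row plus one object per unique value.
    table = {}          # value -> code (khash in a real implementation)
    uniques = []
    codes = []
    with open(path) as f:
        header = f.readline().rstrip('\n').split(sep)
        idx = header.index(column)
        for line in f:
            val = line.rstrip('\n').split(sep)[idx]
            code = table.get(val)
            if code is None:
                code = len(uniques)
                table[val] = code
                uniques.append(val)
            codes.append(code)
    return np.array(codes, dtype=np.int32), np.array(uniques, dtype=object)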

If anyone wants to get involved in this particular problem right
now, let me know!

best,
Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:08 PM, Travis Oliphant tra...@continuum.io wrote:
 This is actually on my short-list as well --- it just didn't make it to the 
 list.

 In fact, we have someone starting work on it this week.  It is his first 
 project so it will take him a little time to get up to speed on it, but he 
 will contact Wes and work with him and report progress to this list.

 Integration with np.loadtxt is a high-priority.  I think loadtxt is now the 
 3rd or 4th text-reading interface I've seen in NumPy.  I have no interest 
 in making a new one if we can avoid it.   But, we do need to make it faster 
 with less memory overhead for simple cases like Wes describes.

 -Travis

Yeah, what I envision is just an infrastructural parsing engine to
replace the pure Python guts of np.loadtxt, np.genfromtxt, and the csv
module + Cython guts of pandas.read_{csv, table, excel}. It needs to
be somewhat adaptable to some of the domain specific decisions of
structured arrays vs. DataFrames-- like I use Python objects for
strings, but one consequence of this is that I can intern strings
(only one PyObject per unique string value occurring) where structured
arrays cannot, so you get much better performance and memory usage
that way. That's soon to change, though, I gather, at which point I'll
almost definitely (!) move to pointer arrays instead of dtype=object
arrays.

- Wes



 On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:

 Hi,

 23.02.2012 20:32, Wes McKinney wrote:
 [clip]
 To be clear: I'm going to do this eventually whether or not it
 happens in NumPy because it's an existing problem for heavy
 pandas users. I see no reason why the code can't emit structured
 arrays, too, so we might as well have a common library component
 that I can use in pandas and specialize to the DataFrame internal
 structure.

 If you do this, one useful aim could be to design the code such that it
 can be used in loadtxt, at least as a fast path for common cases. I'd
 really like to avoid increasing the number of APIs for text file loading.

 --
 Pauli Virtanen

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:19 PM, Warren Weckesser
warren.weckes...@enthought.com wrote:

 On Thu, Feb 23, 2012 at 2:08 PM, Travis Oliphant tra...@continuum.io
 wrote:

 This is actually on my short-list as well --- it just didn't make it to
 the list.

 In fact, we have someone starting work on it this week.  It is his first
 project so it will take him a little time to get up to speed on it, but he
 will contact Wes and work with him and report progress to this list.

 Integration with np.loadtxt is a high-priority.  I think loadtxt is now
 the 3rd or 4th text-reading interface I've seen in NumPy.  I have no
 interest in making a new one if we can avoid it.   But, we do need to make
 it faster with less memory overhead for simple cases like Wes describes.

 -Travis



 I have a proof of concept CSV reader written in C (with a Cython
 wrapper).  I'll put it on github this weekend.

 Warren

Sweet, between this, Continuum folks, and me and my guys I think we
can come up with something good that suits all our needs. We should set
up some realistic performance test cases that we can monitor via
vbench (wesm/vbench) while we work on the project.

- W




 On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:

  Hi,
 
  23.02.2012 20:32, Wes McKinney wrote:
  [clip]
  To be clear: I'm going to do this eventually whether or not it
  happens in NumPy because it's an existing problem for heavy
  pandas users. I see no reason why the code can't emit structured
  arrays, too, so we might as well have a common library component
  that I can use in pandas and specialize to the DataFrame internal
  structure.
 
  If you do this, one useful aim could be to design the code such that it
  can be used in loadtxt, at least as a fast path for common cases. I'd
  really like to avoid increasing the number of APIs for text file
  loading.
 
  --
  Pauli Virtanen
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon erin.shel...@gmail.com wrote:
 Wes -

 I designed the recfile package to fill this need.  It might be a start.

 Some features:

    - the ability to efficiently read any subset of the data without
      loading the whole file.
    - reads directly into a recarray, so no overheads.
    - object oriented interface, mimicking recarray slicing.
    - also supports writing

 Currently it is fixed-width fields only.  It is C++, but wouldn't be
 hard to convert it to C if that is a requirement.  Also, it works for
 binary or ascii.

    http://code.google.com/p/recfile/

 the trunk is pretty far past the most recent release.

 Erin Scott Sheldon

Can you relicense as BSD-compatible?

 Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
 dear all,

 I haven't read all 180 e-mails, but I didn't see this on Travis's
 initial list.

 All of the existing flat file reading solutions I have seen are
 not suitable for many applications, and they compare very unfavorably
 to tools present in other languages, like R. Here are some of the
 main issues I see:

 - Memory usage: creating millions of Python objects when reading
   a large file results in horrendously bad memory utilization,
   which the Python interpreter is loathe to return to the
   operating system. Any solution using the CSV module (like
   pandas's parsers-- which are a lot faster than anything else I
   know of in Python) suffers from this problem because the data
   come out boxed in tuples of PyObjects. Try loading a 1,000,000
   x 20 CSV file into a structured array using np.genfromtxt or
   into a DataFrame using pandas.read_csv and you will immediately
   see the problem. R, by contrast, uses very little memory.

 - Performance: post-processing of Python objects results in poor
   performance. Also, for the actual parsing, anything regular
   expression based (like the loadtable effort over the summer,
   all apologies to those who worked on it), is doomed to
   failure. I think having a tool with a high degree of
   compatibility and intelligence for parsing unruly small files
   does make sense though, but it's not appropriate for large,
   well-behaved files.

 - Need to factorize: as soon as there is an enum dtype in
   NumPy, we will want to enable the file parsers for structured
   arrays and DataFrame to be able to factorize / convert to
   enum certain columns (for example, all string columns) during
   the parsing process, and not afterward. This is very important
   for enabling fast groupby on large datasets and reducing
   unnecessary memory usage up front (imagine a column with a
   million values, with only 10 unique values occurring). This
   would be trivial to implement using a C hash table
   implementation like khash.h
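
A rough sketch of that factorize-during-parsing idea from the point above
(a plain Python dict stands in for a C hash table such as khash; all names
and data are illustrative only):

import numpy as np

def factorize_tokens(tokens):
    """Map a sequence of strings to (codes, categories)."""
    code_for = {}          # value -> integer code
    categories = []        # code -> value
    codes = np.empty(len(tokens), dtype=np.int64)
    for k, tok in enumerate(tokens):
        code = code_for.get(tok)
        if code is None:
            code = len(categories)
            code_for[tok] = code
            categories.append(tok)
        codes[k] = code
    return codes, np.asarray(categories, dtype=object)

# A column with many rows but only a few distinct values stays small:
# integer codes plus the three category strings.
col = ["red", "green", "blue"] * 4
codes, cats = factorize_tokens(col)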

 To be clear: I'm going to do this eventually whether or not it
 happens in NumPy because it's an existing problem for heavy
 pandas users. I see no reason why the code can't emit structured
 arrays, too, so we might as well have a common library component
 that I can use in pandas and specialize to the DataFrame internal
 structure.

 It seems clear to me that this work needs to be done at the
 lowest level possible, probably all in C (or C++?) or maybe
 Cython plus C utilities.

 If anyone wants to get involved in this particular problem right
 now, let me know!

 best,
 Wes
 --
 Erin Scott Sheldon
 Brookhaven National Laboratory
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:31 PM, Éric Depagne e...@depagne.org wrote:
 Le jeudi 23 février 2012 21:24:28, Wes McKinney a écrit :

 That would indeed be great. Reading large files is a real pain whatever the
 python method used.

 BTW, could you tell us what you mean by large files?

 cheers,
 Éric.

Reasonably wide CSV files with hundreds of thousands to millions of
rows. I have a separate interest in JSON handling but that is a
different kind of problem, and probably just a matter of forking
ultrajson and having it not produce Python-object-based data
structures.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:55 PM, Erin Sheldon erin.shel...@gmail.com wrote:
 Excerpts from Wes McKinney's message of Thu Feb 23 15:45:18 -0500 2012:
 Reasonably wide CSV files with hundreds of thousands to millions of
 rows. I have a separate interest in JSON handling but that is a
 different kind of problem, and probably just a matter of forking
 ultrajson and having it not produce Python-object-based data
 structures.

 As a benchmark, recfile can read an uncached file with 350,000 lines and
 32 columns in about 5 seconds.  File size ~220M

 -e
 --
 Erin Scott Sheldon
 Brookhaven National Laboratory

That's pretty good. That's faster than pandas's csv-module+Cython
approach almost certainly (but I haven't run your code to get a read
on how much my hardware makes a difference), but that's not shocking
at all:

In [1]: df = DataFrame(np.random.randn(35, 32))

In [2]: df.to_csv('/home/wesm/tmp/foo.csv')

In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
Wall time: 7.04 s

I must think that skipping the process of creating 11.2 million Python
string objects and then individually converting each of them to float
accounts for most of the difference.

Note for reference (i'm skipping the first row which has the column
labels from above):

In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
dtype=None, delimiter=',', skip_header=1)
CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
Wall time: 24.67 s

In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
delimiter=',', skiprows=1)
CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
Wall time: 11.32 s

In this last case for example, around 500 MB of RAM is taken up for an
array that should only be about 80-90MB. If you're a data scientist
working in Python, this is _not good_.
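
(For reference, the 80-90 MB figure is just the raw float64 payload of the
array above:)

# 350,000 rows x 32 float64 columns, 8 bytes each
350_000 * 32 * 8 / 1e6   # -> 89.6 (MB)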

-W
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 4:20 PM, Erin Sheldon erin.shel...@gmail.com wrote:
 Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012:
 That's pretty good. That's faster than pandas's csv-module+Cython
 approach almost certainly (but I haven't run your code to get a read
 on how much my hardware makes a difference), but that's not shocking
 at all:

 In [1]: df = DataFrame(np.random.randn(35, 32))

 In [2]: df.to_csv('/home/wesm/tmp/foo.csv')

 In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
 CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
 Wall time: 7.04 s

 I must think that skipping the process of creating 11.2 mm Python
 string objects and then individually converting each of them to float.

 Note for reference (i'm skipping the first row which has the column
 labels from above):

 In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
 dtype=None, delimiter=',', skip_header=1)
 CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
 Wall time: 24.67 s

 In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
 delimiter=',', skiprows=1)
 CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
 Wall time: 11.32 s

 In this last case for example, around 500 MB of RAM is taken up for an
 array that should only be about 80-90MB. If you're a data scientist
 working in Python, this is _not good_.

 It might be good to compare on recarrays, which are a bit more complex.
 Can you try one of these .dat files?

    http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/

 The dtype is

 [('ra', 'f8'),
  ('dec', 'f8'),
  ('g1', 'f8'),
  ('g2', 'f8'),
  ('err', 'f8'),
  ('scinv', 'f8', 27)]

 --
 Erin Scott Sheldon
 Brookhaven National Laboratory

Forgot this one that is also widely used:

In [28]: %time recs =
matplotlib.mlab.csv2rec('/home/wesm/tmp/foo.csv', skiprows=1)
CPU times: user 65.16 s, sys: 0.30 s, total: 65.46 s
Wall time: 65.55 s

ok with one of those dat files and the dtype I get:

In [18]: %time arr =
np.genfromtxt('/home/wesm/Downloads/scat-05-000.dat', dtype=dtype,
skip_header=0, delimiter=' ')
CPU times: user 17.52 s, sys: 0.14 s, total: 17.66 s
Wall time: 17.67 s

The difference is not so stark in this case. I don't produce structured arrays, though.

In [26]: %time arr =
read_table('/home/wesm/Downloads/scat-05-000.dat', header=None, sep='
')
CPU times: user 10.15 s, sys: 0.10 s, total: 10.25 s
Wall time: 10.26 s

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Index Array Performance

2012-02-14 Thread Wes McKinney
On Tue, Feb 14, 2012 at 4:03 AM, Francesc Alted franc...@continuum.io wrote:
 On Feb 14, 2012, at 1:50 AM, Wes McKinney wrote:
 [clip]
 But:

 In [40]: timeit hist[i, j]
 1 loops, best of 3: 32 us per loop

 So that's roughly 7-8x slower than a simple Cython method, so I
 sincerely hope it could be brought down to the sub 10 microsecond
 level with a little bit of work.

 I vaguely remember this has shown up before.  My hunch is that indexing in 
 NumPy is so powerful, that it has to check for a lot of different values for 
 indices (integers, tuples, lists, arrays…), and that it is all these checks 
 that is taking time.  Your Cython wrapper just assumed that the indices were 
 integers, so this is probably the reason why it is that much faster.

 This is not to say that indexing in NumPy could not be accelerated, but it 
 won't be trivial, IMO.


Given that __getitem__ and __setitem__ receive a 2-tuple of
1-dimensional integer arrays, it should be pretty simple (dare I say
trivial? :) ) to optimize for this use case. The abysmal performance
of __getitem__ and __setitem__ with 1d integer arrays is pretty
high on my list of annoyances with NumPy (especially when take and put
are so much faster), so you guys may see a pull request from me
whenever I can spare the time to hack on it (assuming you don't beat
me to it)!
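
For the record, the take/put route I'm comparing against looks roughly like
this (array names and sizes invented for illustration):

import numpy as np

# Work around slow 2-d integer fancy indexing by flattening to 1-d
# positions and using take/put, which go through a faster code path.
hist = np.zeros((601, 800), dtype=np.int64)
i = np.random.randint(0, 601, size=800)
j = np.arange(800)

flat = np.ravel_multi_index((i, j), hist.shape)

got = hist.take(flat)         # equivalent to hist[i, j]
hist.put(flat, got + 1)       # equivalent to hist[i, j] = got + 1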

 -- Francesc Alted



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Index Array Performance

2012-02-13 Thread Wes McKinney
On Mon, Feb 13, 2012 at 6:23 PM, Marcel Oliver
m.oli...@jacobs-university.de wrote:
 Hi,

 I have a short piece of code where the use of an index array feels
 right, but incurs a severe performance penalty: It's about an order
 of magnitude slower than all other operations with arrays of that
 size.

 It comes up in a piece of code which is doing a large number of on
 the fly histograms via

  hist[i,j] += 1

 where i is an array with the bin index to be incremented and j is
 simply enumerating the histograms.  I attach a full short sample code
 below which shows how it's being used in context, and corresponding
 timeit output from the critical code section.

 Questions:

 - Is this a matter of principle, or due to an inefficient
  implementation?
 - Is there an equivalent way of doing it which is fast?

 Regards,
 Marcel

 =

 #! /usr/bin/env python
 # Plot the bifurcation diagram of the logistic map

 from pylab import *

 Nx = 800
 Ny = 600
 I = 5

 rmin = 2.5
 rmax = 4.0
 ymin = 0.0
 ymax = 1.0

 rr = linspace (rmin, rmax, Nx)
 x = 0.5*ones(rr.shape)
 hist = zeros((Ny+1,Nx), dtype=int)
 j = arange(Nx)

 dy = ymax/Ny

 def f(x):
    return rr*x*(1.0-x)

 for n in xrange(1000):
    x = f(x)

 for n in xrange(I):
    x = f(x)
    i = array(x/dy, dtype=int)
    hist[i,j] += 1

 figure()

 imshow(hist,
       cmap='binary',
       origin='lower',
       interpolation='nearest',
       extent=(rmin,rmax,ymin,ymax),
       norm=matplotlib.colors.LogNorm())

 xlabel ('$r$')
 ylabel ('$x$')

 title('Attractor of the logistic map $x_{n+1} = r \, x_n (1-x_n)$')

 show()

 

 In [4]: timeit y=f(x)
 1 loops, best of 3: 19.4 us per loop

 In [5]: timeit i = array(x/dy, dtype=int)
 1 loops, best of 3: 22 us per loop

 In [6]: timeit img[i,j] += 1
 1 loops, best of 3: 119 us per loop
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

This suggests to me that fancy indexing could be quite a bit faster in
this case:

In [40]: timeit hist[i,j] += 1
1 loops, best of 3: 58.2 us per loop
In [39]: timeit hist.put(np.ravel_multi_index((i, j), hist.shape), 1)
1 loops, best of 3: 20.6 us per loop

I wrote a simple Cython method

def fancy_inc(ndarray[int64_t, ndim=2] values,
              ndarray[int64_t] iarr, ndarray[int64_t] jarr, int64_t inc):
    cdef:
        Py_ssize_t i, n = len(iarr)

    for i in range(n):
        values[iarr[i], jarr[i]] += inc

that does even faster

In [8]: timeit sbx.fancy_inc(hist, i, j, 1)
10 loops, best of 3: 4.85 us per loop

About 10% faster if bounds checking and wraparound are disabled.

Kind of a bummer-- perhaps this should go high on the NumPy 2.0 TODO list?
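
For reference, NumPy 1.8 and later expose this unbuffered operation directly
as np.add.at, which also handles repeated index pairs correctly (a small
sketch with invented array sizes):

import numpy as np

hist = np.zeros((601, 800), dtype=np.int64)
i = np.random.randint(0, 601, size=800)
j = np.arange(800)

# Unbuffered equivalent of hist[i, j] += 1
np.add.at(hist, (i, j), 1)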

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Index Array Performance

2012-02-13 Thread Wes McKinney
On Mon, Feb 13, 2012 at 7:30 PM, Nathaniel Smith n...@pobox.com wrote:
 How would you fix it? I shouldn't speculate without profiling, but I'll be
 naughty. Presumably the problem is that python turns that into something
 like

 hist[i,j] = hist[i,j] + 1

 which means there's no way for numpy to avoid creating a temporary array. So
 maybe this could be fixed by adding a fused __inplace_add__ protocol to the
 language (and similarly for all the other inplace operators), but that seems
 really unlikely. Fundamentally this is just the sort of optimization
 opportunity you miss when you don't have a compiler with a global view;
 Fortran or c++ expression templates will win every time. Maybe pypy will fix
 it someday.

 Perhaps it would help to make np.add(hist, 1, out=hist, where=(i,j)) work?

 - N

Nope, don't buy it:

In [33]: timeit arr.__iadd__(1)
1000 loops, best of 3: 1.13 ms per loop

In [37]: timeit arr[:] += 1
1000 loops, best of 3: 1.13 ms per loop

- Wes

 On Feb 14, 2012 12:18 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Mon, Feb 13, 2012 at 6:23 PM, Marcel Oliver
 m.oli...@jacobs-university.de wrote:
  Hi,
 
  I have a short piece of code where the use of an index array feels
  right, but incurs a severe performance penalty: It's about an order
  of magnitude slower than all other operations with arrays of that
  size.
 
  It comes up in a piece of code which is doing a large number of on
  the fly histograms via
 
   hist[i,j] += 1
 
  where i is an array with the bin index to be incremented and j is
  simply enumerating the histograms.  I attach a full short sample code
  below which shows how it's being used in context, and corresponding
  timeit output from the critical code section.
 
  Questions:
 
  - Is this a matter of principle, or due to an inefficient
   implementation?
  - Is there an equivalent way of doing it which is fast?
 
  Regards,
  Marcel
 
  =
 
  #! /usr/bin/env python
  # Plot the bifurcation diagram of the logistic map
 
  from pylab import *
 
  Nx = 800
  Ny = 600
  I = 5
 
  rmin = 2.5
  rmax = 4.0
  ymin = 0.0
  ymax = 1.0
 
  rr = linspace (rmin, rmax, Nx)
  x = 0.5*ones(rr.shape)
  hist = zeros((Ny+1,Nx), dtype=int)
  j = arange(Nx)
 
  dy = ymax/Ny
 
  def f(x):
     return rr*x*(1.0-x)
 
  for n in xrange(1000):
     x = f(x)
 
  for n in xrange(I):
     x = f(x)
     i = array(x/dy, dtype=int)
     hist[i,j] += 1
 
  figure()
 
  imshow(hist,
        cmap='binary',
        origin='lower',
        interpolation='nearest',
        extent=(rmin,rmax,ymin,ymax),
        norm=matplotlib.colors.LogNorm())
 
  xlabel ('$r$')
  ylabel ('$x$')
 
  title('Attractor of the logistic map $x_{n+1} = r \, x_n (1-x_n)$')
 
  show()
 
  
 
  In [4]: timeit y=f(x)
  1 loops, best of 3: 19.4 us per loop
 
  In [5]: timeit i = array(x/dy, dtype=int)
  1 loops, best of 3: 22 us per loop
 
  In [6]: timeit img[i,j] += 1
  1 loops, best of 3: 119 us per loop
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion

 This suggests to me that fancy indexing could be quite a bit faster in
 this case:

 In [40]: timeit hist[i,j] += 1
 1 loops, best of 3: 58.2 us per loop
 In [39]: timeit hist.put(np.ravel_multi_index((i, j), hist.shape), 1)
 1 loops, best of 3: 20.6 us per loop

 I wrote a simple Cython method

 def fancy_inc(ndarray[int64_t, ndim=2] values,
              ndarray[int64_t] iarr, ndarray[int64_t] jarr, int64_t inc):
    cdef:
        Py_ssize_t i, n = len(iarr)

    for i in range(n):
        values[iarr[i], jarr[i]] += inc

 that does even faster

 In [8]: timeit sbx.fancy_inc(hist, i, j, 1)
 10 loops, best of 3: 4.85 us per loop

 About 10% faster if bounds checking and wraparound are disabled.

 Kind of a bummer-- perhaps this should go high on the NumPy 2.0 TODO list?

 - Wes
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Index Array Performance

2012-02-13 Thread Wes McKinney
On Mon, Feb 13, 2012 at 7:46 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Mon, Feb 13, 2012 at 7:30 PM, Nathaniel Smith n...@pobox.com wrote:
 How would you fix it? I shouldn't speculate without profiling, but I'll be
 naughty. Presumably the problem is that python turns that into something
 like

 hist[i,j] = hist[i,j] + 1

 which means there's no way for numpy to avoid creating a temporary array. So
 maybe this could be fixed by adding a fused __inplace_add__ protocol to the
 language (and similarly for all the other inplace operators), but that seems
 really unlikely. Fundamentally this is just the sort of optimization
 opportunity you miss when you don't have a compiler with a global view;
 Fortran or c++ expression templates will win every time. Maybe pypy will fix
 it someday.

 Perhaps it would help to make np.add(hist, 1, out=hist, where=(i,j)) work?

 - N

 Nope, don't buy it:

 In [33]: timeit arr.__iadd__(1)
 1000 loops, best of 3: 1.13 ms per loop

 In [37]: timeit arr[:] += 1
 1000 loops, best of 3: 1.13 ms per loop

 - Wes

Actually, apologies, I'm being silly (had too much coffee or
something). Python may be doing something nefarious with the hist[i,j]
+= 1: it expands into a get, then an add, then a set, which is probably
the problem.
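
Concretely, the augmented assignment executes roughly like the sketch below
(illustrative only), which also shows why repeated index pairs only get
incremented once:

import numpy as np

hist = np.zeros((5, 5), dtype=np.int64)
i = np.array([0, 0, 1])   # note the repeated (0, 0) position
j = np.array([0, 0, 2])

# hist[i, j] += 1 is executed roughly as:
tmp = hist[i, j]          # fancy-indexed get (builds a temporary)
tmp = tmp + 1             # add
hist[i, j] = tmp          # fancy-indexed set

# Consequence: the repeated position (0, 0) ends up at 1, not 2.
print(hist[0, 0])  # -> 1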

 On Feb 14, 2012 12:18 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Mon, Feb 13, 2012 at 6:23 PM, Marcel Oliver
 m.oli...@jacobs-university.de wrote:
  Hi,
 
  I have a short piece of code where the use of an index array feels
  right, but incurs a severe performance penalty: It's about an order
  of magnitude slower than all other operations with arrays of that
  size.
 
  It comes up in a piece of code which is doing a large number of on
  the fly histograms via
 
   hist[i,j] += 1
 
  where i is an array with the bin index to be incremented and j is
  simply enumerating the histograms.  I attach a full short sample code
  below which shows how it's being used in context, and corresponding
  timeit output from the critical code section.
 
  Questions:
 
  - Is this a matter of principle, or due to an inefficient
   implementation?
  - Is there an equivalent way of doing it which is fast?
 
  Regards,
  Marcel
 
  =
 
  #! /usr/bin/env python
  # Plot the bifurcation diagram of the logistic map
 
  from pylab import *
 
  Nx = 800
  Ny = 600
  I = 5
 
  rmin = 2.5
  rmax = 4.0
  ymin = 0.0
  ymax = 1.0
 
  rr = linspace (rmin, rmax, Nx)
  x = 0.5*ones(rr.shape)
  hist = zeros((Ny+1,Nx), dtype=int)
  j = arange(Nx)
 
  dy = ymax/Ny
 
  def f(x):
     return rr*x*(1.0-x)
 
  for n in xrange(1000):
     x = f(x)
 
  for n in xrange(I):
     x = f(x)
     i = array(x/dy, dtype=int)
     hist[i,j] += 1
 
  figure()
 
  imshow(hist,
        cmap='binary',
        origin='lower',
        interpolation='nearest',
        extent=(rmin,rmax,ymin,ymax),
        norm=matplotlib.colors.LogNorm())
 
  xlabel ('$r$')
  ylabel ('$x$')
 
  title('Attractor of the logistic map $x_{n+1} = r \, x_n (1-x_n)$')
 
  show()
 
  
 
  In [4]: timeit y=f(x)
  1 loops, best of 3: 19.4 us per loop
 
  In [5]: timeit i = array(x/dy, dtype=int)
  1 loops, best of 3: 22 us per loop
 
  In [6]: timeit img[i,j] += 1
  1 loops, best of 3: 119 us per loop
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion

 This suggests to me that fancy indexing could be quite a bit faster in
 this case:

 In [40]: timeit hist[i,j] += 1
 1 loops, best of 3: 58.2 us per loop
 In [39]: timeit hist.put(np.ravel_multi_index((i, j), hist.shape), 1)
 1 loops, best of 3: 20.6 us per loop

 I wrote a simple Cython method

 def fancy_inc(ndarray[int64_t, ndim=2] values,
              ndarray[int64_t] iarr, ndarray[int64_t] jarr, int64_t inc):
    cdef:
        Py_ssize_t i, n = len(iarr)

    for i in range(n):
        values[iarr[i], jarr[i]] += inc

 that does even faster

 In [8]: timeit sbx.fancy_inc(hist, i, j, 1)
 10 loops, best of 3: 4.85 us per loop

 About 10% faster if bounds checking and wraparound are disabled.

 Kind of a bummer-- perhaps this should go high on the NumPy 2.0 TODO list?

 - Wes
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Index Array Performance

2012-02-13 Thread Wes McKinney
On Mon, Feb 13, 2012 at 7:48 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Mon, Feb 13, 2012 at 7:46 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Mon, Feb 13, 2012 at 7:30 PM, Nathaniel Smith n...@pobox.com wrote:
 How would you fix it? I shouldn't speculate without profiling, but I'll be
 naughty. Presumably the problem is that python turns that into something
 like

 hist[i,j] = hist[i,j] + 1

 which means there's no way for numpy to avoid creating a temporary array. So
 maybe this could be fixed by adding a fused __inplace_add__ protocol to the
 language (and similarly for all the other inplace operators), but that seems
 really unlikely. Fundamentally this is just the sort of optimization
 opportunity you miss when you don't have a compiler with a global view;
 Fortran or c++ expression templates will win every time. Maybe pypy will fix
 it someday.

 Perhaps it would help to make np.add(hist, 1, out=hist, where=(i,j)) work?

 - N

 Nope, don't buy it:

 In [33]: timeit arr.__iadd__(1)
 1000 loops, best of 3: 1.13 ms per loop

 In [37]: timeit arr[:] += 1
 1000 loops, best of 3: 1.13 ms per loop

 - Wes

 Actually, apologies, I'm being silly (had too much coffee or
 something). Python may be doing something nefarious with the hist[i,j]
 += 1. So both a get, add, then set, which is probably the problem.

 On Feb 14, 2012 12:18 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Mon, Feb 13, 2012 at 6:23 PM, Marcel Oliver
 m.oli...@jacobs-university.de wrote:
  Hi,
 
  I have a short piece of code where the use of an index array feels
  right, but incurs a severe performance penalty: It's about an order
  of magnitude slower than all other operations with arrays of that
  size.
 
  It comes up in a piece of code which is doing a large number of on
  the fly histograms via
 
   hist[i,j] += 1
 
  where i is an array with the bin index to be incremented and j is
  simply enumerating the histograms.  I attach a full short sample code
  below which shows how it's being used in context, and corresponding
  timeit output from the critical code section.
 
  Questions:
 
  - Is this a matter of principle, or due to an inefficient
   implementation?
  - Is there an equivalent way of doing it which is fast?
 
  Regards,
  Marcel
 
  =
 
  #! /usr/bin/env python
  # Plot the bifurcation diagram of the logistic map
 
  from pylab import *
 
  Nx = 800
  Ny = 600
  I = 5
 
  rmin = 2.5
  rmax = 4.0
  ymin = 0.0
  ymax = 1.0
 
  rr = linspace (rmin, rmax, Nx)
  x = 0.5*ones(rr.shape)
  hist = zeros((Ny+1,Nx), dtype=int)
  j = arange(Nx)
 
  dy = ymax/Ny
 
  def f(x):
     return rr*x*(1.0-x)
 
  for n in xrange(1000):
     x = f(x)
 
  for n in xrange(I):
     x = f(x)
     i = array(x/dy, dtype=int)
     hist[i,j] += 1
 
  figure()
 
  imshow(hist,
        cmap='binary',
        origin='lower',
        interpolation='nearest',
        extent=(rmin,rmax,ymin,ymax),
        norm=matplotlib.colors.LogNorm())
 
  xlabel ('$r$')
  ylabel ('$x$')
 
  title('Attractor of the logistic map $x_{n+1} = r \, x_n (1-x_n)$')
 
  show()
 
  
 
  In [4]: timeit y=f(x)
  1 loops, best of 3: 19.4 us per loop
 
  In [5]: timeit i = array(x/dy, dtype=int)
  1 loops, best of 3: 22 us per loop
 
  In [6]: timeit img[i,j] += 1
  1 loops, best of 3: 119 us per loop
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion

 This suggests to me that fancy indexing could be quite a bit faster in
 this case:

 In [40]: timeit hist[i,j] += 1
 1 loops, best of 3: 58.2 us per loop
 In [39]: timeit hist.put(np.ravel_multi_index((i, j), hist.shape), 1)
 1 loops, best of 3: 20.6 us per loop

 I wrote a simple Cython method

 def fancy_inc(ndarray[int64_t, ndim=2] values,
              ndarray[int64_t] iarr, ndarray[int64_t] jarr, int64_t inc):
    cdef:
        Py_ssize_t i, n = len(iarr)

    for i in range(n):
        values[iarr[i], jarr[i]] += inc

 that does even faster

 In [8]: timeit sbx.fancy_inc(hist, i, j, 1)
 10 loops, best of 3: 4.85 us per loop

 About 10% faster if bounds checking and wraparound are disabled.

 Kind of a bummer-- perhaps this should go high on the NumPy 2.0 TODO list?

 - Wes
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


But:

In [40]: timeit hist[i, j]
1 loops, best of 3: 32 us per loop

So that's roughly 7-8x slower than a simple Cython method, so I
sincerely hope it could be brought down to the sub-10-microsecond
level with a little bit of work.

Re: [Numpy-discussion] fast method to to count a particular value in a large matrix

2012-02-05 Thread Wes McKinney
On Sun, Feb 5, 2012 at 7:02 PM, David Verelst david.vere...@gmail.com wrote:
 Just out of curiosity, what speed-up factor did you achieve?

 Regards,
 David

 On 04/02/12 22:20, Naresh wrote:
 Warren Weckesser (warren.weckesser at enthought.com) writes:

  On Sat, Feb 4, 2012 at 2:35 PM, Benjamin Root (ben.root at ou.edu) wrote:

   On Saturday, February 4, 2012, Naresh Pai (npai at uark.edu) wrote:

    I am somewhat new to Python (been coding with Matlab mostly). I am trying
    to simplify (and expedite) a piece of code that is currently a bottleneck
    in a larger code. I have a large array (7000 rows x 4500 columns) titled,
    say, abc, and I am trying to find a fast method to count the number of
    instances of each unique value within it. All unique values are stored in
    a variable, say, unique_elem. My current code is as follows:

    import numpy as np
    # allocate space for storing element count
    elem_count = zeros((len(unique_elem), 1))
    # loop through and count number of unique_elem
    for i in range(len(unique_elem)):
        elem_count[i] = np.sum(reduce(np.logical_or, (abc == x for x in [unique_elem[i]])))

    This loop is a bottleneck because I have about 850 unique elements and it
    takes about 9-10 minutes. Can you suggest a faster way to do this?

    Thank you, Naresh

   np.unique() can return indices and reverse indices.  It would be trivial
   to histogram the reverse indices using np.histogram().

  Instead of histogram(), you can use bincount() on the inverse indices:

      u, inv = np.unique(abc, return_inverse=True)
      n = np.bincount(inv)

  u will be an array of the unique elements, and n will be an array of the
  corresponding number of occurrences.

  Warren



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion at scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 The histogram() solution works perfectly since unique_elem is ordered. I
 appreciate everyone's help.


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

np.histogram works pretty well. I'm getting speeds something like 1300
ms on float64 data. A hash table-based solution is faster (no big
surprise here), about 800ms so in the ballpark of 40% faster.

Whenever I get motivated enough I'm going to make a pull request on
NumPy with something like khash.h and start fixing all the O(N log N)
algorithms floating around that ought to be O(N). NumPy should really
have a match function similar to R's and a lot of other things.
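
A sketch of what an R-style match could look like on top of a hash table
(a plain Python dict standing in for khash; names are illustrative):

import numpy as np

def match(values, targets):
    """For each element of `values`, return its position in `targets`,
    or -1 if it does not occur (analogous to R's match, but 0-based)."""
    table = {v: i for i, v in enumerate(targets)}
    return np.array([table.get(v, -1) for v in values], dtype=np.int64)

match(np.array([3, 1, 4, 9]), np.array([1, 3, 5]))
# -> array([ 1,  0, -1, -1])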

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy 'groupby'

2012-01-11 Thread Wes McKinney
On Wed, Jan 11, 2012 at 7:05 AM, Neal Becker ndbeck...@gmail.com wrote:
 Michael Hull wrote:

 Hi Everyone,
 First off, thanks for all your hard work on numpy, its a really great help!
 I was wondering if there was a standard 'groupby' in numpy, that
 similar to that in itertools.
 I know its not hard to write with np.diff, but I have found myself
 writing it on more than a couple of occasions, and wondered if
  there was a 'standarised' version I was missing out on??
 Thanks,


 Mike

 I've played with groupby in pandas.

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

I agree (unsurprisingly) that pandas is your best bet:

http://pandas.sourceforge.net/groupby.html

I've had it on my TODO list to extend the pandas groupby engine (which
has grown fairly sophisticated) to work with generic ndarrays and
record arrays:

https://github.com/wesm/pandas/issues/123

It shouldn't actually be that hard for most simple cases. I could
imagine the results of a groupby being somewhat difficult to interpret
without axis labeling/indexing, though.
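
For the simple cases, a label-free groupby-reduce on plain ndarrays can
already be pieced together from np.unique and np.bincount; a minimal
sketch:

import numpy as np

def groupby_sum(keys, values):
    """Sum `values` within the groups defined by `keys` (both 1-d arrays)."""
    uniques, inverse = np.unique(keys, return_inverse=True)
    sums = np.bincount(inverse, weights=values)
    return uniques, sums

keys = np.array(["a", "b", "a", "c", "b"])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
groupby_sum(keys, vals)
# -> (array(['a', 'b', 'c'], ...), array([4., 7., 4.]))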

cheers,
Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PyInt and Numpy's int64 conversion

2012-01-06 Thread Wes McKinney
On Wed, Jan 4, 2012 at 5:22 AM, xantares 09 xantare...@hotmail.com wrote:


 From: wesmck...@gmail.com
 Date: Sat, 24 Dec 2011 19:51:06 -0500

 To: numpy-discussion@scipy.org
 Subject: Re: [Numpy-discussion] PyInt and Numpy's int64 conversion

 On Sat, Dec 24, 2011 at 3:11 AM, xantares 09 xantare...@hotmail.com
 wrote:
 
 
  From: wesmck...@gmail.com
  Date: Fri, 23 Dec 2011 12:31:45 -0500
  To: numpy-discussion@scipy.org
  Subject: Re: [Numpy-discussion] PyInt and Numpy's int64 conversion
 
 
  On Fri, Dec 23, 2011 at 4:37 AM, xantares 09 xantare...@hotmail.com
  wrote:
   Hi,
  
   I'm using Numpy from the C python api side while tweaking my SWIG
   interface
   to work with numpy array types.
   I want to convert a numpy array of integers (whose elements are
   numpy's
   'int64')
   The problem is that it this int64 type is not compatible with the
   standard
   python integer type:
   I cannot use PyInt_Check, and PyInt_AsUnsignedLongMask to check and
   convert
   from int64: basically PyInt_Check returns false.
   I checked the numpy config header and npy_int64 does have a size of
   8o,
   which should be the same as int on my x86_64.
   What is the correct way to do that ?
   I checked for a Int64_Check function and didn't find any in numpy
   headers.
  
   Regards,
  
   x.
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
 
  hello,
 
  I think you'll want to use the C macro PyArray_IsIntegerScalar, e.g.
  in pandas I have the following function exposed to my Cython code:
 
  PANDAS_INLINE int
  is_integer_object(PyObject* obj) {
  return PyArray_IsIntegerScalar(obj);
  }
 
  last time I checked that macro detects Python int, long, and all of
  the NumPy integer hierarchy (int8, 16, 32, 64). If you ONLY want to
  check for int64 I am not 100% sure the best way.
 
  - Wes
 
  Hi,
 
  Thank you for your reply !
 
  That's the thing : I want to check/convert every type of integer,
  numpy's
  int64 and also python standard ints.
  Is there a way to avoid to use only the python api ? ( and avoid to
  depend
  on numpy's PyArray_* functions )
 
  Regards.
 
  x.
 
 
 
 
 
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

 No. All of the PyTypeObject objects for the NumPy array scalars are
 explicitly part of the NumPy C API so you have no choice but to depend
 on that (to get the best performance). If you want to ONLY check for
 int64 at the C API level, I did a bit of digging and the relevant type
 definitions are in


 https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/npy_common.h

 so you'll want to do:

 int is_int64(PyObject* obj){
 return PyObject_TypeCheck(obj, PyInt64ArrType_Type);
 }

 and that will *only* detect np.int64

 - Wes

 Ok many thanks !

 One last thing, do you happen to know how to actually convert an np int64 to
 a C int ?

 - x.

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Not sure off-hand. You'll have to look at the NumPy scalar API in the C code
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Enum type

2012-01-03 Thread Wes McKinney
On Tue, Jan 3, 2012 at 12:46 PM, Ognen Duzlevski og...@enthought.com wrote:
 Hello,

 I am playing with adding an enum dtype to numpy (to get my feet wet in
 numpy really). I have looked at the
 https://github.com/martinling/numpy_quaternion and I feel comfortable
 with my understanding of adding a simple type to numpy in technical
 terms.

 I am mostly a C programmer and have programmed in Python but not at
 the level where my code could be considered pretty or maybe even
 pythonic. I know enums from C and have browsed around a few python
 enum implementations online. Most of them use hash tables or lists to
 associate names to numbers - these approaches just feel heavy to me.

 What would be a proper numpy approach to this? I am looking mostly
 for direction and advice as I would like to do the work myself :-)

 Any input appreciated :-)
 Ognen
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

You should use a hash table internally in my opinion. I've started
using khash from klib (https://github.com/attractivechaos/klib) which
has excellent memory usage (more than 50% less than Python dict with
large hash tables) and good performance characteristics. With the enum
dtype you can avoid reference counting with primitive types, not sure
about object dtype. If enum arrays are mutable this will be very
tricky.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Enum type

2012-01-03 Thread Wes McKinney
On Tue, Jan 3, 2012 at 1:06 PM, Jim Vickroy jim.vick...@noaa.gov wrote:
 On 1/3/2012 10:46 AM, Ognen Duzlevski wrote:
 Hello,

 I am playing with adding an enum dtype to numpy (to get my feet wet in
 numpy really). I have looked at the
 https://github.com/martinling/numpy_quaternion and I feel comfortable
 with my understanding of adding a simple type to numpy in technical
 terms.

 I am mostly a C programmer and have programmed in Python but not at
  the level where my code could be considered pretty or maybe even
 pythonic. I know enums from C and have browsed around a few python
 enum implementations online. Most of them use hash tables or lists to
 associate names to numbers - these approaches just feel heavy to me.

 What would be a proper numpy approach to this? I am looking mostly
 for direction and advice as I would like to do the work myself :-)

 Any input appreciated :-)
 Ognen

 Does enumerate
 (http://docs.python.org/library/functions.html#enumerate) work for you?

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

That's not exactly what he means. The R lingo for this concept is
factor or, a bit more commonly, categorical variable:

http://stat.ethz.ch/R-manual/R-patched/library/base/html/factor.html

FWIW R's factor type is implemented using hash tables. I do the same in pandas.
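
In NumPy terms, the core of a factor/categorical is just a pair of arrays,
integer codes plus the unique levels; a minimal sketch:

import numpy as np

obs = np.array(["low", "high", "low", "medium", "high"])

# levels: the distinct values; codes: each observation as an index into levels
levels, codes = np.unique(obs, return_inverse=True)

print(levels)            # ['high' 'low' 'medium']
print(codes)             # [1 0 1 2 0]
print(levels[codes])     # round-trips back to the original observations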

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] GSOC

2011-12-29 Thread Wes McKinney
On Thu, Dec 29, 2011 at 4:36 PM, Ralf Gommers
ralf.gomm...@googlemail.com wrote:


 On Thu, Dec 29, 2011 at 9:50 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:

 Hi All,

 I thought I'd raise this topic just to get some ideas out there. At the
 moment I see two areas that I'd like to see addressed.

 Documentation editor. This would involve looking at the generated
 documentation and it's organization/coverage as well such things as style
 and maybe reviewing stuff on the documentation site. This would be more
 technical writing than coding.
 Test coverage. There are a lot of areas of numpy that are not well tested
 as well as some tests that are still doc tests and should probably be
 updated. This is a substantial amount of work and would require some
 familiarity with numpy as well as a willingness to ping developers for
 clarification of some topics.

 Thoughts?

 First thought: very useful, but probably not GSOC topics by themselves.

 For a very good student, I'd think topics like implementing NA bit masks or
 improved user-defined dtypes would be interesting. In SciPy there's also a
 lot to do, and that's probably a better project for students who prefer to
 work in Python.

 Thanks for bringing this up. Last year we missed the boat, it would be good
 to get one or more slots this year.

 Ralf



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Along with test coverage, have any of you considered any systematic
monitoring of NumPy performance? With all of the extensive refactoring
/ work on the internals, it would be useful to keep an eye on things
in case of any performance regressions. I mention this because I
started a little prototype project (http://github.com/wesm/vbench) for
doing exactly that for my own development purposes-- it's already
proved extremely useful.

Anyway, just a thought. I'm sure a motivated student could spend a
whole summer writing unit tests for NumPy and nothing else.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PyInt and Numpy's int64 conversion

2011-12-24 Thread Wes McKinney
On Sat, Dec 24, 2011 at 3:11 AM, xantares 09 xantare...@hotmail.com wrote:


 From: wesmck...@gmail.com
 Date: Fri, 23 Dec 2011 12:31:45 -0500
 To: numpy-discussion@scipy.org
 Subject: Re: [Numpy-discussion] PyInt and Numpy's int64 conversion


 On Fri, Dec 23, 2011 at 4:37 AM, xantares 09 xantare...@hotmail.com
 wrote:
  Hi,
 
  I'm using Numpy from the C python api side while tweaking my SWIG
  interface
  to work with numpy array types.
  I want to convert a numpy array of integers (whose elements are numpy's
  'int64')
  The problem is that it this int64 type is not compatible with the
  standard
  python integer type:
  I cannot use PyInt_Check, and PyInt_AsUnsignedLongMask to check and
  convert
  from int64: basically PyInt_Check returns false.
  I checked the numpy config header and npy_int64 does have a size of 8o,
  which should be the same as int on my x86_64.
  What is the correct way to do that ?
  I checked for a Int64_Check function and didn't find any in numpy
  headers.
 
  Regards,
 
  x.
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

 hello,

 I think you'll want to use the C macro PyArray_IsIntegerScalar, e.g.
 in pandas I have the following function exposed to my Cython code:

 PANDAS_INLINE int
 is_integer_object(PyObject* obj) {
 return PyArray_IsIntegerScalar(obj);
 }

 last time I checked that macro detects Python int, long, and all of
 the NumPy integer hierarchy (int8, 16, 32, 64). If you ONLY want to
 check for int64 I am not 100% sure the best way.

 - Wes

 Hi,

 Thank you for your reply !

 That's the thing : I want to check/convert every type of integer, numpy's
 int64 and also python standard ints.
 Is there a way to avoid to use only the python api ? ( and avoid to depend
 on numpy's PyArray_* functions )

 Regards.

 x.






 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


No. All of the PyTypeObject objects for the NumPy array scalars are
explicitly part of the NumPy C API so you have no choice but to depend
on that (to get the best performance). If you want to ONLY check for
int64 at the C API level, I did a bit of digging and the relevant type
definitions are in

https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/npy_common.h

so you'll want to do:

int is_int64(PyObject* obj){
  return PyObject_TypeCheck(obj, PyInt64ArrType_Type);
}

and that will *only* detect np.int64

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PyInt and Numpy's int64 conversion

2011-12-23 Thread Wes McKinney
On Fri, Dec 23, 2011 at 4:37 AM, xantares 09 xantare...@hotmail.com wrote:
 Hi,

 I'm using Numpy from the C python api side while tweaking my SWIG interface
 to work with numpy array types.
 I want to convert a numpy array of integers (whose elements are numpy's
 'int64')
 The problem is that this int64 type is not compatible with the standard
 python integer type:
 I cannot use PyInt_Check, and PyInt_AsUnsignedLongMask to check and convert
 from int64: basically PyInt_Check returns false.
 I checked the numpy config header and npy_int64 does have a size of 8 bytes,
 which should be the same as int on my x86_64.
 What is the correct way to do that ?
 I checked for a Int64_Check function and didn't find any in numpy headers.

 Regards,

 x.

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


hello,

I think you'll want to use the C macro PyArray_IsIntegerScalar, e.g.
in pandas I have the following function exposed to my Cython code:

PANDAS_INLINE int
is_integer_object(PyObject* obj) {
  return PyArray_IsIntegerScalar(obj);
}

last time I checked that macro detects Python int, long, and all of
the NumPy integer hierarchy (int8, 16, 32, 64). If you ONLY want to
check for int64 I am not 100% sure the best way.
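
For comparison, the equivalent checks on the Python side (not the C API the
question is about, just to show roughly what the macro is described as
accepting) might look like:

import numpy as np

def is_integer_object(obj):
    # Python ints plus the whole NumPy integer hierarchy
    return isinstance(obj, (int, np.integer))

def is_int64(obj):
    # Restrict to np.int64 only
    return isinstance(obj, np.int64)

is_integer_object(5), is_integer_object(np.int32(5)), is_int64(np.int64(5))
# -> (True, True, True)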

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Wes McKinney
On Mon, Dec 12, 2011 at 12:34 PM, Warren Weckesser
warren.weckes...@enthought.com wrote:


 On Mon, Dec 12, 2011 at 10:22 AM, Chris.Barker chris.bar...@noaa.gov
 wrote:

 On 12/11/11 8:40 AM, Ralf Gommers wrote:
  On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker chris.bar...@noaa.gov
      * If we have a good, fast ascii (or unicode?) to array reader,
  hopefully
      it could be leveraged for use in the more complex cases. So that
  rather
      than genfromtxt() being written from scratch, it would be a wrapper
      around the lower-level reader.
 
  You seem to be contradicting yourself here. The more complex cases are
  Wes' 10% and why genfromtxt is so hairy internally. There's always a
  trade-off between speed and handling complex corner cases. You want
  both.

 I don't think the version in my mind is contradictory (Not quite).

 What I'm imagining is that a good, fast ascii to numpy array reader
 could read a whole table in at once (the common, easy, fast, case), but
 it could also be used to read snippets of a file in at a time, which
 could be leveraged to handle many of the more complex cases.

 I suppose there will always be cases where the user needs to write their
 own converter from string to dtype, and there is simply no way to
 leverage what I'm imagining to support that.

 Hmm, maybe there is -- for instance, if a record consisted of mostly
 standard, easy-to-parse, numbers, but one field was some weird text that
 needed custom parsing, we could read it as a dtype, with a string for
 that one weird field, and that could be converted in a post-processing
 step.

 Maybe that wouldn't be any faster or easier, but it could be done...
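
A rough sketch of that idea (invented data and field names; assumes the
encoding argument that newer versions of np.genfromtxt accept): parse the
well-behaved numeric fields directly, pull the odd field in as a string, and
convert it afterward.

import io
import numpy as np

raw = io.StringIO(
    "1.5,2.5,3kg\n"
    "4.0,5.0,7kg\n"
)

# Read the two easy numeric fields as float64 and the "weird" third
# field as a fixed-width string...
arr = np.genfromtxt(raw, delimiter=",", encoding="utf-8",
                    dtype=[("a", "f8"), ("b", "f8"), ("wt", "U8")])

# ...then convert the string field in a post-processing step.
weights = np.array([float(s.rstrip("kg")) for s in arr["wt"]])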

 Anyway, whether you can leverage it for the full-featured version or
 not, I do think there is call for a good, fast, 90% case text file parser.


 Would anyone like to join/form a small working group to work on this?

 Wes, I'd like to see your Cython version -- maybe a starting point?

 -Chris



 I'm also working on a faster text file reader, so count me in.  I've been
 experimenting in both C and Cython.   I'll put it on github as soon as I
 can.

 Warren





 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Cool, Warren, I look forward to seeing it. I'm hopeful we can craft a
performant tool that will meet the needs of many projects (NumPy,
pandas, etc.)...
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy 1.7.0 release?

2011-12-06 Thread Wes McKinney
On Tue, Dec 6, 2011 at 4:11 PM, Ralf Gommers
ralf.gomm...@googlemail.com wrote:


 On Mon, Dec 5, 2011 at 8:43 PM, Ralf Gommers ralf.gomm...@googlemail.com
 wrote:

 Hi all,

 It's been a little over 6 months since the release of 1.6.0 and the NA
 debate has quieted down, so I'd like to ask your opinion on the timing of
 1.7.0. It looks to me like we have a healthy amount of bug fixes and small
 improvements, plus three larger chucks of work:

 - datetime
 - NA
 - Bento support

 My impression is that both datetime and NA are releasable, but should be
 labeled tech preview or something similar, because they may still see
 significant changes. Please correct me if I'm wrong.

 There's still some maintenance work to do and pull requests to merge, but
 a beta release by Christmas should be feasible.


 To be a bit more detailed here, these are the most significant pull requests
 / patches that I think can be merged with a limited amount of work:
 meshgrid enhancements: http://projects.scipy.org/numpy/ticket/966
 sample_from function: https://github.com/numpy/numpy/pull/151
 loadtable function: https://github.com/numpy/numpy/pull/143

 Other maintenance things:
 - un-deprecate putmask
 - clean up causes of DType strings 'O4' and 'O8' are deprecated...
 - fix failing einsum and polyfit tests
 - update release notes

 Cheers,
 Ralf


 What do you all think?


 Cheers,
 Ralf



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


This isn't the place for this discussion but we should start talking
about building a *high performance* flat file loading solution with
good column type inference and sensible defaults, etc. It's clear that
loadtable is aiming for highest compatibility-- for example I can read
a 2800x30 file in under 50 ms with the read_table / read_csv functions I
wrote myself recently in Cython (compared with loadtable taking over 1 s as
quoted in the pull request), but I don't handle European decimal
formats and lots of other sources of unruliness. I personally don't
believe in sacrificing an order of magnitude of performance in the 90%
case for the 10% case-- so maybe it makes sense to have two functions
around: a superfast custom CSV reader for well-behaved data, and a
slower, but highly flexible, function like loadtable to fall back on.
I think R has two functions read.csv and read.csv2, where read.csv2 is
capable of dealing with things like European decimal format.
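
One way the two-function split could look from the caller's side (purely
illustrative; np.loadtxt and np.genfromtxt stand in here for whatever the
fast and flexible readers end up being):

import numpy as np

def read_table_auto(path):
    """Sketch of the two-tier approach: try a fast, strict reader first
    and fall back to the slower, more forgiving one for unruly files."""
    try:
        # Fast path: homogeneous, well-behaved numeric data.
        return np.loadtxt(path, delimiter=",")
    except ValueError:
        # Unruly input (missing values, mixed types, odd formats, ...):
        # fall back to the flexible reader.
        return np.genfromtxt(path, delimiter=",", dtype=None,
                             encoding="utf-8")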

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] consensus (was: NA masks in the next numpy release?)

2011-10-29 Thread Wes McKinney
On Fri, Oct 28, 2011 at 9:32 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Fri, Oct 28, 2011 at 6:45 PM, Wes McKinney wesmck...@gmail.com wrote:

 On Fri, Oct 28, 2011 at 7:53 PM, Benjamin Root ben.r...@ou.edu wrote:
 
 
  On Friday, October 28, 2011, Matthew Brett matthew.br...@gmail.com
  wrote:
  Hi,
 
  On Fri, Oct 28, 2011 at 4:21 PM, Ralf Gommers
  ralf.gomm...@googlemail.com wrote:
 
 
  On Sat, Oct 29, 2011 at 12:37 AM, Matthew Brett
  matthew.br...@gmail.com
  wrote:
 
  Hi,
 
  On Fri, Oct 28, 2011 at 3:14 PM, Charles R Harris
  charlesr.har...@gmail.com wrote:
  
  
   On Fri, Oct 28, 2011 at 3:56 PM, Matthew Brett
   matthew.br...@gmail.com
   wrote:
  
   Hi,
  
   On Fri, Oct 28, 2011 at 2:43 PM, Matthew Brett
   matthew.br...@gmail.com
   wrote:
Hi,
   
On Fri, Oct 28, 2011 at 2:41 PM, Charles R Harris
charlesr.har...@gmail.com wrote:
   
   
On Fri, Oct 28, 2011 at 3:16 PM, Nathaniel Smith
n...@pobox.com
wrote:
   
On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant
oliph...@enthought.com
wrote:
 I think Nathaniel and Matthew provided very
 specific feedback that was helpful in understanding other
 perspectives
 of a
 difficult problem.     In particular, I really wanted
 bit-patterns
 implemented.    However, I also understand that Mark did
 quite
 a
 bit
 of
 work
 and altered his original designs quite a bit in response to
 community
 feedback.   I wasn't a major part of the pull request
 discussion,
 nor
 did I
 merge the changes, but I support Charles if he reviewed the
 code
 and
 felt
 like it was the right thing to do.  I likely would have done
 the
 same
 thing
 rather than let Mark Wiebe's work languish.
   
My connectivity is spotty this week, so I'll stay out of the
technical
discussion for now, but I want to share a story.
   
Maybe a year ago now, Jonathan Taylor and I were debating what
the
best API for describing statistical models would be -- whether
we
wanted something like R's formulas (which I supported), or
another
approach based on sympy (his idea). To summarize, I thought
his
API
was confusing, pointlessly complicated, and didn't actually
solve
the
problem; he thought R-style formulas were superficially
simpler
but
hopelessly confused and inconsistent underneath. Now,
obviously,
I
was
right and he was wrong. Well, obvious to me, anyway... ;-) But
it
wasn't like I could just wave a wand and make his arguments go
away,
no I should point out that the implementation hasn't - as far
as
I can
  see - changed the discussion.  The discussion was about the API.
  Implementations are useful for agreed APIs because they can point out
  where the API does not make sense or cannot be implemented.  In this
  case, the API Mark said he was going to implement - he did implement -
  at least as far as I can see.  Again, I'm happy to be corrected.
 
  In saying that we are insisting on our way, you are saying,
  implicitly,
  'I
  am not going to negotiate'.
 
  That is only your interpretation. The observation that Mark
  compromised
  quite a bit while you didn't seems largely correct to me.
 
  The problem here stems from our inability to work towards agreement,
  rather than standing on set positions.  I set out what changes I think
  would make the current implementation OK.  Can we please, please have
  a discussion about those points instead of trying to argue about who
  has given more ground.
 
  That commitment would of course be good. However, even if that were
  possible
  before writing code and everyone agreed that the ideas of you and
  Nathaniel
  should be implemented in full, it's still not clear that either of you
  would
  be willing to write any code. Agreement without code still doesn't
  help
  us
  very much.
 
  I'm going to return to Nathaniel's point - it is a highly valuable
  thing to set ourselves the target of resolving substantial discussions
  by consensus.   The route you are endorsing here is 'implementor
  wins'.   We don't need to do it that way.  We're a mature sensible
  bunch of adults who can talk out the issues until we agree they are
  ready for implementation, and then implement.  That's all Nathaniel is
  saying.  I think he's obviously right, and I'm sad that it isn't as
  clear to y'all as it is to me.
 
  Best,
 
  Matthew
 
 
  Everyone, can we please not do this?! I had enough of adults doing
  finger
  pointing back over the summer during the whole debt ceiling debate.  I
  think
  we can all agree that we are better than the US congress?
 
  Forget about rudeness or decision processes.
 
  I will start by saying that I am willing to separate ignore and absent,
  but
  only on the write side of things.  On read, I want a single way to
  identify
  the missing values

Re: [Numpy-discussion] consensus (was: NA masks in the next numpy release?)

2011-10-28 Thread Wes McKinney
On Fri, Oct 28, 2011 at 7:53 PM, Benjamin Root ben.r...@ou.edu wrote:


 On Friday, October 28, 2011, Matthew Brett matthew.br...@gmail.com wrote:
 Hi,

 On Fri, Oct 28, 2011 at 4:21 PM, Ralf Gommers
 ralf.gomm...@googlemail.com wrote:


 On Sat, Oct 29, 2011 at 12:37 AM, Matthew Brett matthew.br...@gmail.com
 wrote:

 Hi,

 On Fri, Oct 28, 2011 at 3:14 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
 
 
  On Fri, Oct 28, 2011 at 3:56 PM, Matthew Brett
  matthew.br...@gmail.com
  wrote:
 
  Hi,
 
  On Fri, Oct 28, 2011 at 2:43 PM, Matthew Brett
  matthew.br...@gmail.com
  wrote:
   Hi,
  
   On Fri, Oct 28, 2011 at 2:41 PM, Charles R Harris
   charlesr.har...@gmail.com wrote:
  
  
   On Fri, Oct 28, 2011 at 3:16 PM, Nathaniel Smith n...@pobox.com
   wrote:
  
   On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant
   oliph...@enthought.com
   wrote:
I think Nathaniel and Matthew provided very
specific feedback that was helpful in understanding other
perspectives
of a
difficult problem.     In particular, I really wanted
bit-patterns
implemented.    However, I also understand that Mark did quite
a
bit
of
work
and altered his original designs quite a bit in response to
community
feedback.   I wasn't a major part of the pull request
discussion,
nor
did I
merge the changes, but I support Charles if he reviewed the
code
and
felt
like it was the right thing to do.  I likely would have done
the
same
thing
rather than let Mark Wiebe's work languish.
  
   My connectivity is spotty this week, so I'll stay out of the
   technical
   discussion for now, but I want to share a story.
  
   Maybe a year ago now, Jonathan Taylor and I were debating what
   the
   best API for describing statistical models would be -- whether we
   wanted something like R's formulas (which I supported), or
   another
   approach based on sympy (his idea). To summarize, I thought his
   API
   was confusing, pointlessly complicated, and didn't actually solve
   the
   problem; he thought R-style formulas were superficially simpler
   but
   hopelessly confused and inconsistent underneath. Now, obviously,
   I
   was
   right and he was wrong. Well, obvious to me, anyway... ;-) But it
   wasn't like I could just wave a wand and make his arguments go
   away,
   no I should point out that the implementation hasn't - as far as
   I can
 see - changed the discussion.  The discussion was about the API.
 Implementations are useful for agreed APIs because they can point out
 where the API does not make sense or cannot be implemented.  In this
 case, the API Mark said he was going to implement - he did implement -
 at least as far as I can see.  Again, I'm happy to be corrected.

 In saying that we are insisting on our way, you are saying, implicitly,
 'I
 am not going to negotiate'.

 That is only your interpretation. The observation that Mark compromised
 quite a bit while you didn't seems largely correct to me.

 The problem here stems from our inability to work towards agreement,
 rather than standing on set positions.  I set out what changes I think
 would make the current implementation OK.  Can we please, please have
 a discussion about those points instead of trying to argue about who
 has given more ground.

 That commitment would of course be good. However, even if that were
 possible
 before writing code and everyone agreed that the ideas of you and
 Nathaniel
 should be implemented in full, it's still not clear that either of you
 would
 be willing to write any code. Agreement without code still doesn't help
 us
 very much.

 I'm going to return to Nathaniel's point - it is a highly valuable
 thing to set ourselves the target of resolving substantial discussions
 by consensus.   The route you are endorsing here is 'implementor
 wins'.   We don't need to do it that way.  We're a mature sensible
 bunch of adults who can talk out the issues until we agree they are
 ready for implementation, and then implement.  That's all Nathaniel is
 saying.  I think he's obviously right, and I'm sad that it isn't as
 clear to y'all as it is to me.

 Best,

 Matthew


 Everyone, can we please not do this?! I had enough of adults doing finger
 pointing back over the summer during the whole debt ceiling debate.  I think
 we can all agree that we are better than the US congress?

 Forget about rudeness or decision processes.

 I will start by saying that I am willing to separate ignore and absent, but
 only on the write side of things.  On read, I want a single way to identify
 the missing values.  I also want only a single way to perform calculations
 (either skip or propagate).

 An indicator of success would be that people stop using NaNs and magic
 numbers (-, anyone?) and we could even deprecate nansum(), or at least
 strongly suggest in its docs to use NA.

Well, I haven't completely made up my mind yet, will have to do some
more 

Re: [Numpy-discussion] NA masks in the next numpy release?

2011-10-24 Thread Wes McKinney
On Mon, Oct 24, 2011 at 10:54 AM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Mon, Oct 24, 2011 at 8:40 AM, Charles R Harris
 charlesr.har...@gmail.com wrote:


 On Sun, Oct 23, 2011 at 11:23 PM, Wes McKinney wesmck...@gmail.com
 wrote:

 On Sun, Oct 23, 2011 at 8:07 PM, Eric Firing efir...@hawaii.edu wrote:
  On 10/23/2011 12:34 PM, Nathaniel Smith wrote:
 
  like. And in this case I do think we can come up with an API that will
  make everyone happy, but that Mark's current API probably can't be
  incrementally evolved to become that API.)
 
 
  No one could object to coming up with an API that makes everyone happy,
  provided that it actually gets coded up, tested, and is found to be
  fast
  and maintainable.  When you say the API probably can't be evolved, do
  you mean that the underlying implementation also has to be redone?  And
  if so, who will do it, and when?
 
  Eric
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

 I personally am a bit apprehensive as I am worried about the masked
 array abstraction leaking through to users of pandas, something
 which I simply will not accept (why I decided against using numpy.ma
 early on, that + performance problems). Basically if having an
 understanding of masked arrays is a prerequisite for using pandas, the
 whole thing is DOA to me as it undermines the usability arguments I've
 been making about switching to Python (from R) for data analysis and
 statistical computing.

 The missing data functionality looks far more like R than numpy.ma.


 For instance

 In [8]: a = arange(5, maskna=1)

 In [9]: a[2] = np.NA

 In [10]: a.mean()
 Out[10]: NA(dtype='float64')

 In [11]: a.mean(skipna=1)
 Out[11]: 2.0

 In [12]: a = arange(5)

 In [13]: b = a.view(maskna=1)

 In [14]: a.mean()
 Out[14]: 2.0

 In [15]: b[2] = np.NA

 In [16]: b.mean()
 Out[16]: NA(dtype='float64')

 In [17]: b.mean(skipna=1)
 Out[17]: 2.0

 Chuck

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



I don't really agree with you.

some sample R code

 arr <- rnorm(10)
 arr[5:8] <- NA
 arr
 [1]  0.6451460 -1.1285552  0.6869828  0.4018868 NA NA
 [7] NA NA  0.3322803 -1.9201257

In your examples you had to pass maskna=True-- I suppose that my only
recourse would be to make sure that every array inside a DataFrame,
for example, has maskna=True set. I'll have to look in more detail and
see if it's feasible/desirable. There's a memory cost to pay, but you
can't get the functionality for free. I may just end up sticking with
NaN as it's worked pretty well over the last few years-- it's an
impure solution but one with reasonably good performance
characteristics in the places that matter.
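
For reference, here is a minimal sketch (plain NumPy, not pandas internals) of the NaN-based pattern referred to above; the array and slice are just examples:

import numpy as np

arr = np.random.randn(10)
arr[5:8] = np.nan                # mark a few entries as missing, NaN-style

print(arr.mean())                # nan -- NaN propagates, like R's default NA behavior
valid = ~np.isnan(arr)
print(arr[valid].mean())         # skip the NaNs, roughly mean(arr, na.rm=TRUE) in R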
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NA masks in the next numpy release?

2011-10-23 Thread Wes McKinney
On Sun, Oct 23, 2011 at 8:07 PM, Eric Firing efir...@hawaii.edu wrote:
 On 10/23/2011 12:34 PM, Nathaniel Smith wrote:

 like. And in this case I do think we can come up with an API that will
 make everyone happy, but that Mark's current API probably can't be
 incrementally evolved to become that API.)


 No one could object to coming up with an API that makes everyone happy,
 provided that it actually gets coded up, tested, and is found to be fast
 and maintainable.  When you say the API probably can't be evolved, do
 you mean that the underlying implementation also has to be redone?  And
 if so, who will do it, and when?

 Eric
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I personally am a bit apprehensive as I am worried about the masked
array abstraction leaking through to users of pandas, something
which I simply will not accept (why I decided against using numpy.ma
early on, that + performance problems). Basically if having an
understanding of masked arrays is a prerequisite for using pandas, the
whole thing is DOA to me as it undermines the usability arguments I've
been making about switching to Python (from R) for data analysis and
statistical computing.

Performance is also a concern, but based on prior discussions it seems
a great deal can be done there.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Preventing an ndarray subclass from returning new subclass instances for std(), etc

2011-09-18 Thread Wes McKinney
On Sun, Sep 18, 2011 at 4:25 PM, Pierre GM pgmdevl...@gmail.com wrote:

 On Sep 18, 2011, at 21:25 , Stéfan van der Walt wrote:

 On Sun, Sep 18, 2011 at 9:48 AM, Keith Hughitt keith.hugh...@gmail.com 
 wrote:
 Interesting. It works as expected when called as a method:

 In [10]: x = np.ma.array([[1,2,3]])
 In [11]: x.std()
 Out[11]: 0.81649658092772603

 I'm guessing that np.ma.array implements its own std function, but I
 haven't checked.

 It does. np.std tries first to call the .std method of the input argument.
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I thought maybe you can intercept 0-dim objects and return self.item()
in array_finalize, but not dice. This is really weird:

import numpy as np

class example(np.ndarray):

    def __new__(cls, arr):
        return np.array(arr).view(cls)

    def __array_finalize__(self, obj):
        print 'foo'
        if self.ndim == 0:
            return self.item()


In [6]: foo = example(np.arange(10))
foo

In [7]: foo.std()
foo
foo
foo
foo
foo
foo
foo
Out[7]: example(2.8722813232690143)
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bug in numpy std, etc. with other data structures?

2011-09-17 Thread Wes McKinney
On Sat, Sep 17, 2011 at 4:48 PM, Skipper Seabold jsseab...@gmail.com wrote:
 Just ran into this. Any objections for having numpy.std and other
 functions in core/fromnumeric.py call asanyarray before trying to use
 the array's method? Other data structures like pandas and larry define
 their own std method, for instance, and this doesn't allow them to
 pass through. I'm inclined to say that the issue is with numpy, though
 maybe the data structures shouldn't shadow numpy array methods while
 altering the signature. I dunno.

 df = pandas.DataFrame(np.random.random((10,5)))

 np.std(df,axis=0)
 snip
 TypeError: std() got an unexpected keyword argument 'dtype'

 np.std(np.asanyarray(df),axis=0)
 array([ 0.30883352,  0.3133324 ,  0.26517361,  0.26389029,  0.20022444])

 Though I don't think this would work with larry yet.

 Pull request: https://github.com/numpy/numpy/pull/160

 Skipper
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Note I've no real intention of making DataFrame fully ndarray-like--
but it's nice to be able to type:

df.std(axis=0)
df.std(axis=1)
np.sqrt(df)

etc. which works the same as ndarray. I suppose the
__array__/__array_wrap__ interface is there largely as a convenience.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bug in numpy std, etc. with other data structures?

2011-09-17 Thread Wes McKinney
On Sat, Sep 17, 2011 at 8:36 PM,  josef.p...@gmail.com wrote:
 On Sat, Sep 17, 2011 at 5:12 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Sep 17, 2011 at 4:48 PM, Skipper Seabold jsseab...@gmail.com wrote:
 Just ran into this. Any objections for having numpy.std and other
 functions in core/fromnumeric.py call asanyarray before trying to use
 the array's method? Other data structures like pandas and larry define
 their own std method, for instance, and this doesn't allow them to
 pass through. I'm inclined to say that the issue is with numpy, though
 maybe the data structures shouldn't shadow numpy array methods while
 altering the signature. I dunno.

 df = pandas.DataFrame(np.random.random((10,5)))

 np.std(df,axis=0)
 snip
 TypeError: std() got an unexpected keyword argument 'dtype'

 np.std(np.asanyarray(df),axis=0)
 array([ 0.30883352,  0.3133324 ,  0.26517361,  0.26389029,  0.20022444])

 Though I don't think this would work with larry yet.

 Pull request: https://github.com/numpy/numpy/pull/160

 Skipper
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 Note I've no real intention of making DataFrame fully ndarray-like--
 but it's nice to be able to type:

 df.std(axis=0)
 df.std(axis=1)
 np.sqrt(df)

 etc. which works the same as ndarray. I suppose the
 __array__/__array_wrap__ interface is there largely as a convenience.

 I'm a bit worried about the different ddof defaults in cases like
 this. Essentially we will not be able to rely on ddof=0 anymore.
 Different defaults on axis are easy to catch, but having the same
 function call return sometimes ddof=0 and sometimes ddof=1 might make
 for some fun debugging.

 Josef


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Can we lobby for default ddof=1 in NumPy 2.0? Breaking with a
convention like this doesn't make much sense to me.
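
For concreteness, a minimal sketch of the discrepancy (NumPy defaults to the population estimator, ddof=0, while R and pandas default to the sample estimator, ddof=1; the numbers are just an example):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.std(x))                 # 1.118... (ddof=0, divide by N)
print(np.std(x, ddof=1))         # 1.291... (ddof=1, divide by N - 1; the R/pandas default)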
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bug in numpy std, etc. with other data structures?

2011-09-17 Thread Wes McKinney
On Sat, Sep 17, 2011 at 10:50 PM, Bruce Southey bsout...@gmail.com wrote:
 On Sat, Sep 17, 2011 at 4:12 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Sep 17, 2011 at 4:48 PM, Skipper Seabold jsseab...@gmail.com wrote:
 Just ran into this. Any objections for having numpy.std and other
 functions in core/fromnumeric.py call asanyarray before trying to use
 the array's method? Other data structures like pandas and larry define
 their own std method, for instance, and this doesn't allow them to
 pass through. I'm inclined to say that the issue is with numpy, though
 maybe the data structures shouldn't shadow numpy array methods while
 altering the signature. I dunno.

 df = pandas.DataFrame(np.random.random((10,5)))

 np.std(df,axis=0)
 snip
 TypeError: std() got an unexpected keyword argument 'dtype'

 np.std(np.asanyarray(df),axis=0)
 array([ 0.30883352,  0.3133324 ,  0.26517361,  0.26389029,  0.20022444])

 Though I don't think this would work with larry yet.

 Pull request: https://github.com/numpy/numpy/pull/160

 Skipper
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 numpy.std() does accept array-like, which obviously means that
 np.std([1,2,3,5]) works, making the asanyarray call a total waste of cpu
 time. Clearly pandas is not array-like input (as Wes points out below)
 so an error is correct. Doing this type of 'fix' will have unintended
 consequences when other non-numpy objects are incorrectly passed to
 numpy functions. Rather you should determine why 'array-like' failed
 here IF you think a pandas object is either array-like or a numpy
 object.

No, the reason it is failing is because np.std takes the
EAFP/duck-typing approach:

try:
    std = a.std
except AttributeError:
    return _wrapit(a, 'std', axis, dtype, out, ddof)
return std(axis, dtype, out, ddof)

Indeed DataFrame has an std method but it doesn't have the same
function signature as ndarray.std.
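
A small self-contained illustration of that mismatch; the FrameLike class below is a toy stand-in for DataFrame (an assumption for illustration, not pandas code):

import numpy as np

class FrameLike(object):
    """Toy object whose .std signature differs from ndarray.std."""
    def __init__(self, values):
        self.values = values

    def std(self, axis=0):                        # no dtype/out/ddof parameters
        return self.values.std(axis=axis)

    def __array__(self, dtype=None, copy=None):   # lets np.asanyarray convert it
        return np.asarray(self.values, dtype=dtype)

obj = FrameLike(np.random.random((10, 5)))

try:
    np.std(obj, axis=0)                       # np.std finds .std and passes extra args
except TypeError as exc:
    print(exc)                                # same class of failure as in the report

print(np.std(np.asanyarray(obj), axis=0))     # converting to ndarray first works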



 Note I've no real intention of making DataFrame fully ndarray-like--
 but it's nice to be able to type:

 df.std(axis=0)
 df.std(axis=1)
 np.sqrt(df)

 etc. which works the same as ndarray. I suppose the
 __array__/__array_wrap__ interface is there largely as a convenience.
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 I consider that the only way for pandas or any other numpy-derivative to
 overcome this is to get into numpy/scipy. After all Travis opened the
 discussion for Numpy 3 which you could still address.

 Bruce
 PS Good luck on the ddof thing given the past discussions on it!
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NA masks for NumPy are ready to test

2011-08-24 Thread Wes McKinney
On Wed, Aug 24, 2011 at 8:19 PM, Mark Wiebe mwwi...@gmail.com wrote:
 On Fri, Aug 19, 2011 at 11:37 AM, Bruce Southey bsout...@gmail.com wrote:

 Hi,
 snip

 2) Can the 'skipna' flag be added to the methods?
  a.sum(skipna=True)
 Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
 TypeError: 'skipna' is an invalid keyword argument for this function
  np.sum(a,skipna=True)
 nan

 I've added this now, as well. I think that finishes up the changes you
 suggested in this email which felt right to me.
 Cheers,
 Mark


 snip

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



Sorry I haven't had a chance to have a tinker yet. My initial observations:

- I haven't decided whether this is a problem:

In [50]: arr = np.arange(100)

In [51]: arr[5:10] = np.NA
---
ValueError                                Traceback (most recent call last)
/home/wesm/<ipython-input-51-7e07a94409e9> in <module>()
----> 1 arr[5:10] = np.NA

ValueError: Cannot set NumPy array values to NA values without first
enabling NA support in the array

I assume when you flip the maskna switch that a mask is created?

- Performance with skipna is a bit disappointing:

In [52]: arr = np.random.randn(1e6)
In [54]: arr.flags.maskna = True
In [56]: arr[::2] = np.NA
In [58]: timeit arr.sum(skipna=True)
100 loops, best of 3: 7.31 ms per loop

this goes down to 2.12 ms if there are no NAs present.

but:

In [59]: import bottleneck as bn
In [60]: arr = np.random.randn(1e6)
In [61]: arr[::2] = np.nan
In [62]: timeit bn.nansum(arr)
1000 loops, best of 3: 1.17 ms per loop

do you have a sense if this gap can be closed? I assume you've been,
as you should, focused on a correct implementation as opposed to
squeezing out performance.

best,
Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Questionable reduceat behavior

2011-08-16 Thread Wes McKinney
On Sun, Aug 14, 2011 at 11:58 AM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Aug 13, 2011 at 8:06 PM, Mark Wiebe mwwi...@gmail.com wrote:
 Looks like this is the second-oldest open bug in the bug tracker.
 http://projects.scipy.org/numpy/ticket/236
 For what it's worth, I'm in favour of changing this behavior to be more
 consistent as proposed in that ticket.
 -Mark

 On Thu, Aug 11, 2011 at 11:25 AM, Wes McKinney wesmck...@gmail.com wrote:

 I'm a little perplexed why reduceat was made to behave like this:

 In [26]: arr = np.ones((10, 4), dtype=bool)

 In [27]: arr
 Out[27]:
 array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)


 In [30]: np.add.reduceat(arr, [0, 3, 3, 7, 9], axis=0)
 Out[30]:
 array([[3, 3, 3, 3],
       [1, 1, 1, 1],
       [4, 4, 4, 4],
       [2, 2, 2, 2],
       [1, 1, 1, 1]])

 this does not seem intuitively correct. Since we have:

 In [33]: arr[3:3].sum(0)
 Out[33]: array([0, 0, 0, 0])

 I would expect

 array([[3, 3, 3, 3],
       [0, 0, 0, 0],
       [4, 4, 4, 4],
       [2, 2, 2, 2],
       [1, 1, 1, 1]])

 Obviously I can RTFM and see why it does this (if ``indices[i] >=
 indices[i + 1]``, the i-th generalized row is simply
 ``a[indices[i]]``), but it doesn't make much sense to me, and I need to
 work around it. Suggestions?
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 Well, I certainly hope it doesn't get forgotten about for another 5
 years. I think having more consistent behavior would be better rather
 than conforming to a seemingly arbitrary decision made ages ago in
 Numeric.

 - Wes


just a manual hack for now where I needed it...

https://github.com/wesm/pandas/blob/master/pandas/core/frame.py#L2155
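
For anyone hitting the same thing, here is a minimal workaround sketch (an illustrative approach, not necessarily what the pandas code linked above does): build group sums from a cumulative sum, so that empty slices come out as zero rather than as a[indices[i]].

import numpy as np

arr = np.ones((10, 4), dtype=bool)
indices = np.array([0, 3, 3, 7, 9])

# csum[k] holds the column sums of the first k rows, so an empty slice gives 0.
csum = np.zeros((arr.shape[0] + 1, arr.shape[1]), dtype=int)
csum[1:] = arr.cumsum(axis=0)
bounds = np.append(indices, arr.shape[0])
result = csum[bounds[1:]] - csum[bounds[:-1]]
print(result)    # rows of 3s, 0s, 4s, 2s, 1s -- the intuitively expected output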
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Questionable reduceat behavior

2011-08-14 Thread Wes McKinney
On Sat, Aug 13, 2011 at 8:06 PM, Mark Wiebe mwwi...@gmail.com wrote:
 Looks like this is the second-oldest open bug in the bug tracker.
 http://projects.scipy.org/numpy/ticket/236
 For what it's worth, I'm in favour of changing this behavior to be more
 consistent as proposed in that ticket.
 -Mark

 On Thu, Aug 11, 2011 at 11:25 AM, Wes McKinney wesmck...@gmail.com wrote:

 I'm a little perplexed why reduceat was made to behave like this:

 In [26]: arr = np.ones((10, 4), dtype=bool)

 In [27]: arr
 Out[27]:
 array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)


 In [30]: np.add.reduceat(arr, [0, 3, 3, 7, 9], axis=0)
 Out[30]:
 array([[3, 3, 3, 3],
       [1, 1, 1, 1],
       [4, 4, 4, 4],
       [2, 2, 2, 2],
       [1, 1, 1, 1]])

 this does not seem intuitively correct. Since we have:

 In [33]: arr[3:3].sum(0)
 Out[33]: array([0, 0, 0, 0])

 I would expect

 array([[3, 3, 3, 3],
       [0, 0, 0, 0],
       [4, 4, 4, 4],
       [2, 2, 2, 2],
       [1, 1, 1, 1]])

 Obviously I can RTFM and see why it does this (if ``indices[i] >=
 indices[i + 1]``, the i-th generalized row is simply
 ``a[indices[i]]``), but it doesn't make much sense to me, and I need to
 work around it. Suggestions?
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



Well, I certainly hope it doesn't get forgotten about for another 5
years. I think having more consistent behavior would be better rather
than conforming to a seemingly arbitrary decision made ages ago in
Numeric.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Questionable reduceat behavior

2011-08-11 Thread Wes McKinney
I'm a little perplexed why reduceat was made to behave like this:

In [26]: arr = np.ones((10, 4), dtype=bool)

In [27]: arr
Out[27]:
array([[ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True]], dtype=bool)


In [30]: np.add.reduceat(arr, [0, 3, 3, 7, 9], axis=0)
Out[30]:
array([[3, 3, 3, 3],
   [1, 1, 1, 1],
   [4, 4, 4, 4],
   [2, 2, 2, 2],
   [1, 1, 1, 1]])

this does not seem intuitively correct. Since we have:

In [33]: arr[3:3].sum(0)
Out[33]: array([0, 0, 0, 0])

I would expect

array([[3, 3, 3, 3],
   [0, 0, 0, 0],
   [4, 4, 4, 4],
   [2, 2, 2, 2],
   [1, 1, 1, 1]])

Obviously I can RTFM and see why it does this (if ``indices[i] >=
indices[i + 1]``, the i-th generalized row is simply
``a[indices[i]]``), but it doesn't make much sense to me, and I need to
work around it. Suggestions?
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Warning: invalid value encountered in divide

2011-08-08 Thread Wes McKinney
On Mon, Aug 8, 2011 at 7:01 PM, Neal Becker ndbeck...@gmail.com wrote:
 Warning: invalid value encountered in divide

 No traceback.  How can I get more info on this?  Can this warning be converted
 to an exception so I can get a trace?

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Try calling np.seterr(divide='raise') or np.seterr(all='raise')
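
A quick sketch of what that looks like in practice (0/0 falls under the 'invalid' category, so all='raise' is the simplest catch-all):

import numpy as np

np.seterr(all='raise')           # turn floating-point warnings into FloatingPointError

a = np.zeros(3)
b = np.zeros(3)
try:
    a / b                        # 0/0: 'invalid value encountered in divide'
except FloatingPointError as exc:
    print(exc)                   # now there is a traceback pointing at the culprit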
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] inconsistent semantics for double-slicing

2011-07-27 Thread Wes McKinney
On Wed, Jul 27, 2011 at 5:36 PM, Alex Flint alex.fl...@gmail.com wrote:
 When applying two different slicing operations in succession (e.g. select a
 sub-range, then select using a binary mask) it seems that numpy arrays can
 be inconsistent with respect to assignment:
 For example, in this case an array is modified:
 In [6]: A = np.arange(5)
 In [8]: A[:][A>2] = 0
 In [10]: A
 Out[10]: array([0, 1, 2, 0, 0])
 Whereas here the original array remains unchanged
 In [11]: A = np.arange(5)
 In [12]: A[[0,1,2,3,4]][A>2] = 0
 In [13]: A
 Out[13]: array([0, 1, 2, 3, 4])
 This arose in a less contrived situation in which I was trying to copy a
 small image into a large image, modulo a mask on the small image.
 Is this meant to be like this?
 Alex

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



When you do this:

A[[0,1,2,3,4]][A>2] = 0

what is happening is:

A.__getitem__([0,1,2,3,4]).__setitem__(A > 2, 0)

Whenever you do getitem with fancy indexing (i.e. A[[0,1,2,3,4]]),
it produces a new object. In the first case, slicing A[:] produces a
view on the same data.
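
A minimal sketch of how to get the in-place behavior when the first selection has to be a fancy (integer-list) index; the index list here is just an example:

import numpy as np

A = np.arange(5)
idx = [0, 1, 2, 3, 4]            # A[idx] would be a copy, so assigning into it is lost

# Combine both selections into a single boolean mask and index A exactly once,
# so the assignment happens on A itself rather than on a temporary copy.
mask = np.zeros(A.shape, dtype=bool)
mask[idx] = True
A[mask & (A > 2)] = 0
print(A)                         # [0 1 2 0 0]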

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] named ndarray axes

2011-07-13 Thread Wes McKinney
On Wed, Jul 13, 2011 at 7:15 PM, Craig Yoshioka crai...@me.com wrote:
 Yup exactly.  To enable this sort of tracking I needed to explicitly 
 reverse-engineer the effects of indexing on axes.  I figure overriding 
 indexing catches most cases that modify axes, but other holes need to be 
 plugged as well... ie: transpose, swapaxes.  This probably means most C
 functions that change array axes (np.mean(axis=), etc.) need to be covered as
 well. That sucks.

 BTW, it sounds like you're trying to track very similar data.  I am trying to 
 load structural biology data formats, and I try to preserve as much of the 
 metadata as possible, ie: I am encoding unit cell length/angle information as 
 vectors, etc.

 Here is my implementation:

    def __getitem__(self,idxs):
        idxs,trans,keep = idx_axes(self,idxs)
        array = self.view(np.ndarray).transpose(trans)
        array = array[idxs]
        if isinstance(array,ndarray):
            array = array.view(self.__class__)
            array.axes = self.axes.transpose(keep)
        return array


 def idx_axes(array,idxs):

    # explicitly expand ellipsis
    expanded_idxs = idx_expanded(array.ndim,idxs)

    # determine how the axes will be rearranged as a result of axes-based 
 indexing
    # and the creation of newaxes
    remapped_axes = idx_axes_remapped(array.ndim,array.axes,expanded_idxs)

    # determine numpy compatible transpose, before newaxes are created
    transposed_axes = idx_axes_transposed(remapped_axes)

    # determine numpy compatible indexes with axes-based indexing removed
    filtered_idxs = idx_filtered(expanded_idxs)

    # determine which axes will be kept after numpy indexing
    kept_axes = idx_axes_kept(remapped_axes,filtered_idxs)

    return filtered_idxs,transposed_axes,kept_axes


 def idx_expanded(ndim,idxs):
    '''
    explicitly expands ellipsis taking into account newaxes
    '''

    if not isinstance(idxs,tuple):
        return idx_expanded(ndim,(idxs,))

    # how many dimensions we will end up having
    ndim = ndim + idxs.count(newaxis)

    filler = slice(None)

    def skip_ellipsis(idxs):
        return tuple([filler if isinstance(x,type(Ellipsis)) else x for x in 
 idxs])

    def fill_ellipsis(ndim,l,r):
        return (filler,)*(ndim-len(l)-len(r))

    # expand first ellipsis, treat all other ellipsis as slices
    if Ellipsis in idxs:
        idx = idxs.index(Ellipsis)
        llist = idxs[:idx]
        rlist = skip_ellipsis(idxs[idx+1:])
        cfill = fill_ellipsis(ndim,llist,rlist)
        idxs = llist + cfill + rlist

    return idxs


 def idx_filtered(idxs):
    '''
    replace indexes on axes with normal numpy indexes
    '''
    def axisAsIdx(idx):
        if isinstance(idx,str):
            return slice(None)
        elif isinstance(idx,slice):
            if isinstance(idx.start,str):
                return idx.stop
        return idx

    return tuple([axisAsIdx(x) for x in idxs])


 def idx_axes_remapped(ndim,axes,idxs):
    '''
    if given a set of array indexes that contain labeled axes,
    return a tuple that maps axes in the source array to the axes
    as they will end up in the destination array.  Must take into
    account the spaces created by newaxis indexes.
    '''

    # how many dims are we expecting?
    ndim = ndim + idxs.count(newaxis)

    # new unique object for marking unassigned locations in mapping
    unassigned = object()

    # by default all locations are unsassigned
    mapping = [unassigned] * ndim

    # find labels in indexes and set the dims for those locations
    for dim,label in enumerate(idxs):
        if label == newaxis:
            mapping[dim] = label
        elif isinstance(label,str):
            mapping[dim] = axes.dimForLabel(label)
        elif isinstance(label,slice):
            if isinstance(label.start,str):
                mapping[dim] = axes.dimForLabel(label.start)

    # find unassigned dims, in order
    unmapped = [d for d in range(ndim) if d not in set(mapping)]

    # fill in remaining unassigned locations with dims
    for dst,src in enumerate(mapping):
        if src == unassigned:
            mapping[dst] = unmapped.pop(0)

    return tuple(mapping)


 def idx_axes_transposed(mapping):
    '''
    stripping out newaxes in mapping yields a tuple compatible with transpose
    '''
    return tuple([x for x in mapping if x != newaxis])


 def idx_axes_kept(mapping,idxs):
    '''
    remove axes from mapping that will not survive the indexing (ie: ints)
    '''
    kept = []
    first_list = True
    for dst,src in enumerate(mapping):
         if dst < len(idxs):
            idx = idxs[dst]
            if isinstance(idx,int):
                continue
            elif isinstance(idx,list):
                if not first_list:
                    continue
                first_list = False
        kept += [src]
    return tuple(kept)




 On Jul 13, 2011, at 3:13 PM, Sam Quinan wrote:

 I'm currently working on interfacing ndarrays with a 

Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray

2011-06-25 Thread Wes McKinney
On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney wesmck...@gmail.com wrote:

 On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith n...@pobox.com wrote:
  On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root ben.r...@ou.edu wrote:
  On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith n...@pobox.com wrote:
  This is a situation where I would just... use an array and a mask,
  rather than a masked array. Then lots of things -- changing fill
  values, temporarily masking/unmasking things, etc. -- come from free,
  just from knowing how arrays and boolean indexing work?
 
  With a masked array, it is for free.  Why re-invent the wheel?  It
  has
  already been done for me.
 
  But it's not for free at all. It's an additional concept that has to
  be maintained, documented, and learned (with the last cost, which is
  multiplied by the number of users, being by far the greatest). It's
  not reinventing the wheel, it's saying hey, I have wheels and axles,
  but what I really need the library to provide is a wheel+axle
  assembly!

 You're communicating my argument better than I am.

  Do we really get much advantage by building all these complex
  operations in? I worry that we're trying to anticipate and write code
  for every situation that users find themselves in, instead of just
  giving them some simple, orthogonal tools.
 
 
  This is the danger, and which is why I advocate retaining the
  MaskedArray
  type that would provide the high-level intelligent operations,
  meanwhile
  having in the core the basic data structures for  pairing a mask with
  an
  array, and to recognize a special np.NA value that would act upon the
  mask
  rather than the underlying data.  Users would get very basic
  functionality,
  while the MaskedArray would continue to provide the interface that we
  are
  used to.
 
  The interface as described is quite different... in particular, all
  aggregate operations would change their behavior.
 
  As a corollary, I worry that learning and keeping track of how masked
  arrays work is more hassle than just ignoring them and writing the
  necessary code by hand as needed. Certainly I can imagine that *if the
  mask is a property of the data* then it's useful to have tools to keep
  it aligned with the data through indexing and such. But some of these
  other things are quicker to reimplement than to look up the docs for,
  and the reimplementation is easier to read, at least for me...
 
  What you are advocating is similar to the tried-n-true coding
  practice of
  Matlab users of using NaNs.  You will hear from Matlab programmers
  about how
  it is the greatest idea since sliced bread (and I was one of them).
  Then I
  was introduced to Numpy, and while I do sometimes still do the NaN
  approach, I realized that the masked array is a better way.
 
  Hey, no need to go around calling people Matlab programmers, you might
  hurt someone's feelings.
 
  But seriously, my argument is that every abstraction and new concept
  has a cost, and I'm dubious that the full masked array abstraction
  carries its weight and justifies this cost, because it's highly
  redundant with existing abstractions. That has nothing to do with how
  tried-and-true anything is.

 +1. I think I will personally only be happy if masked array can be
 implemented while incurring near-zero cost from the end user
 perspective. If what we end up with is a faster implementation of
 numpy.ma in C I'm probably going to keep on using NaN... That's why
 I'm entirely insistent that whatever design be dogfooded on non-expert
 users. If it's very much harder / trickier / nuanced than R, you will
 have failed.


 This sounds unduly pessimistic to me. It's one thing to suggest different
 approaches, another to cry doom and threaten to go eat worms. And all before
 the code is written, benchmarks run, or trial made of the usefulness of the
 approach. Let us see how things look as they get worked out. Mark has a good
 track record for innovative tools and I'm rather curious myself to see what
 the result is.

 Chuck


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



I hope you're right. So far it seems that anyone who has spent real
time with R (e.g. myself, Nathaniel) has expressed serious concerns
about the masked approach. And we got into this discussion at the Data
Array summit in Austin last month because we're trying to make Python
more competitive with R vis-à-vis statistical and financial applications.
I'm just trying to be (R)ealistic =P Remember that I very earnestly am
doing everything I can these days to make scientific Python more
successful in finance and statistics. One big difference with R's
approach is that we care more about performance than the R community
does. So maybe having special NA values will be prohibitive

Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Wes McKinney
On Sat, Jun 25, 2011 at 1:05 PM, Nathaniel Smith n...@pobox.com wrote:
 On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com 
 wrote:
 So far I see the difference between 1) and 2) being that you cannot
 unmask.  So, if you didn't even know you could unmask data, then it
 would not matter that 1) was being implemented by masks?

 I guess that is a difference, but I'm trying to get at something more
 fundamental -- not just what operations are allowed, but what
 operations people *expect* to be allowed. It seems like some of us
 have been talking past each other a lot, where someone says but
 changing masks is the single most important feature! and then someone
 else says what are you talking about that doesn't even make sense.

 To clarify, you're proposing for:

 a = np.sum(np.array([np.NA, np.NA]))

 1) - np.NA
 2) - 0.0

 Yes -- and in R you actually do get NA, while in numpy.ma you
 actually do get 0. I don't think this is a coincidence; I think it's
 because they're designed as coherent systems that are trying to solve
 different problems. (Well, numpy.ma's hardmask idea seems inspired
 by the missing-data concept rather than the temporary-mask concept,
 but aside from that it seems pretty consistent in implementing option
 2.)

Agree. My basic observation about numpy.ma is that it's a finely
crafted solution for a different set of problems than the ones I have.
I just don't want the same thing to happen here so I'm stuck writing
code (like I am now) that looks like

mask = y.mask
the_sum = y.sum(axis)
the_count = mask.sum(axis)
the_sum[the_count == 0] = nan

 Here's another possible difference -- in (1), intuitively, missingness
 is a property of the data, so the logical place to put information
 about whether you can expect missing values is in the dtype, and to
 enable missing values you need to make a new array with a new dtype.
 (If we use a mask-based implementation, then
 np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
 to skip making a copy of the data -- I'm talking ONLY about the
 interface here, not whether missing data has a different storage
 format from non-missing data.)

 In (2), the whole point is to use different masks with the same data,
 so I'd argue masking should be a property of the array object rather
 than the dtype, and the interface should logically allow masks to be
 created, modified, and destroyed in place.

 They're both internally consistent, but I think we might have to make
 a decision and stick to it.

 I agree it's good to separate the API from the implementation.   I
 think the implementation is also important because I care about memory
 and possibly speed.  But, that is a separate problem from the API...

 Yes, absolutely memory and speed are important. But a really fast
 solution to the wrong problem isn't so useful either :-).

 -- Nathaniel
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Concepts for masked/missing data

2011-06-25 Thread Wes McKinney
On Sat, Jun 25, 2011 at 3:51 PM, Nathaniel Smith n...@pobox.com wrote:
 On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root ben.r...@ou.edu wrote:
 On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith n...@pobox.com wrote:
 I guess that is a difference, but I'm trying to get at something more
 fundamental -- not just what operations are allowed, but what
 operations people *expect* to be allowed.

 That is quite a trickier problem.

 It can be. I think of it as the difference between design and coding.
 They overlap less than one might expect...

 Here's another possible difference -- in (1), intuitively, missingness
 is a property of the data, so the logical place to put information
 about whether you can expect missing values is in the dtype, and to
 enable missing values you need to make a new array with a new dtype.
 (If we use a mask-based implementation, then
 np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
 to skip making a copy of the data -- I'm talking ONLY about the
 interface here, not whether missing data has a different storage
 format from non-missing data.)

 In (2), the whole point is to use different masks with the same data,
 so I'd argue masking should be a property of the array object rather
 than the dtype, and the interface should logically allow masks to be
 created, modified, and destroyed in place.


 I can agree with this distinction.  However, if missingness is an
 intrinsic property of the data, then shouldn't users be implementing their
 own dtype tailored to the data they are using?  In other words, how far does
 the core of NumPy need to go to address this issue?  And how far would be
 too much?

 Yes, that's exactly my question: whether our goal is to implement
 missingness in numpy or not!


 They're both internally consistent, but I think we might have to make
 a decision and stick to it.


 Of course.  I think that Mark is having a very inspired idea of giving the R
 audience what they want (np.NA), while simultaneously making the use of
 masked arrays even easier (which I can certainly appreciate).

 I don't know. I think we could build a really top-notch implementation
 of missingness. I also think we could build a really top-notch
 implementation of masking. But my suggestions for how to improve the
 current design are totally different depending on which of those is
 the goal, and neither the R audience (like me) nor the masked array
 audience (like you) seems really happy with the current design. And I
 don't know what the goal is -- maybe it's something else and the
 current design hits it perfectly? Maybe we want a top-notch
 implementation of *both* missingness and masking, and those should be
 two different things that can be combined, so that some of the
 unmasked values inside a masked array can be NA? I don't know.

 I will put out a little disclaimer.  I once had to use S+ for a class.  To
 be honest, it was the worst programming experience in my life.  This
 experience may be coloring my perception of R's approach to handling missing
 data.

 There's a lot of things that R does wrong (not their fault; language
 design is an extremely difficult and specialized skill, that
 statisticians are not exactly trained in), but it did make a few
 excellent choices at the beginning. One was to steal the execution
 model from Scheme, which, uh, isn't really relevant here. The other
 was to steal the basic data types and standard library that the Bell
 Labs statisticians had pounded into shape over many years. I use
 Python now because using R for everything would drive me crazy, but
 despite its many flaws, it still does some things so well that it's
 become *the* language used for basically all statistical research. I'm
 only talking about stealing those things :-).

 -- Nathaniel
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


+1. Everyone knows R ain't perfect. I think it's an atrociously bad
programming language but it can be unbelievably good at statistics, as
evidenced by its success. Brings to mind Andy Gelman's blog last fall:

http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html

As someone in a statistics department I've frequently been
disheartened when I see how easy many statistical things are in R and
how much more difficult they are in Python. This is partially the
result of poor interfaces for statistical modeling, partially due to
data structures (e.g. the integrated-ness of data.frame throughout R)
and things like handling of missing data of which there's currently no
equivalent.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray

2011-06-24 Thread Wes McKinney
On Fri, Jun 24, 2011 at 12:33 PM, Mark Wiebe mwwi...@gmail.com wrote:
 On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith n...@pobox.com wrote:

 On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe mwwi...@gmail.com wrote:
  On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith n...@pobox.com wrote:
  It's should also be possible to accomplish a general solution at the
  dtype level. We could have a 'dtype factory' used like:
   np.zeros(10, dtype=np.maybe(float))
  where np.maybe(x) returns a new dtype whose storage size is x.itemsize
  + 1, where the extra byte is used to store missingness information.
  (There might be some annoying alignment issues to deal with.) Then for
  each ufunc we define a handler for the maybe dtype (or add a
  special-case to the ufunc dispatch machinery) that checks the
  missingness value and then dispatches to the ordinary ufunc handler
  for the wrapped dtype.
 
  The 'dtype factory' idea builds on the way I've structured datetime as a
  parameterized type, but the thing that kills it for me is the alignment
  problems of 'x.itemsize + 1'. Having the mask in a separate memory block
  is
  a lot better than having to store 16 bytes for an 8-byte int to preserve
  the
  alignment.

 Hmm. I'm not convinced that this is the best approach either, but let
 me play devil's advocate.

 The disadvantage of this approach is that masked arrays would
 effectively have a 100% memory overhead over regular arrays, as
 opposed to the shadow mask approach where the memory overhead is
 12.5%--100% depending on the size of objects being stored. Probably
 the most common case is arrays of doubles, in which case it's 100%
 versus 12.5%. So that sucks.

 But on the other hand, we gain:
  -- simpler implementation: no need to be checking and tracking the
 mask buffer everywhere. The needed infrastructure is already built in.

 I don't believe this is true. The dtype mechanism would need a lot of work
 to build that needed infrastructure first. The analysis I've done so far
 indicates the masked approach will give a simpler/cleaner implementation.


  -- simpler conceptually: we already have the dtype concept, it's a
 very powerful and we use it for all sorts of things; using it here too
 plays to our strengths. We already know what a numpy scalar is and how
 it works. Everyone already understands how assigning a value to an
 element of an array works, how it interacts with broadcasting, etc.,
 etc., and in this model, that's all a missing value is -- just another
 value.

 From Python, this aspect of things would be virtually identical between the
 two mechanisms. The dtype approach would require more coding and overhead
 where you have to create copies of your data to convert it into the
 parameterized NA[int32] dtype, versus with the masked approach where you
 say x.flags.hasmask = True or something like that without copying the data.


  -- it composes better with existing functionality: for example,
 someone mentioned the distinction between a missing field inside a
 record versus a missing record. In this model, that would just be the
 difference between dtype([(x, maybe(float))]) and maybe(dtype([(x,
 float)])).

 Indeed, the difference between an NA[:x:f4, :y:f4] versus :x:NA[f4],
 :y:NA[f4] can't be expressed the way I've designed the mask functionality.
 (Note, this struct dtype string representation isn't actually supported in
 NumPy.)

 Optimization is important and all, but so is simplicity and
 robustness. That's why we're using Python in the first place :-).

 If we think that the memory overhead for floating point types is too
 high, it would be easy to add a special case where maybe(float) used a
 distinguished NaN instead of a separate boolean. The extra complexity
 would be isolated to the 'maybe' dtype's inner loop functions, and
 transparent to the Python level. (Implementing a similar optimization
 for the masking approach would be really nasty.) This would change the
 overhead comparison to 0% versus 12.5% in favor of the dtype approach.

 Yeah, there would be no such optimization for the masked approach. If
 someone really wants this, they are not precluded from also implementing
 their own nafloat dtype which operates independently of the masking
 mechanism.
 -Mark


 -- Nathaniel
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



I don't have enough time to engage in this discussion as I'd like but
I'll give my input.

I've spent a very significant amount of time in pandas trying to craft a
sensible and performant missing-data-handling solution, given existing tools,
which does not get in the user's way and which also works for
non-floating point data. The result works but isn't completely 100%
satisfactory, and I 

Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray

2011-06-24 Thread Wes McKinney
On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett matthew.br...@gmail.com
 wrote:

 Hi,

 On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root ben.r...@ou.edu wrote:
 ...
  Again, there are pros and cons either way and I see them very orthogonal
  and
  complementary.

 That may be true, but I imagine only one of them will be implemented.

 @Mark - I don't have a clear idea whether you consider the nafloat64
 option to be still in play as the first thing to be implemented
 (before array.mask).   If it is, what kind of thing would persuade you
 either way?


 Mark can speak for himself,  but I think things are tending towards masks.
 They have the advantage of one implementation for all data types, current
 and future, and they are more flexible since the masked data can be actual
 valid data that you just choose to ignore for experimental  reasons.

 What might be helpful is a routine to import/export R files, but that
 shouldn't be too difficult to implement.

 Chuck


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



Perhaps we should make a wiki page someplace summarizing pros and cons
of the various implementation approaches? I worry very seriously about
adding API functions relating to masks rather than having special NA
values which propagate in algorithms. The question is: will Joe Blow,
former R user, have to understand what the mask is and how to work with
it? If the answer is yes, we have a problem. If it can be completely
hidden as an implementation detail, that's great. In R NAs are just
sort of inherent-- they propagate; you deal with them when you have to
via the na.rm flag in functions or is.na.

The other problem I can think of with masks is the extra memory
footprint, though maybe this is no cause for concern.
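
For a rough sense of those numbers, a back-of-the-envelope sketch (illustrative only, not either proposed implementation): a separate boolean mask adds one byte per element, i.e. 12.5% on top of float64, while folding an NA flag into the element itself costs much more once alignment padding is taken into account.

import numpy as np

values = np.zeros(1000000, dtype=np.float64)    # 8 bytes per element
mask = np.zeros(1000000, dtype=np.bool_)        # +1 byte per element (~12.5%)

# Hypothetical 'value plus flag' element layout, for comparison:
packed = np.dtype([('value', np.float64), ('isna', np.bool_)])
aligned = np.dtype([('value', np.float64), ('isna', np.bool_)], align=True)
print(values.itemsize, mask.itemsize, packed.itemsize, aligned.itemsize)    # 8 1 9 16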

-W
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray

2011-06-24 Thread Wes McKinney
On Fri, Jun 24, 2011 at 8:02 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Fri, Jun 24, 2011 at 5:22 PM, Wes McKinney wesmck...@gmail.com wrote:

 On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
 
 
  On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett matthew.br...@gmail.com
  wrote:
 
  Hi,
 
  On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root ben.r...@ou.edu
  wrote:
  ...
   Again, there are pros and cons either way and I see them very
   orthogonal
   and
   complementary.
 
  That may be true, but I imagine only one of them will be implemented.
 
  @Mark - I don't have a clear idea whether you consider the nafloat64
  option to be still in play as the first thing to be implemented
  (before array.mask).   If it is, what kind of thing would persuade you
  either way?
 
 
  Mark can speak for himself,  but I think things are tending towards
  masks.
  They have the advantage of one implementation for all data types,
  current
  and future, and they are more flexible since the masked data can be
  actual
  valid data that you just choose to ignore for experimental  reasons.
 
  What might be helpful is a routine to import/export R files, but that
   shouldn't be too difficult to implement.
 
  Chuck
 
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 

 Perhaps we should make a wiki page someplace summarizing pros and cons
 of the various implementation approaches? I worry very seriously about
 adding API functions relating to masks rather than having special NA
 values which propagate in algorithms. The question is: will Joe Blow,
 former R user, have to understand what the mask is and how to work with
 it? If the answer is yes, we have a problem. If it can be completely
 hidden as an implementation detail, that's great. In R NAs are just
 sort of inherent-- they propagate; you deal with them when you have to
 via the na.rm flag in functions or is.na.


 Well, I think both of those can be pretty transparent. Could you illustrate
 some typical R usage, to wit.

 1) setting a value to na
 2) checking a value for na

 Other things are problematic, like checking for integer overflow. For safety
 that would be desirable, for speed not. I think that is a separate question
 however. In any case, if we do check such things we should be able to set
 the corresponding mask value in the loop, and I suppose that is the sort of
 thing you want.

 Chuck


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



I think anyone making decisions about this needs to have a pretty good
understanding of what R does. So here are some examples, but you guys
really need to spend some time with R if you have not already.

arr <- rnorm(20)
arr
 [1]  1.341960278  0.757033314 -0.910468762 -0.475811935 -0.007973053
 [6]  1.618201117 -0.965747088  0.386811224  0.229158237  0.987050613
[11]  1.293453170 -2.432399045 -0.247593481 -0.639769586 -0.464996583
[16]  0.720181047  0.846607030  0.486173088 -0.911247626  0.370326788
arr[5:10] = NA
arr
 [1]  1.3419603  0.7570333 -0.9104688 -0.4758119 NA NA
 [7] NA NA NA NA  1.2934532 -2.4323990
[13] -0.2475935 -0.6397696 -0.4649966  0.7201810  0.8466070  0.4861731
[19] -0.9112476  0.3703268
is.na(arr)
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
mean(arr)
[1] NA
mean(arr, na.rm=T)
[1] -0.01903945

arr + rnorm(20)
 [1]  2.081580297  0.505050028 -0.696287035 -1.280323279   NA
 [6]   NA   NA   NA   NA   NA
[11]  2.166078369 -1.445271291  0.764894624  0.795890929  0.549621207
[16]  0.005215596 -0.170001426  0.712335355 -0.919671745 -0.617099818

and obviously this is OK too:

arr <- rep('wes', 10)
arr[5:7] <- NA
is.na(arr)
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

note, NA gets excluded from categorical variables (factors):
as.factor(arr)
 [1] wes  wes  wes  wes  <NA> <NA> <NA> wes  wes  wes
Levels: wes

e.g. groupby with NA:

 tapply(rnorm(10), arr, mean)
   wes
-0.5271853
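
For comparison, a rough pandas analogue of the R session above, using NaN as the missing marker (a sketch only; the values are random, and R's 1-based inclusive slices map to Python's half-open ones):

import numpy as np
import pandas as pd

arr = pd.Series(np.random.randn(20))
arr[4:10] = np.nan                # analogue of arr[5:10] = NA above

print(arr.mean())                 # NaNs skipped by default, like mean(arr, na.rm=TRUE)
print(arr.isnull().values[:6])    # analogue of is.na(arr)

labels = pd.Series(['wes'] * 10)
labels[4:7] = np.nan              # missing group labels
print(pd.Series(np.random.randn(10)).groupby(labels).mean())   # NaN keys dropped, like tapply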
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray

2011-06-24 Thread Wes McKinney
On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith n...@pobox.com wrote:
 On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root ben.r...@ou.edu wrote:
 On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith n...@pobox.com wrote:
 This is a situation where I would just... use an array and a mask,
 rather than a masked array. Then lots of things -- changing fill
 values, temporarily masking/unmasking things, etc. -- come from free,
 just from knowing how arrays and boolean indexing work?

 With a masked array, it is for free.  Why re-invent the wheel?  It has
 already been done for me.

 But it's not for free at all. It's an additional concept that has to
 be maintained, documented, and learned (with the last cost, which is
 multiplied by the number of users, being by far the greatest). It's
 not reinventing the wheel, it's saying hey, I have wheels and axles,
 but what I really need the library to provide is a wheel+axle
 assembly!

You're communicating my argument better than I am.

 Do we really get much advantage by building all these complex
 operations in? I worry that we're trying to anticipate and write code
 for every situation that users find themselves in, instead of just
 giving them some simple, orthogonal tools.


 This is the danger, and which is why I advocate retaining the MaskedArray
 type that would provide the high-level intelligent operations, meanwhile
 having in the core the basic data structures for  pairing a mask with an
 array, and to recognize a special np.NA value that would act upon the mask
 rather than the underlying data.  Users would get very basic functionality,
 while the MaskedArray would continue to provide the interface that we are
 used to.

 The interface as described is quite different... in particular, all
 aggregate operations would change their behavior.

 As a corollary, I worry that learning and keeping track of how masked
 arrays work is more hassle than just ignoring them and writing the
 necessary code by hand as needed. Certainly I can imagine that *if the
 mask is a property of the data* then it's useful to have tools to keep
 it aligned with the data through indexing and such. But some of these
 other things are quicker to reimplement than to look up the docs for,
 and the reimplementation is easier to read, at least for me...

 What you are advocating is similar to the tried-n-true coding practice of
 Matlab users of using NaNs.  You will hear from Matlab programmers about how
 it is the greatest idea since sliced bread (and I was one of them).  Then I
 was introduced to Numpy, and I while I do sometimes still do the NaN
 approach, I realized that the masked array is a better way.

 Hey, no need to go around calling people Matlab programmers, you might
 hurt someone's feelings.

 But seriously, my argument is that every abstraction and new concept
 has a cost, and I'm dubious that the full masked array abstraction
 carries its weight and justifies this cost, because it's highly
 redundant with existing abstractions. That has nothing to do with how
 tried-and-true anything is.

+1. I think I will personally only be happy if masked array can be
implemented while incurring near-zero cost from the end user
perspective. If what we end up with is a faster implementation of
numpy.ma in C I'm probably going to keep on using NaN... That's why
I'm entirely insistent that whatever design be dogfooded on non-expert
users. If it's very much harder / trickier / nuanced than R, you will
have failed.

 As for documentation, on hard/soft masks, just look at the docs for the
 MaskedArray constructor:
 [...snipped...]

 Thanks!

 -- Nathaniel
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fixing up datetime

2011-06-09 Thread Wes McKinney
On Wed, Jun 8, 2011 at 8:53 PM, Mark Wiebe mwwi...@gmail.com wrote:
 On Wed, Jun 8, 2011 at 4:57 AM, Wes McKinney wesmck...@gmail.com wrote:

 snip


 So in summary, w.r.t. time series data and datetime, the only things I
 care about from a datetime / pandas point of view:

 - Ability to easily define custom timedeltas

 Can you elaborate on this a bit? I'm guessing you're not referring to the
 timedelta64 in NumPy, which is simply an integer with an associated unit.

I guess what I am thinking of may not need to be available at the
NumPy level. As an example, suppose you connected a timedelta
(DateOffset, in pandas parlance) to a set of holidays, so when you
say:

date + BusinessDay(1, calendar='US')

then you have some custom logic which knows how to describe the result
of that. Some things along these lines have already been discussed--
but this is the basic way that the date offsets work in pandas, i.e.
subclasses of DateOffset which implement custom logic. Probably too
high level for NumPy.
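
To make the idea concrete, a minimal sketch of this kind of offset (a
hypothetical class, not the actual pandas DateOffset API; the holiday handling
is purely illustrative):

from datetime import datetime, timedelta

class WeekdayOffset(object):
    """Hypothetical stand-in for a DateOffset subclass: add n business
    days, skipping weekends and any dates in a user-supplied holiday set."""
    def __init__(self, n, holidays=()):
        self.n = n
        self.holidays = set(holidays)

    def __radd__(self, date):
        step = 1 if self.n >= 0 else -1
        remaining = abs(self.n)
        while remaining > 0:
            date = date + timedelta(days=step)
            if date.weekday() < 5 and date.date() not in self.holidays:
                remaining -= 1
        return date

# 2011-07-01 is a Friday, so one business day later is Monday 2011-07-04
print(datetime(2011, 7, 1) + WeekdayOffset(1))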


 - Generate datetime objects, or some equivalent, which can be used to
 back pandas data structures

 Do you mean mechanisms to generate sequences of datetime's? I'm fixing up
 arange to work reasonably with datetimes at the moment.

Yes, that will be very nice.


 - (possible now??) Ability to have a set of frequency-naive dates
 (possibly not in order).

 This should work just as well as if you have an arbitrary set of integers
 specifying the locations of the sample points.
 Cheers,
 Mark

Cool (!).


 This last point actually matters. Suppose you wanted to get the worst
 5-performing days in the S&P 500 index:

 In [7]: spx.index
 Out[7]:
 <class 'pandas.core.daterange.DateRange'>
 offset: 1 BusinessDay, tzinfo: None
 [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00]
 length: 2963

 # but this is OK
 In [8]: spx.order()[:5]
 Out[8]:
 2008-10-15 00:00:00    -0.0903497960942
 2008-12-01 00:00:00    -0.0892952780505
 2008-09-29 00:00:00    -0.0878970494885
 2008-10-09 00:00:00    -0.0761670761671
 2008-11-20 00:00:00    -0.0671229140321

 - W
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fixing up datetime

2011-06-08 Thread Wes McKinney
On Wed, Jun 8, 2011 at 7:36 AM, Chris Barker chris.bar...@noaa.gov wrote:
 On 6/7/11 4:53 PM, Pierre GM wrote:
  Anyhow, each time you
 read 'frequency' in scikits.timeseries, think 'unit'.

 or maybe precision -- when I think if unit, I think of something that
 can be represented as a floating point value -- but here, with integers,
 it's the precision that can be represented. Just a thought.

 Well, it can be argued that the epoch is 0...

 yes, but that really should be transparent to the user -- what epoch is
 chosen should influence as little as possible (e.g. only the range of
 values representable)

 Mmh. How would you define a quarter unit ? [3M] ? But then, what if
 you want your year to start in December, say (we often use
 DJF/MAM/JJA/SON as a way to decompose a year in four 'hydrological'
 seasons, for example)

 And the federal fiscal year is Oct - Sept, so the first quarter is (Oct,
 Nov, Dec) -- clearly that needs to be flexible.


 -Chris




 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Your guys' discussion is a bit overwhelming for me in my currently
jet-lagged state ( =) ) but I thought I would comment on a couple
things, especially now with the input of another financial Python user
(great!).

Note that I use scikits.timeseries very little for a few reasons (a
bit OT, but...):

- Fundamental need to be able to work with multiple time series,
especially performing operations involving cross-sectional data
- I think it's a bit hard for lay people to use (read: ex-MATLAB/R
users). This is just my opinion, but a few years ago I thought about
using it and concluded that teaching people how to properly use it (a
precision tool, indeed!) was going to cause me grief.
- The data alignment problem, best explained in code:

In [8]: ts
Out[8]:
2000-01-05 00:00:000.0503706684002
2000-01-12 00:00:00-1.7660004939
2000-01-19 00:00:001.11716758554
2000-01-26 00:00:00-0.171029995265
2000-02-02 00:00:00-0.99876580126
2000-02-09 00:00:00-0.262729046405

In [9]: ts.index
Out[9]:
<class 'pandas.core.daterange.DateRange'>
offset: 1 Week: kwds={'weekday': 2}, weekday=2, tzinfo: None
[2000-01-05 00:00:00, ..., 2000-02-09 00:00:00]
length: 6

In [10]: ts2 = ts[:4]

In [11]: ts2.index
Out[11]:
<class 'pandas.core.daterange.DateRange'>
offset: 1 Week: kwds={'weekday': 2}, weekday=2, tzinfo: None
[2000-01-05 00:00:00, ..., 2000-01-26 00:00:00]
length: 4

In [12]: ts + ts2
Out[12]:
2000-01-05 00:00:000.1007413368
2000-01-12 00:00:00-3.5320009878
2000-01-19 00:00:002.23433517109
2000-01-26 00:00:00-0.34205999053
2000-02-02 00:00:00NaN
2000-02-09 00:00:00NaN

Or ts or ts2 could be completely DateRange-naive (e.g. they have no
way of knowing that they are fixed-frequency), or even out of order,
and stuff like this will work no problem. I view the fixed frequency
issue as sort of an afterthought-- if you need it, it's there for you
(the DateRange class is a valid Index--label vector--for pandas
objects, and provides an API for defining custom time deltas). Which
leads me to:

- Inability to derive custom offsets:

I can do:

In [14]: ts.shift(2, offset=2 * datetools.BDay())
Out[14]:
2000-01-11 00:00:000.0503706684002
2000-01-18 00:00:00-1.7660004939
2000-01-25 00:00:001.11716758554
2000-02-01 00:00:00-0.171029995265
2000-02-08 00:00:00-0.99876580126
2000-02-15 00:00:00-0.262729046405

or even generate, say, 5-minutely or 10-minutely date ranges thusly:

In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00',
offset=datetools.Minute(5))
Out[16]:
<class 'pandas.core.daterange.DateRange'>
offset: 5 Minutes, tzinfo: None
[2011-06-08 05:00:00, ..., 2011-06-08 12:00:00]
length: 85

I'm currently working on high perf reduceat-based resampling methods
(e.g. converting secondly data to 5-minutely data).
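
For reference, a sketch of what the reduceat-based downsampling can look like
(assuming evenly spaced, already-sorted secondly observations; this is not the
actual pandas implementation):

import numpy as np

values = np.random.randn(3600)               # one hour of secondly data
bin_starts = np.arange(0, len(values), 300)  # 5-minute bucket boundaries
sums = np.add.reduceat(values, bin_starts)   # per-bucket sums
counts = np.diff(np.append(bin_starts, len(values)))
means = sums / counts                        # 5-minutely means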

So in summary, w.r.t. time series data and datetime, the only things I
care about from a datetime / pandas point of view:

- Ability to easily define custom timedeltas
- Generate datetime objects, or some equivalent, which can be used to
back pandas data structures
- (possible now??) Ability to have a set of frequency-naive dates
(possibly not in order).

This last point actually matters. Suppose you wanted to get the worst
5-performing days in the S&P 500 index:

In [7]: spx.index
Out[7]:
<class 'pandas.core.daterange.DateRange'>
offset: 1 BusinessDay, tzinfo: None
[1999-12-31 00:00:00, ..., 2011-05-10 00:00:00]
length: 2963

# but this is OK
In [8]: spx.order()[:5]
Out[8]:
2008-10-15 00:00:00-0.0903497960942
2008-12-01 00:00:00-0.0892952780505
2008-09-29 00:00:00-0.0878970494885

Re: [Numpy-discussion] fixing up datetime

2011-06-08 Thread Wes McKinney
On Wed, Jun 8, 2011 at 6:37 AM, Dave Hirschfeld
dave.hirschf...@gmail.com wrote:
 Wes McKinney wesmckinn at gmail.com writes:


 
  - Fundamental need to be able to work with multiple time series,
  especially performing operations involving cross-sectional data
  - I think it's a bit hard for lay people to use (read: ex-MATLAB/R
  users). This is just my opinion, but a few years ago I thought about
  using it and concluded that teaching people how to properly use it (a
  precision tool, indeed!) was going to cause me grief.
  - The data alignment problem, best explained in code:
 

 
  - Inability to derive custom offsets:
 
  I can do:
 
  In [14]: ts.shift(2, offset=2 * datetools.BDay())
  Out[14]:
  2000-01-11 00:00:00    0.0503706684002
  2000-01-18 00:00:00    -1.7660004939
  2000-01-25 00:00:00    1.11716758554
  2000-02-01 00:00:00    -0.171029995265
  2000-02-08 00:00:00    -0.99876580126
  2000-02-15 00:00:00    -0.262729046405
 
  or even generate, say, 5-minutely or 10-minutely date ranges thusly:
 
  In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00',
  offset=datetools.Minute(5))
  Out[16]:
  <class 'pandas.core.daterange.DateRange'>
  offset: 5 Minutes, tzinfo: None
  [2011-06-08 05:00:00, ..., 2011-06-08 12:00:00]
  length: 85
 

 It would be nice to have a step argument in the date_array function. The
 following works though:

 In [96]: integers = r_[ts.Date('T', "01-Aug-2011 05:00").value:
                        ts.Date('T', "06-Aug-2011 12:01").value:
                       5]

 In [97]: ts.date_array(integers, freq='T')
 Out[97]:
 DateArray([01-Aug-2011 05:00, 01-Aug-2011 05:05, 01-Aug-2011 05:10, ...,
       06-Aug-2011 11:45, 06-Aug-2011 11:50, 06-Aug-2011 11:55, ...,
       06-Aug-2011 12:00],
          freq='T')


  - (possible now??) Ability to have a set of frequency-naive dates
  (possibly not in order).
 
  This last point actually matters. Suppose you wanted to get the worst
  5-performing days in the S&P 500 index:
 
  In [7]: spx.index
  Out[7]:
  <class 'pandas.core.daterange.DateRange'>
  offset: 1 BusinessDay, tzinfo: None
  [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00]
  length: 2963
 
  # but this is OK
  In [8]: spx.order()[:5]
  Out[8]:
  2008-10-15 00:00:00    -0.0903497960942
  2008-12-01 00:00:00    -0.0892952780505
  2008-09-29 00:00:00    -0.0878970494885
  2008-10-09 00:00:00    -0.0761670761671
  2008-11-20 00:00:00    -0.0671229140321
 
  - W
 

 Seems like you're looking for the tssort function:

 In [90]: series = ts.time_series(randn(365),start_date='01-Jan-2011',freq='D')

 In [91]: def tssort(series):
   :     indices = series.argsort()
   :     return ts.time_series(series._series[indices],
   :                           series.dates[indices],
   :                           autosort=False)
   :

 In [92]: tssort(series)[0:5]
 Out[92]:
 timeseries([-2.96602612 -2.81612524 -2.61558511 -2.59522921 -2.4899447 ],
   dates = [26-Aug-2011 18-Apr-2011 27-Aug-2011 21-Aug-2011 19-Nov-2011],
   freq  = D)


 Pandas seems to offer an even higher level api than scikits.timeseries. I view
 them as mostly complementary but I haven't (yet) had much experience with
 pandas...

 -Dave



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I admit that my partial ignorance about scikits.timeseries (due to
seldom use) makes my statements a bit unfair-- but you're right in
that it's a high-level versus low-level thing. Just about everything
is possible but may require a certain level of experience /
sophistication / deep understanding (as you have).
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fixing up datetime

2011-06-06 Thread Wes McKinney
On Mon, Jun 6, 2011 at 8:16 AM, Mark Dickinson mdickin...@enthought.com wrote:
 On Thu, Jun 2, 2011 at 5:42 PM, Mark Wiebe mwwi...@gmail.com wrote:
 Leap years are easy compared with leap seconds. Leap seconds involve a
 hardcoded table of particular leap-seconds that are added or subtracted, and
 are specified roughly 6 months in advance of when they happen by
 the International Earth Rotation and Reference Systems Service (IERS). The
 POSIX time_t doesn't store leap seconds, so if you subtract two time_t
 values you may get the wrong answer by up to 34 seconds (or 24, I'm not
 totally clear on the initial 10 second jump that happened somewhere).

 Times in the future would be hairy, too. If leap seconds are
 supported, how many seconds are there in the timedelta:

    2015-01-01 00:00:00  -  2010-01-01 00:00:00

 ?  Is it acceptable for the result of the subtraction like this to
 change when the leap second table is updated (e.g., to reflect a newly
 added leap second on June 30th, 2012)?

 Mark
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Tangential to the technical details being discussed, but my $0.02 on
datetime generally:

One thing I tried to do in pandas is enable the implementation of
custom time deltas (date offsets in pandas parlance). For example,
let's say add 5 weekdays (or business days) to a particular date. Now,
there are some arbitrary choices you have to make, e.g. what do you do
when the date being added to is not a business day? If you look in

https://github.com/wesm/pandas/blob/master/pandas/core/datetools.py

you can see some examples. Things are a bit of a mess in places but
the key ideas are there. One nice thing about this framework is that
you can create subclasses of DateOffset that connect to some external
source of information, e.g. trading exchange information-- so if you
want to shift by trading days for that exchange-- e.g. include
holidays and other closures, you can do that relatively seamlessly.
Downside is that it's fairly slow generally speaking.

I'm hopeful that the datetime64 dtype will enable scikits.timeseries
and pandas to consolidate much of the datetime / frequency code.
scikits.timeseries has a ton of great stuff for generating dates with
all the standard fixed frequencies.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fixing up datetime

2011-06-01 Thread Wes McKinney
On Wed, Jun 1, 2011 at 10:29 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Wed, Jun 1, 2011 at 3:16 PM, Mark Wiebe mwwi...@gmail.com wrote:

 On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:

 snip
 Just a quick comment, as this really needs more thought, but time is a
 bag of worms.

 Certainly a bag of worms, I agree.


 Trying to represent some standard -- say seconds at the solar system
 barycenter to account for general relativity -- is something that I think is
 too complicated and specialized to put into numpy.



 Good support for units and delta times is very useful,

 This part works fairly well now, except for some questions like what
 should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe
 2011-02-28, or 2011-03-02?


 but parsing dates and times

 I've implemented an almost-ISO 8601 date-time parser.  I had to deviate a
 bit to support years outside the little 1-year window we use. I think
 supporting more formats could be handled by writing a function which takes
 its date format and outputs ISO 8601 format to feed numpy.

 Heh, 1 years is way outside the window in which the Gregorian calendar
 is useful anyway. Now, strict Julian days start at the creation of the world
 so that's a real standard, sort of like degrees Kelvins ;)


 and handling timezones, daylight savings,

 The only timezone to handle is local, which it looks like standard C has
 a rich enough API for this to be ok.


 leap seconds,

 This one can be mitigated by referencing TAI or GPS time, I think. POSIX
 time looks like a can of worms though.


 business days,

 I think Enthought has some interest in a decent business day
 functionality. Some feedback from people doing time series and financial
 modeling would help clarify this though. This link may provide some context:
 http://www.financialcalendar.com/Data/Holidays/Formats

 Well, that can change. Likewise, who wants to track IERS for the leap
 seconds?

 I think the important thing is that the low level support needed to
 implement such things is available. A package for common use cases that
 serves as an example and that can be extended/modified by others would also
 be helpful.

 Chuck

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



Being involved in financial data analysis (e.g. via pandas) this stuff
will affect me very directly. I'll get back to you soon with some
comments.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] labeled axes

2011-05-26 Thread Wes McKinney
On Tue, May 24, 2011 at 6:39 PM, Craig Yoshioka crai...@me.com wrote:
 Hi all,
 I've read some discussions about adding labeled axes, and even ticks, to
 numpy arrays  (such as in Luis' dataarray).
 I have recently found that the ability to label axes would be very helpful
 to me, but I'd like to keep the implementation as lightweight as possible.
 The reason I would find this useful is because I am writing a ndarray
 subclass that loads image/volume file formats into numpy arrays.  Some of
 these files might have multiple images/volumes, I'll call them channels, and
 also may have an additional dimension for vectors associated with each
 pixel/voxel, like color.  The max dims of the array would then be 5.
 Example: data = ndarray([1023,128,128,128,3]) might mean
 (channels,z,y,x,rgb) for one array.  Now I want to keep as much of the fancy
 indexing capabilities of numpy as I can, but I am finding it difficult to
 track the removal of axes that can occur from indexing.  For example
 data[2,2] would return an array of shape (128,128,3), or the third slice
 through the third volume in the dataset, but the returned array has lost the
 meaning associated with its axes, so saving it back out would require manual
 relabeling of the axes.   I'd like to be able to track the axes as metadata
 and retain all the fancy numpy indexing.
 There are two ways I could accomplish this with minimal code on the python
 side:
  One would be if indexing of the array always returned an array of the same
 dimensionality, that is data[2,2] returned an array of shape
 (1,1,128,128,3).  I could then delete the degenerate axes labels from the
 metadata, and return the compressed array, resulting in the same output:

 class Data(np.ndarray):
     def __getitem__(self, indices):
         data = np.ndarray.__getitem__(self, indices, donotcompress=True)  # as an example
         data.axeslabels = [label for label, dim in zip(self.axeslabels, data.shape) if dim > 1]
         return data.compress()
     def __getslice__(self, s1, s2, step):
         # trivial case

 Another approach would be if there is some function in the numpy internals
 that I could use to get the needed information before calling the ndarray's
 __getitem__ function:

 class Data(np.ndarray):
     def __getitem__(self, indices):
         unique = np.uniqueIndicesPerDimension(indices)
         data = np.ndarray.__getitem__(self, indices)
         data.axeslabels = [label for label, dim in zip(self.axeslabels, unique) if dim > 1]
         return data

 Finally, I could implement my own parser for the passed indices to figure
 this out myself.  This would be bad since I'd have to recreate a lot of the
 same code that must go on inside numpy, and it would be slower, error-prone,
 etc. :

 class Data(np.ndarray):
     def __getitem__(self, indices):
         indices = self.uniqueDimensionIndices(indices)
         data = np.ndarray.__getitem__(self, indices)
         data.axeslabels = [label for label, dim in zip(self.axeslabels, indices) if dim > 1]
         return data
     def uniqueDimensionIndices(self, indices):
         if isinstance(indices, int):
             indices = (indices,)
         if isinstance(indices, tuple):
             ...
         elif isinstance(indices, list):
             ...

 Is there anything in the numpy internals already that would allow me to do
 #1 or #2?, I don't think #3 is a very good option.
 Thanks!




 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



I would recommend joining or at least following the datarray project--
I believe it will do what you want, and we are working actively to
build out this functionality. Here are some links:

Datarray
Github: https://github.com/fperez/datarray
docs: http://fperez.github.com/datarray-doc/

Podcast we did about recent meeting about datarray
http://inscight.org/2011/05/18/episode_13/

Other projects to consider using: larry and pandas-- these support
data alignment which you may not care about. In pandas I'm only
concerned with data with ndim <= 3, a bit specific to
statistics/econometrics/finance applications.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] datarray -- a couple of questions

2011-05-22 Thread Wes McKinney
On Sun, May 22, 2011 at 10:49 AM, Ben Walsh ben_w_...@yahoo.co.uk wrote:
 Hi

 I've been looking at various labelled-array packages, including Pandas
 (https://github.com/wesm/pandas) and datarray
 (https://github.com/fperez/datarray).

 I was quite interested in the design discussion for datarray, and I had
 a couple of questions:

 1) Is this mailing list the correct place to ask about this?

 2) Pandas supports mixed data types (eg. one column of floats, another
 column of strings). Does the datarray prototype support this? Is it
 simply a case of using a numpy.ndarray with dtype='object'?

 Cheers

 Ben
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


hi Ben,

others can comment, but I think the answer to 1) is yes (if you have
more specific usage questions about pandas feel free to ask on
pystatsmodels, too).

2) datarray is currently conceived as a homogeneously typed object. So
you could use dtype=object but occasionally have computational snafus.
pandas tries to avoid these problems by segregating the floating point
and non-floating point data. In recent discussions I do not think
datarray is intended to be a step-in replacement for something like
pandas, which is geared toward solving a bit more domain-specific set
of problems (e.g. R-like data manipulations, time series / finance
work, etc.). Rather, a future version of pandas will use datarray as
its base layer. So if you're looking for an (R) data.frame - like
container, pandas.{DataFrame, DataMatrix} is probably your best bet
for now
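
For illustration, a tiny sketch of the mixed-dtype handling described above,
using a present-day pandas DataFrame (the column names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1.5, 2.0, np.nan], 'ticker': ['a', 'b', 'c']})
print(df.dtypes)           # price is float64, ticker is object
print(df['price'].sum())   # 3.5 -- NaN is skipped in reductions by default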

larry is also worth looking at, but like datarray it also does not
do mixed-type data.

cheers,
Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Bug? NumPy types circumvent some overloaded methods in ndarray subclasses

2011-05-07 Thread Wes McKinney
This strikes me as a bug-- haven't checked NumPy 1.6 yet but this
happens in 1.5.1. Here's a toy example:

class MyNdarray(np.ndarray):

    def __new__(cls, data):
        subarr = np.array(data, dtype=np.float64).view(cls)
        return subarr

    def __radd__(self, other):
        print 'hello'
        return (self.view(np.ndarray) + other).view(MyNdarray)

    def __add__(self, other):
        print 'hello'
        return (self.view(np.ndarray) + other).view(MyNdarray)

In [25]: foo = MyNdarray(np.arange(10.))

In [26]: foo + 3
hello
Out[26]: MyNdarray([  3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.])

In [27]: foo + np.float64(3)
hello
Out[27]: MyNdarray([  3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.])

In [28]: 3 + foo
hello
Out[28]: MyNdarray([  3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.])

In [29]: np.float64(3) + foo
Out[29]: MyNdarray([  3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.])

I'm not sure if there's an easy workaround here. I thought maybe it
was an array priority issue but after some tinkering it doesn't look
like it. Any ideas?
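
One place a hook does fire on that code path is __array_wrap__: when
np.float64(3) + foo goes through the ufunc machinery instead of __radd__, the
subclass is still asked to wrap the result. A sketch (not claiming this settles
the dispatch question above):

import numpy as np

class MyNdarray2(np.ndarray):
    def __new__(cls, data):
        return np.array(data, dtype=np.float64).view(cls)

    def __array_wrap__(self, out_arr, *args, **kwargs):
        print('wrapping ufunc result')   # subclass-specific handling goes here
        return np.ndarray.__array_wrap__(self, out_arr, *args, **kwargs)

foo = MyNdarray2(np.arange(10.))
np.float64(3) + foo   # prints 'wrapping ufunc result', returns a MyNdarray2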

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Data standardizing

2011-04-13 Thread Wes McKinney
On Wed, Apr 13, 2011 at 9:50 AM, Jonathan Rocher jroc...@enthought.com wrote:
 Hi,

 I assume you have this data in a txt file, correct? You can load up all of
 it in a numpy array using
 import numpy as np
 data = np.loadtxt("climat_file.txt", skiprows = 1)

 Then you can compute the mean you want by taking it on a slice of the data
 array. For example, if you want to compute the mean of your data in Jan for
 1950-1970 (say including 1970)
 mean1950_1970 = data[1950-1900:1971-1900, 1].mean()  # row 0 is year 1900

 Then the std deviation you want could be computed using
 my_std = np.sqrt(np.mean((data[:,1]-mean1950_1970)**2))

 Hope this helps,
 Jonathan

 On Tue, Apr 12, 2011 at 1:48 PM, Climate Research climatef...@gmail.com
 wrote:

 Hi
 I am purely new to python and numpy..  I am using python for doing
 statistical calculations to Climate data..

 I  have  a  data set in the following format..

 Year  Jan    Feb    Mar    Apr   ...   Dec
 1900  1000   1001   ...
 1901  1011   1012   ...
 1902  1009   1007   ...
  ...
 2010  1008   1002   ...

 I actually want to standardize each of these values with corresponding
 standard deviations for  each monthly data column..
 I have found out the standard deviations for each column..  but now i need
 to find the standard deviation only for a prescribed mean value
 i.e., when i am finding the standard deviation for the january data
 column..  the mean should be calculated only for the january data, say from
 1950-1970. With this mean i want to calculate the SD for the entire column.
 Any help will be appreciated..



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion




 --
 Jonathan Rocher, PhD
 Scientific software developer
 Enthought, Inc.
 jroc...@enthought.com
 1-512-536-1057
 http://www.enthought.com


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



To standardize the data over each column you'll want to do:

(data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

Note the broadcasting behavior of the (matrix - vector) operation--see
NumPy documentation for more details. The ddof=1 is there to give you
the (unbiased) sample standard deviation.
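
Spelled out with shapes, a self-contained sketch of the broadcasting at work
(random numbers standing in for the climate table):

import numpy as np

data = np.random.randn(111, 12)               # rows = years, columns = months
col_means = data.mean(axis=0)                 # shape (12,)
col_stds = data.std(axis=0, ddof=1)           # shape (12,)
standardized = (data - col_means) / col_stds  # (111, 12) op (12,) broadcasts row-wise
print(standardized.std(axis=0, ddof=1))       # ~1.0 in every column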

<shameless plug>

If you're looking for data structures to carry around your metadata
(dates and month labels), look to pandas (my project:
http://pandas.sourceforge.net/) or larry
(http://larry.sourceforge.net/).

</shameless plug>

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Remove deprecated skiprows

2011-04-05 Thread Wes McKinney
On Tue, Apr 5, 2011 at 5:52 PM, Ralf Gommers
ralf.gomm...@googlemail.com wrote:
 On Tue, Apr 5, 2011 at 11:45 PM, Skipper Seabold jsseab...@gmail.com wrote:
 On Sun, Apr 3, 2011 at 8:20 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
 Should skiprows be removed?

      if skiprows:
          warnings.warn(\
              "The use of `skiprows` is deprecated, it will be removed in numpy 2.0.\n \
              Please use `skip_header` instead.",
              DeprecationWarning)
          skip_header = skiprows

 It's been deprecated since 1.4. Personally, I like skiprows better than
 skip_header ;)


 +1 for skiprows. I always have to look it up.

 To me one is not much better than the other, but -1 for skiprows
 because un-deprecating it and deprecating skip_header is inconsistent
 and annoying for users.

 Ralf
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


FYI:

http://docs.python.org/library/warnings.html

DeprecationWarning Base category for warnings about deprecated
features (ignored by default).

Maybe DeprecationWarnings should be changed to something that isn't
ignored by default so users see it?
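
For anyone who does want to see them, the standard ways to turn them back on
(plain Python mechanisms, nothing NumPy-specific):

import warnings
warnings.simplefilter('always', DeprecationWarning)  # from inside a script

# or from the command line:
#   python -W always::DeprecationWarning myscript.py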
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Remove deprecated skiprows

2011-04-05 Thread Wes McKinney
On Tue, Apr 5, 2011 at 7:32 PM, Robert Kern robert.k...@gmail.com wrote:
 On Tue, Apr 5, 2011 at 18:20, Wes McKinney wesmck...@gmail.com wrote:

 FYI:

 http://docs.python.org/library/warnings.html

 DeprecationWarning     Base category for warnings about deprecated
 features (ignored by default).

 Maybe DeprecationWarnings should be changed to something that isn't
 ignored by default so users see it?

 The Python devs made DeprecationWarnings silent by default for a
 reason. They unnecessarily annoy and frighten application end users
 who don't know what they are, where they are coming from, or how to
 silence them. The recommended practice is for library users (who do or
 should know those things) to run their test suites with the warnings
 turned on.

 --
 Robert Kern

 I have come to believe that the whole world is an enigma, a harmless
 enigma that is made terrible by our own mad attempt to interpret it as
 though it had an underlying truth.
   -- Umberto Eco
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


You make a very convincing point-- makes complete sense.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Numpy-svn] aada93: ENH: Add 'subok' parameter to PyArray_NewLikeArray...

2011-03-15 Thread Wes McKinney
On Tue, Mar 15, 2011 at 3:25 PM, Sebastian Haase seb.ha...@gmail.com wrote:
 On Tue, Mar 15, 2011 at 7:22 PM,  nore...@github.com wrote:
 Branch: refs/heads/master
 Home:   https://github.com/numpy/numpy

 Commit: aada93306acfb4e2eb816faf32652edf8825cf45
    
 https://github.com/numpy/numpy/commit/aada93306acfb4e2eb816faf32652edf8825cf45
 Author: Mark Wiebe mwwi...@gmail.com
 Date:   2011-03-15 (Tue, 15 Mar 2011)

 Changed paths:
  M doc/source/reference/c-api.array.rst
  M numpy/add_newdocs.py
  M numpy/core/numeric.py
  M numpy/core/src/multiarray/convert.c
  M numpy/core/src/multiarray/ctors.c
  M numpy/core/src/multiarray/multiarraymodule.c
  M numpy/core/src/umath/ufunc_object.c
  M numpy/core/tests/test_numeric.py

 Log Message:
 ---
 ENH: Add 'subok' parameter to PyArray_NewLikeArray, np.empty_like, 
 np.zeros_like, and np.ones_like

 This way, the sub-type can be avoided if necessary. This helps mitigate,
 but doesn't fix, ticket #1753, by allowing b = np.empty_like(a, 
 subok=False).

 I'm really not in a position to comment on the depths of the numpy API,
 but my understanding of np.any_like( ... ) was that it would always create
 normal ndarrays, just taking shape and dtype from the given array.
 So what should the interpretation of subok be ?
 Can you elaborate ... ?

 Thanks,
 Sebastian Haase
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I know that things like numpy.ma and pandas are directly impacted by
this change-- in NumPy < 1.6 many API functions cannot be used on
subclasses because they discard any additional information you wish to
be passed on (like the mask in a masked array, for example).
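
With the parameter in place (NumPy >= 1.6), the behavior looks roughly like
this sketch, using a masked array as the subclass:

import numpy as np
import numpy.ma as ma

m = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
print(type(np.empty_like(m)))               # MaskedArray -- subclass preserved
print(type(np.empty_like(m, subok=False)))  # plain ndarray -- subclass dropped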
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] broadcasting behavior for 1.6 (was: Numpy 1.6 schedule)

2011-03-11 Thread Wes McKinney
On Fri, Mar 11, 2011 at 9:57 AM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Fri, Mar 11, 2011 at 7:42 AM, Charles R Harris
 charlesr.har...@gmail.com wrote:


 On Fri, Mar 11, 2011 at 2:01 AM, Ralf Gommers
 ralf.gomm...@googlemail.com wrote:

 I'm just going through the very long 1.6 schedule thread to see what
 is still on the TODO list before a 1.6.x branch can be made. So I'll
 send a few separate mails, one for each topic.

 On Mon, Mar 7, 2011 at 8:30 PM, Francesc Alted fal...@pytables.org
 wrote:
  On Sunday 06 March 2011 06:47:34, Mark Wiebe wrote:
  I think it's ok to revert this behavior for backwards compatibility,
  but believe it's an inconsistent and unintuitive choice. In
  broadcasting, there are two operations, growing a dimension 1 - n,
  and appending a new 1 dimension to the left. The behaviour under
  discussion in assignment is different from normal broadcasting in
  that only the second one is permitted. It is broadcasting the output
  to the input, rather than broadcasting the input to the output.
 
  Suppose a has shape (20,), b has shape (1,20), and c has shape
  (20,1). Then a+b has shape (1,20), a+c has shape (20,20), and b+c
  has shape (20,20).
 
  If we do b[...] = a, a will be broadcast to match b by adding a 1
  dimension to the left. This is reasonable and consistent with
  addition.
 
  If we do a[...]=b, under 1.5 rules, a will once again be broadcast
  to match b by adding a 1 dimension to the left.
 
  If we do a[...]=c, we could broadcast both a and c together to the
  shape (20,20). This results in multiple assignments to each element
  of a, which is inconsistent. This is not analogous to a+c, but
  rather to np.add(c, c, out=a).
 
  The distinction is subtle, but the inconsistent behavior is harmless
  enough for assignment that keeping backwards compatibility seems
  reasonable.
 
  For what is worth, I also like the behaviour that Mark proposes, and
  have updated tables test suite to adapt to this.  But I'm fine if it is
  decided to revert to the previous behaviour.

 The conclusion on this topic, as I read the discussion, is that we
 need to keep backwards compatible behavior (even though the proposed
 change is more intuitive). Has backwards compatibility been fixed
 already?


 I don't think an official conclusion was reached, at least in so far as
 numpy has an official anything ;) But this change does show up as an error
 in one of the pandas tests, so it is likely to affect other folks as well.
 Probably the route of least compatibility hassle is to revert to the old
 behavior and maybe switch to the new behavior, which I prefer, for 2.0.


 That said, apart from pandas and pytables, and the latter has been fixed,
 the new behavior doesn't seem to have much fallout. I think it actually
 exposes unnoticed assumptions in code that slipped by because there was no
 consequence.

 Chuck

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



I've fixed the pandas issue-- I'll put out a bugfix release whenever
NumPy 1.6 final is out. I don't suspect it will cause very many
problems (and those problems will--hopefully--be easy to fix).
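
To keep the cases above straight, a small sketch of the shapes under
discussion (the comments paraphrase the thread; exact behavior differs
between NumPy versions):

import numpy as np

a = np.zeros(20)          # shape (20,)
b = np.zeros((1, 20))
c = np.zeros((20, 1))

print((a + b).shape)      # (1, 20)
print((a + c).shape)      # (20, 20)

b[...] = a                # uncontroversial: a is broadcast to (1, 20)
# The debated assignments:
#   a[...] = b            # 1.5 behavior: allowed, a is matched against b's shape
#   a[...] = c            # would require broadcasting both sides to (20, 20),
#                         # i.e. multiple writes per element of a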
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] {zeros, empty, ...}_like functions behavior with subclasses

2011-03-01 Thread Wes McKinney
On Mon, Feb 28, 2011 at 10:52 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Mon, Feb 28, 2011 at 7:24 PM, Pierre GM pgmdevl...@gmail.com wrote:

 On Mar 1, 2011, at 1:05 AM, Bruce Southey wrote:

 On Mon, Feb 28, 2011 at 4:52 PM, Wes McKinney wesmck...@gmail.com wrote:
 I'm having some trouble with the zeros_like function via np.fix:

 def zeros_like(a):
    if isinstance(a, ndarray):
        res = ndarray.__new__(type(a), a.shape, a.dtype, order=a.flags.fnc)
        res.fill(0)
        return res
    try:
        wrap = a.__array_wrap__
    except AttributeError:
        wrap = None
    a = asarray(a)
    res = zeros(a.shape, a.dtype)
    if wrap:
        res = wrap(res)
    return res

 As you can see this is going to discard any metadata stored in a
 subtype. I'm not sure whether this is a bug or a feature but wanted to
 bring it up.

 Thanks,
 Wes
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 I guess this is ticket 929.
 http://projects.scipy.org/numpy/ticket/929

 I was looking at it today but was not sure what is really desired
 here. I considered that this just meant shape and dtype but not sure
 about masked or record arrays behavior. So:
 What is the value of having the metadata?
 What is the meaning of 'like' here?

 Well, that depends on what you wanna do, of course. To handle metadata, I 
 use some kind of dictionary updated in the __array_finalize__. Check 
 numpy.ma.MaskedArray and its subclasses (like scikits.timeseries.TimeSeries) 
 for the details.
  Now that you could store some extra data in the dtype (if I remember and 
 understand correctly), it might be worth considering a proper way to deal 
 with that.


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 The ticket is exactly related to the problem at hand-- having
 __array_finalize__ defined won't help you as it never gets called.


Looks like this commit fixed the problem, so the ticket can be closed

https://github.com/numpy/numpy/commit/c9d1849332ae5bf73299ea1268f6a55f78624688#numpy/core/numeric.py
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] {zeros, empty, ...}_like functions behavior with subclasses

2011-02-28 Thread Wes McKinney
I'm having some trouble with the zeros_like function via np.fix:

def zeros_like(a):
    if isinstance(a, ndarray):
        res = ndarray.__new__(type(a), a.shape, a.dtype, order=a.flags.fnc)
        res.fill(0)
        return res
    try:
        wrap = a.__array_wrap__
    except AttributeError:
        wrap = None
    a = asarray(a)
    res = zeros(a.shape, a.dtype)
    if wrap:
        res = wrap(res)
    return res

As you can see this is going to discard any metadata stored in a
subtype. I'm not sure whether this is a bug or a feature but wanted to
bring it up.
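
A toy illustration of the metadata loss (a hypothetical subclass; the
attribute name is made up):

import numpy as np

class Labeled(np.ndarray):
    def __new__(cls, data, label=None):
        obj = np.asarray(data).view(cls)
        obj.label = label
        return obj
    def __array_finalize__(self, obj):
        # obj is None for explicit ndarray.__new__ calls, so there is no
        # source instance to copy the label from
        self.label = getattr(obj, 'label', None)

a = Labeled([1., 2., 3.], label='demo')
res = np.ndarray.__new__(type(a), a.shape, a.dtype)  # what zeros_like does above
print(type(res).__name__)            # Labeled -- the type survives
print(getattr(res, 'label', None))   # None -- the metadata does not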

Thanks,
Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] {zeros, empty, ...}_like functions behavior with subclasses

2011-02-28 Thread Wes McKinney
On Mon, Feb 28, 2011 at 7:24 PM, Pierre GM pgmdevl...@gmail.com wrote:

 On Mar 1, 2011, at 1:05 AM, Bruce Southey wrote:

 On Mon, Feb 28, 2011 at 4:52 PM, Wes McKinney wesmck...@gmail.com wrote:
 I'm having some trouble with the zeros_like function via np.fix:

 def zeros_like(a):
    if isinstance(a, ndarray):
        res = ndarray.__new__(type(a), a.shape, a.dtype, order=a.flags.fnc)
        res.fill(0)
        return res
    try:
        wrap = a.__array_wrap__
    except AttributeError:
        wrap = None
    a = asarray(a)
    res = zeros(a.shape, a.dtype)
    if wrap:
        res = wrap(res)
    return res

 As you can see this is going to discard any metadata stored in a
 subtype. I'm not sure whether this is a bug or a feature but wanted to
 bring it up.

 Thanks,
 Wes
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 I guess this is ticket 929.
 http://projects.scipy.org/numpy/ticket/929

 I was looking at it today but was not sure what is really desired
 here. I considered that this just meant shape and dtype but not sure
 about masked or record arrays behavior. So:
 What is the value of having the metadata?
 What is the meaning of 'like' here?

 Well, that depends on what you wanna do, of course. To handle metadata, I use 
 some kind of dictionary updated in the __array_finalize__. Check 
 numpy.ma.MaskedArray and its subclasses (like scikits.timeseries.TimeSeries) 
 for the details.
  Now that you could store some extra data in the dtype (if I remember and 
 understand correctly), it might be worth considering a proper way to deal 
 with that.


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


The ticket is exactly related to the problem at hand-- having
__array_finalize__ defined won't help you as it never gets called.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Wes McKinney
On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
matthieu.bruc...@gmail.com wrote:
 Use directly restrict in C99 mode (__restrict does not have exactly the same
 semantics).
 For a valgrind profil, you can check my blog
 (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
 Basically, if you have a python script, you can valgrind --optionsinmyblog
 python myscript.py
 For PAPI, you have to install several packages (perf module for kernel for
 instance) and a GUI to analyze the results (in Eclispe, it should be
 possible).
 Matthieu
 2011/2/15 Sebastian Haase seb.ha...@gmail.com

 Thanks Matthieu,
 using __restrict__ with g++ did not change anything. How do I use
 valgrind with C extensions?
 I don't know what PAPI profil is ...?
 -Sebastian


 On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
  Hi,
  My first move would be to add a restrict keyword to dist (i.e. dist is
  the
  only pointer to the specific memory location), and then declare dist_
  inside
  the first loop also with a restrict.
  Then, I would run valgrind or a PAPI profil on your code to see what
  causes
  the issue (false sharing, ...)
  Matthieu
 
  2011/2/15 Sebastian Haase seb.ha...@gmail.com
 
  Hi,
  I assume that someone here could maybe help me, and I'm hoping it's
  not too much off topic.
  I have 2 arrays of 2d point coordinates and would like to calculate
  all pairwise distances as fast as possible.
  Going from Python/Numpy to a (Swigged) C extension already gave me a
  55x speedup.
  (.9ms vs. 50ms for arrays of length 329 and 340).
  I'm using gcc on Linux.
  Now I'm wondering if I could go even faster !?
  My hope that the compiler might automagically do some SSE2
  optimization got disappointed.
  Since I have a 4 core CPU I thought OpenMP might be an option;
  I never used that, and after some playing around I managed to get
  (only) 50% slowdown(!) :-(
 
  My code in short is this:
  (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
   ---<Ccode>--
   void dists2d(
                double *a_ps, int nx1, int na,
                double *b_ps, int nx2, int nb,
                double *dist, int nx3, int ny3)  throw (char*)
   {
     if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
     if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
     if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";
  
     double *dist_;
     int i, j;
  
   #pragma omp parallel private(dist_, j, i)
     {
   #pragma omp for nowait
       for(i=0; i<na; i++)
         {
           //num_threads=omp_get_num_threads();  --> 4
           dist_ = dist+i*nb;        // dist_ is only introduced for OpenMP
           for(j=0; j<nb; j++)
             {
               *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                                sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
             }
         }
     }
   }
   ---</Ccode>--
  There is probably a simple mistake in this code - as I said I never
  used OpenMP before.
  It should be not too difficult to use OpenMP correctly here
   or -  maybe better -
  is there a simple SSE(2,3,4) version that might be even better than
  OpenMP... !?
 
  I supposed, that I did not get the #pragma omp lines right - any idea ?
  Or is it in general not possible to speed this kind of code up using
  OpenMP !?
 
  Thanks,
  Sebastian Haase
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 
  --
  Information System Engineer, Ph.D.
  Blog: http://matt.eifelle.com
  LinkedIn: http://www.linkedin.com/in/matthieubrucher
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 --
 Information System Engineer, Ph.D.
 Blog: http://matt.eifelle.com
 LinkedIn: http://www.linkedin.com/in/matthieubrucher

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



As an aside, do you have a GPU-equipped machine? This function looks
like it would be an easy CUDA target. Depends on who the users of the
function are (e.g. do they also have the CUDA runtime) if whether you
wanted to go that route.
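
For comparison, a pure-NumPy baseline for the same pairwise distance
computation (broadcasting only; no claim about how it compares to the
SWIG/OpenMP versions above):

import numpy as np

a = np.random.rand(329, 2)   # (na, 2) points
b = np.random.rand(340, 2)   # (nb, 2) points

# (na, 1, 2) - (1, nb, 2) broadcasts to (na, nb, 2)
diff = a[:, np.newaxis, :] - b[np.newaxis, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))   # (na, nb) Euclidean distance matrix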
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Wes McKinney
On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote:
 Wes,
 I think I should have a couple of GPUs. I would be ready for anything
 ... if you think that I could do some easy(!) CUDA programming here,
 maybe you could guide me into the right direction...
 Thanks,
 Sebastian.


 On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
 Use directly restrict in C99 mode (__restrict does not have exactly the same
 semantics).
 For a valgrind profil, you can check my blog
 (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
 Basically, if you have a python script, you can valgrind --optionsinmyblog
 python myscript.py
 For PAPI, you have to install several packages (perf module for kernel for
 instance) and a GUI to analyze the results (in Eclispe, it should be
 possible).
 Matthieu
 2011/2/15 Sebastian Haase seb.ha...@gmail.com

 Thanks Matthieu,
 using __restrict__ with g++ did not change anything. How do I use
 valgrind with C extensions?
 I don't know what PAPI profil is ...?
 -Sebastian


 On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
 matthieu.bruc...@gmail.com wrote:
  Hi,
  My first move would be to add a restrict keyword to dist (i.e. dist is
  the
  only pointer to the specific memory location), and then declare dist_
  inside
  the first loop also with a restrict.
  Then, I would run valgrind or a PAPI profil on your code to see what
  causes
  the issue (false sharing, ...)
  Matthieu
 
  2011/2/15 Sebastian Haase seb.ha...@gmail.com
 
  Hi,
  I assume that someone here could maybe help me, and I'm hoping it's
  not too much off topic.
  I have 2 arrays of 2d point coordinates and would like to calculate
  all pairwise distances as fast as possible.
  Going from Python/Numpy to a (Swigged) C extension already gave me a
  55x speedup.
  (.9ms vs. 50ms for arrays of length 329 and 340).
  I'm using gcc on Linux.
  Now I'm wondering if I could go even faster !?
  My hope that the compiler might automagically do some SSE2
  optimization got disappointed.
  Since I have a 4 core CPU I thought OpenMP might be an option;
  I never used that, and after some playing around I managed to get
  (only) 50% slowdown(!) :-(
 
  My code in short is this:
  (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
   ---<Ccode>--
   void dists2d(
                double *a_ps, int nx1, int na,
                double *b_ps, int nx2, int nb,
                double *dist, int nx3, int ny3)  throw (char*)
   {
     if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
     if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
     if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";
  
     double *dist_;
     int i, j;
  
   #pragma omp parallel private(dist_, j, i)
     {
   #pragma omp for nowait
       for(i=0; i<na; i++)
         {
           //num_threads=omp_get_num_threads();  --> 4
           dist_ = dist+i*nb;        // dist_ is only introduced for OpenMP
           for(j=0; j<nb; j++)
             {
               *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                                sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
             }
         }
     }
   }
   ---</Ccode>--
  There is probably a simple mistake in this code - as I said I never
  used OpenMP before.
  It should be not too difficult to use OpenMP correctly here
   or -  maybe better -
  is there a simple SSE(2,3,4) version that might be even better than
  OpenMP... !?
 
  I supposed, that I did not get the #pragma omp lines right - any idea ?
  Or is it in general not possible to speed this kind of code up using
  OpenMP !?
 
  Thanks,
  Sebastian Haase
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 
  --
  Information System Engineer, Ph.D.
  Blog: http://matt.eifelle.com
  LinkedIn: http://www.linkedin.com/in/matthieubrucher
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 --
 Information System Engineer, Ph.D.
 Blog: http://matt.eifelle.com
 LinkedIn: http://www.linkedin.com/in/matthieubrucher

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 As an aside, do you have a GPU-equipped machine? This function looks
 like it would be an easy CUDA target. Depends on who the users of the
 function are (e.g. do they also have the CUDA runtime) if whether you
 wanted to go that route

Re: [Numpy-discussion] relational join

2011-02-02 Thread Wes McKinney
On Wed, Feb 2, 2011 at 4:46 PM, Robert Kern robert.k...@gmail.com wrote:
 On Wed, Feb 2, 2011 at 21:42, Ilya Shlyakhter ilya_...@alum.mit.edu wrote:
 Does numpy have a relational join operation for joining recordarrays?

 [~]
 |1 from numpy.lib import recfunctions

 [~]
 |2 recfunctions.join_by?
 Type:           function
 Base Class:     <type 'function'>
 String Form:    <function join_by at 0x17bba30>
 Namespace:      Interactive
 File:
 /Library/Frameworks/Python.framework/Versions/6.3/lib/python2.6/site-packages/numpy/lib/recfunctions.py
 Definition:     recfunctions.join_by(key, r1, r2, jointype='inner',
 r1postfix='1', r2postfix='2', defaults=None, usemask=True,
 asrecarray=False)
 Docstring:
    Join arrays `r1` and `r2` on key `key`.

    The key should be either a string or a sequence of string corresponding
    to the fields used to join the array.
    An exception is raised if the `key` field cannot be found in the two input
    arrays.
    Neither `r1` nor `r2` should have any duplicates along `key`: the presence
    of duplicates will make the output quite unreliable. Note that duplicates
    are not looked for by the algorithm.

    Parameters
     ----------
    key : {string, sequence}
        A string or a sequence of strings corresponding to the fields used
        for comparison.
    r1, r2 : arrays
        Structured arrays.
    jointype : {'inner', 'outer', 'leftouter'}, optional
        If 'inner', returns the elements common to both r1 and r2.
        If 'outer', returns the common elements as well as the elements of r1
         not in r2 and the elements of r2 not in r1.
        If 'leftouter', returns the common elements and the elements of r1 not
        in r2.
    r1postfix : string, optional
        String appended to the names of the fields of r1 that are present in r2
        but absent of the key.
    r2postfix : string, optional
        String appended to the names of the fields of r2 that are present in r1
        but absent of the key.
    defaults : {dictionary}, optional
        Dictionary mapping field names to the corresponding default values.
    usemask : {True, False}, optional
        Whether to return a MaskedArray (or MaskedRecords is 
 `asrecarray==True`)
        or a ndarray.
    asrecarray : {False, True}, optional
        Whether to return a recarray (or MaskedRecords if `usemask==True`) or
        just a flexible-type ndarray.

    Notes
     -----
    * The output is sorted along the key.
    * A temporary array is formed by dropping the fields not in the key for the
      two arrays and concatenating the result. This array is then sorted, and
      the common entries selected. The output is constructed by
 filling the fields
      with the selected entries. Matching is not preserved if there are some
      duplicates...
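
A minimal usage sketch for join_by (field names and values are made up):

import numpy as np
from numpy.lib import recfunctions

r1 = np.array([(1, 2.5), (2, 3.0), (3, 1.0)], dtype=[('key', int), ('x', float)])
r2 = np.array([(2, 10), (3, 20), (4, 30)], dtype=[('key', int), ('y', int)])

joined = recfunctions.join_by('key', r1, r2, jointype='inner', usemask=False)
print(joined)              # rows for key 2 and 3 only
print(joined.dtype.names)  # ('key', 'x', 'y')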

 --
 Robert Kern

 I have come to believe that the whole world is an enigma, a harmless
 enigma that is made terrible by our own mad attempt to interpret it as
 though it had an underlying truth.
   -- Umberto Eco
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


It might also be worth your while to check out Keith Goodman's la
(larry) library or my pandas library, which are both designed with
relational data in mind.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread Wes McKinney
On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com wrote:

 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 I've started working on moving window statistics cython functions. I
 plan to make it into a package called Roly (for rolling). The
 signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
 window, axis=-1), etc.

 I think of Nanny and Roly as two separate packages. A narrow focus is
 good for a new package. But maybe each package could be a subpackage
 in a super package?

 Would the function signatures in Nanny (exact duplicates of the
 corresponding functions in Numpy and Scipy) work for pandas? I plan to
 use Nanny in larry. I'll try to get the structure of the Nanny package
 in place. But if it doesn't attract any interest after that then I may
 fold it into larry.
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Why make multiple packages? It seems like all these functions are
somewhat related: practical tools for real-world data analysis (where
observations are often missing). I suspect having everything under one
hood would create more interest than chopping things up-- would be
very useful to folks in many different disciplines (finance,
economics, statistics, etc.). In R, for example, NA-handling is just a
part of everyday life. Of course in R there is a special NA value
which is distinct from NaN-- many folks object to the use of NaN for
missing values. The alternative is masked arrays, but in my case I
wasn't willing to sacrifice so much performance for purity's sake.

I could certainly use the nan* functions to replace code in pandas
where I've handled things in a somewhat ad hoc way.
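
As a rough illustration of the mov_sum(arr, window, axis=-1) signature
mentioned above, here is a pure-NumPy sketch using a cumulative-sum trick.
It is illustrative only (not the Cython code being discussed); entries
before the first full window are left as NaN:

import numpy as np

def mov_sum(arr, window, axis=-1):
    # sketch only: moving sum over trailing windows of length `window`
    arr = np.asarray(arr, dtype=float)
    csum = np.cumsum(arr, axis=axis)
    out = np.full(arr.shape, np.nan)

    def sl(start, stop):
        idx = [slice(None)] * arr.ndim
        idx[axis] = slice(start, stop)
        return tuple(idx)

    out[sl(window - 1, window)] = csum[sl(window - 1, window)]
    out[sl(window, None)] = csum[sl(window, None)] - csum[sl(None, -window)]
    return out

# a mov_nansum could reuse the same machinery after zeroing out the NaNs,
# e.g. mov_sum(np.where(np.isnan(arr), 0, arr), window, axis)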
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread Wes McKinney
On Sun, Nov 21, 2010 at 6:02 PM,  josef.p...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 5:09 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 12:30 PM,  josef.p...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com 
 wrote:
 On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com 
 wrote:

 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 I've started working on moving window statistics cython functions. I
 plan to make it into a package called Roly (for rolling). The
 signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
 window, axis=-1), etc.

 I think of Nanny and Roly as two separate packages. A narrow focus is
 good for a new package. But maybe each package could be a subpackage
 in a super package?

 Would the function signatures in Nanny (exact duplicates of the
 corresponding functions in Numpy and Scipy) work for pandas? I plan to
 use Nanny in larry. I'll try to get the structure of the Nanny package
 in place. But if it doesn't attract any interest after that then I may
 fold it into larry.
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 Why make multiple packages? It seems like all these functions are
 somewhat related: practical tools for real-world data analysis (where
 observations are often missing). I suspect having everything under one
 hood would create more interest than chopping things up-- would be
 very useful to folks in many different disciplines (finance,
 economics, statistics, etc.). In R, for example, NA-handling is just a
 part of every day life. Of course in R there is a special NA value
 which is distinct from NaN-- many folks object to the use of NaN for
 missing values. The alternative is masked arrays, but in my case I
 wasn't willing to sacrifice so much performance for purity's sake.

 I could certainly use the nan* functions to replace code in pandas
 where I've handled things in a somewhat ad hoc way.

 A package focused on NaN-aware functions sounds like a good idea. I
 think a good plan would be to start by making faster, drop-in
 replacements for the NaN functions that are already in numpy and
 scipy. That is already a lot of work. After that, one possibility is
 to add stuff like nancumsum, nanprod, etc. After that moving window
 stuff?

 and maybe group functions after that?

 Yes, group functions are on my list.

 If there is a lot of repetition, you could use templating. Even simple
 string substitution, if it is only replacing the dtype, works pretty
 well. It would at least reduce some copy-paste.

 Unit test coverage should be good enough to mess around with trying
 templating. What's a good way to go? Write my own script that creates
 the .pyx file and call it from the make file? Or are there packages
 for doing the templating?

 Depends on the scale, I tried once with simple string templates
 http://codespeak.net/pipermail/cython-dev/2009-August/006614.html

 here is a pastbin of another version by (?),
 http://pastebin.com/f1a49143d discussed on the cython-dev mailing
 list.

 The cython list has the discussion every once in a while but I haven't
 seen any conclusion yet. For heavier duty templating a proper
 templating package (Jinja?) might be better.

 I'm not an expert.

 Josef



 I added nanmean (the first scipy function to enter nanny) and nanmin.
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


What would you say to a single package that contains:

- NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.)
- moving window functions (moving_{count, sum, mean, var, std, etc.})
- core subroutines for labeled data
- group-by functions
- other things to add to this list?

In other words, basic computational building blocks for making
libraries like larry, pandas, etc.
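
For a flavor of the first item in that list, a minimal NaN-aware mean can be
written as a drop-in sketch (sum the non-NaN values and divide by their
count). This is illustrative only, not Nanny's or pandas' actual
implementation:

import numpy as np

def nanmean(arr, axis=None):
    # sketch: mean over non-NaN entries along `axis`
    arr = np.asarray(arr, dtype=float)
    mask = np.isnan(arr)
    total = np.where(mask, 0.0, arr).sum(axis=axis)
    count = (~mask).sum(axis=axis)
    return total / count  # all-NaN slices come out as NaN (with a warning)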

Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-21 Thread Wes McKinney
On Sun, Nov 21, 2010 at 6:37 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Sun, Nov 21, 2010 at 3:16 PM, Wes McKinney wesmck...@gmail.com wrote:

 What would you say to a single package that contains:

 - NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.)

 I'd say yes.

 - moving window functions (moving_{count, sum, mean, var, std, etc.})

 Yes.

 BTW, we both do arr=arr.astype(float), I think, before doing the
 moving statistics. So I speeded things up by running the moving window
 backwards and writing the result in place.

 - core subroutines for labeled data

 Not sure what this would be. Let's discuss.

Basically I want to produce an indexing vector based on rules-- something
to pass to ndarray.take later on. And maybe your generic binary-op
function from a while back?

 - group-by functions

 Yes. I have some ideas on function signatures.

 - other things to add to this list?

 A no-op function with a really long doc string!

 In other words, basic computational building blocks for making
 libraries like larry, pandas, etc. and doing time series / statistical
 / other manipulations on real world (messy) data sets. The focus isn't
 so much NaN-awareness per se but more practical data wrangling. I
 would be happy to work on such a package and to move all the Cython
 code I've written into it. There's a little bit of datarray overlap
 potentially but I think that's OK

 Maybe we should make a list of function signatures along with brief
 doc strings to get a feel for what we (and hopefully others) have in
 mind?

I've personally never been much for writing specs, but could be
useful. We probably aren't going to get it all right on the first try,
so we'll just do our best and refactor the code later if necessary. We
might be well-served by collecting exemplary data sets and making a
list of things we would like to be able to do easily with that data.

But writing stuff like:

moving_{funcname}(ndarray data, int window, int axis=0, int
min_periods=window) -> ndarray
group_aggregate(ndarray data, ndarray labels, int axis=0, function
agg_function) -> ndarray
group_transform(...) ...

etc. makes sense
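
Sticking with those (hypothetical) signatures, a pure-NumPy sketch of what a
group_aggregate might do; the body here is illustrative only, not an
implementation from pandas or larry:

import numpy as np

def group_aggregate(data, labels, axis=0, agg_function=np.mean):
    # sketch: apply agg_function to the slices of `data` sharing a label
    data = np.asarray(data)
    labels = np.asarray(labels)
    uniques = np.unique(labels)
    pieces = [agg_function(np.compress(labels == k, data, axis=axis), axis=axis)
              for k in uniques]
    return uniques, np.array(pieces)

# e.g. group_aggregate(np.arange(6.).reshape(6, 1), ['a', 'b', 'a', 'b', 'a', 'b'])
# aggregates rows 0, 2, 4 under 'a' and rows 1, 3, 5 under 'b'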

 Where should we continue the discussion? The pystatsmodels mailing
 list? By now the numpy list probably thinks of NaN as Not ANother
 email from this guy.
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Maybe let's have the next thread on SciPy-user-- I think what we're
talking about is general enough to be discussed there.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-20 Thread Wes McKinney
On Sat, Nov 20, 2010 at 6:39 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgood...@gmail.com wrote:
 I should make a benchmark suite.

 ny.benchit(verbose=False)
 Nanny performance benchmark
    Nanny 0.0.1dev
    Numpy 1.4.1
    Speed is numpy time divided by nanny time
    NaN means all NaNs
   Speed   Test                Shape        dtype    NaN?
   6.6770  nansum(a, axis=-1)  (500,500)    int64
   4.6612  nansum(a, axis=-1)  (1,)     float64
   9.0351  nansum(a, axis=-1)  (500,500)    int32
   3.0746  nansum(a, axis=-1)  (500,500)    float64
  11.5740  nansum(a, axis=-1)  (1,)     int32
   6.4484  nansum(a, axis=-1)  (1,)     int64
  51.3917  nansum(a, axis=-1)  (500,500)    float64  NaN
  13.8692  nansum(a, axis=-1)  (1,)     float64  NaN
   6.5327  nanmax(a, axis=-1)  (500,500)    int64
   8.8222  nanmax(a, axis=-1)  (1,)     float64
   0.2059  nanmax(a, axis=-1)  (500,500)    int32
   6.9262  nanmax(a, axis=-1)  (500,500)    float64
   5.0688  nanmax(a, axis=-1)  (1,)     int32
   6.5605  nanmax(a, axis=-1)  (1,)     int64
  48.4850  nanmax(a, axis=-1)  (500,500)    float64  NaN
  14.6289  nanmax(a, axis=-1)  (1,)     float64  NaN

 You can also use the makefile to run the benchmark: make bench
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Keith (and others),

What would you think about creating a library of mostly Cython-based
domain specific functions? So stuff like rolling statistical
moments, nan* functions like you have here, and all that-- NumPy-array
only functions that don't necessarily belong in NumPy or SciPy (but
could be included on down the road). You were already talking about
this on the statsmodels mailing list for larry. I spent a lot of time
writing a bunch of these for pandas over the last couple of years, and
I would have relatively few qualms about moving these outside of
pandas and introducing a dependency. You could do the same for larry--
then we'd all be relying on the same well-vetted and tested codebase.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

2010-11-20 Thread Wes McKinney
On Sat, Nov 20, 2010 at 6:54 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 6:39 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgood...@gmail.com wrote:
 I should make a benchmark suite.

 ny.benchit(verbose=False)
 Nanny performance benchmark
    Nanny 0.0.1dev
    Numpy 1.4.1
    Speed is numpy time divided by nanny time
    NaN means all NaNs
   Speed   Test                Shape        dtype    NaN?
   6.6770  nansum(a, axis=-1)  (500,500)    int64
   4.6612  nansum(a, axis=-1)  (1,)     float64
   9.0351  nansum(a, axis=-1)  (500,500)    int32
   3.0746  nansum(a, axis=-1)  (500,500)    float64
  11.5740  nansum(a, axis=-1)  (1,)     int32
   6.4484  nansum(a, axis=-1)  (1,)     int64
  51.3917  nansum(a, axis=-1)  (500,500)    float64  NaN
  13.8692  nansum(a, axis=-1)  (1,)     float64  NaN
   6.5327  nanmax(a, axis=-1)  (500,500)    int64
   8.8222  nanmax(a, axis=-1)  (1,)     float64
   0.2059  nanmax(a, axis=-1)  (500,500)    int32
   6.9262  nanmax(a, axis=-1)  (500,500)    float64
   5.0688  nanmax(a, axis=-1)  (1,)     int32
   6.5605  nanmax(a, axis=-1)  (1,)     int64
  48.4850  nanmax(a, axis=-1)  (500,500)    float64  NaN
  14.6289  nanmax(a, axis=-1)  (1,)     float64  NaN

 You can also use the makefile to run the benchmark: make bench
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 Keith (and others),

 What would you think about creating a library of mostly Cython-based
 domain specific functions? So stuff like rolling statistical
 moments, nan* functions like you have here, and all that-- NumPy-array
 only functions that don't necessarily belong in NumPy or SciPy (but
 could be included on down the road). You were already talking about
 this on the statsmodels mailing list for larry. I spent a lot of time
 writing a bunch of these for pandas over the last couple of years, and
 I would have relatively few qualms about moving these outside of
 pandas and introducing a dependency. You could do the same for larry--
 then we'd all be relying on the same well-vetted and tested codebase.

 - Wes


By the way I wouldn't mind pushing all of my datetime-related code
(date range generation, date offsets, etc.) into this new library,
too.
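
For flavor, the kind of date-range generation being referred to can be
sketched in a few lines of plain Python; the helper below is hypothetical,
not pandas' actual DateRange / date-offset code:

from datetime import datetime, timedelta

def date_range(start, periods, offset=timedelta(days=1)):
    # sketch: `periods` dates starting at `start`, stepping by `offset`
    out = []
    current = start
    for _ in range(periods):
        out.append(current)
        current = current + offset
    return out

# e.g. five consecutive days starting 2010-11-20
dates = date_range(datetime(2010, 11, 20), 5)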
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Anyone with Core i7 and Ubuntu 10.04?

2010-11-08 Thread Wes McKinney
On Mon, Nov 8, 2010 at 11:33 PM, David Warde-Farley
warde...@iro.umontreal.ca wrote:
 On 2010-11-08, at 8:52 PM, David wrote:

 Please tell us what error you got - saying that something did not
 working is really not useful to help you. You need to say exactly what
 fails, and which steps you followed before that failure.

 I think what he means is that it's very slow; there's no discernible error,
 but dot-multiplies don't seem to be using BLAS.

 David
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Somewhat related topic: anyone know the status of EPD (Enthought
distribution) releases on i7 processors as far as this goes?
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Financial TS models

2010-09-18 Thread Wes McKinney
On Sat, Sep 18, 2010 at 10:33 AM,  josef.p...@gmail.com wrote:
 On Sat, Sep 18, 2010 at 10:13 AM,  josef.p...@gmail.com wrote:
 On Sat, Sep 18, 2010 at 8:09 AM, Virgil Stokes v...@it.uu.se wrote:
 I am considering the development of an all-Python package (with numpy and
 matplotlib) for the modeling and analysis of financial time series.

 This is rather ambitious and could take some time before I can have
 something that is useful. Thus, any suggestions, pointers, etc. to related work
 would be
 appreciated.

 I should have just asked you:   What do you have in mind?
 instead of writing my notes from the readings I just did into the email.

 I think any open source contributions in this area will be very useful
 with the increased popularity of python in Finance.

 Josef



 Depends on what you want to do, but I would join or build on top of an
 existing package.

 I just got distracted with an extended internet search after finding

 http://github.com/quantmind/dynts

 (They use Redis as an in-memory and persistent storage. After reading
 up a bit, I think this might be useful if you have a web front end
 http://github.com/lsbardel/jflow in mind, but maybe not as good as
 hdf5 for desktop work. Just guessing since I used neither, and I
 always worry about installation problems on Windows.)
 They just started public development but all packages are in BSD from
 what I have seen.

 Otherwise, I would build on top of pandas, scikits.timeseries or larry
 or tabular if you want to handle your own time variable.

 For specific financial time series, e.g. stocks, exchange rates,
 options, I have seen only bits and pieces, or only partially
 implemented code (with a BSD compatible license), outside of quantlib
 and its Python bindings.

 Maybe someone knows more about what's available.

 For the econometrics/statistical analysis I haven't seen much outside
 of pandas and statsmodels in this area (besides isolated examples and
 recipes). I started to write on this in the statsmodels sandbox
 (including simulators).

  "Modeling and analysis of financial time series" is a big project,
 and the only way to get any results within a reasonable amount of time (unless you
 are part of a large team) is to specialize on some pieces.

 This is just my impression, since I thought of doing the same thing,
 but didn't see any way to get very far.

 (I just spend some weekends just to get the data from the Federal
 Reserve and wrap the API for the economics data base (Fred) of the
 Federal Reserve Saint Louis, the boring storage backend is zipped
 csv-files)

 Josef


 Thank you,
 --V
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I'm also interested in what you might have in mind. For
econometric models, the place to start (and perhaps contribute) would
definitely be statsmodels. For other models it depends. Over the next
year or so I plan to be working as often as I can on implementing time
series models either inside statsmodels (with a pandas interface so
you can carry around metadata) or inside pandas itself (which is not
necessarily intended for financial applications, but that's what I've
used it for). I'm interested in both standard econometric models (see
e.g. Lütkepohl's 2006 time series book) and Bayesian time series
models.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy installation question (IronPython)

2010-07-13 Thread Wes McKinney
On Tue, Jul 13, 2010 at 8:26 AM, Sebastian Haase seb.ha...@gmail.com wrote:
 On Tue, Jul 13, 2010 at 2:20 PM, William Johnston willi...@tenbase2.com 
 wrote:
 Hello,

 I simply installed numpy in my Python26 installation, and then copied the
 numpy directory to my site-packages folder of my IronPython installation.

 Did I miss any installation steps in doing so?  The multiarray module could
 not be found using IronPython.

 Hi William,

 Why do you think that numpy works in IronPython ?
 I thought most Python modules work only with standard (C) Python.
 Numpy depends heavily on C implementations for most of its functionality.

 Regards,
 Sebastian Haase
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


William-- I think you will want to look here to be able to use NumPy
in IronPython for the time being:

http://code.google.com/p/ironclad/
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Memory leak found in ndarray (I think)?

2010-07-13 Thread Wes McKinney
On Mon, Jul 12, 2010 at 6:44 PM, Nathaniel Peterson
nathanielpeterso...@gmail.com wrote:
 Wes McKinney wesmck...@gmail.com writes:

 Did you mean to post a different link? That's the ticket I just created :)

 How silly of me! I meant http://projects.scipy.org/numpy/ticket/1427
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Looks to be caused by the same bug. I would suggest this should get
bumped up in priority for a fix? I am going to have a poke around the
C codebase but I don't feel competent enough to patch it myself.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Memory leak found in ndarray (I think)?

2010-07-12 Thread Wes McKinney
This one was quite a bear to track down, starting from the (of course)
very high-level observation of "why is my application leaking memory?"
I've reproduced it on Windows XP using NumPy 1.3.0 on Python 2.5 and
1.4.1 on Python 2.6 (EPD). Basically it seems that calling
.astype(bool) on an ndarray slice with object dtype is leaving a
hanging reference count, should be pretty obvious to see:

from datetime import datetime
import numpy as np
import sys

def foo(verbose=True):
    arr = np.array([datetime.today() for _ in xrange(1000)])
    arr = arr.reshape((500, 2))
    sl = arr[:, 0]

    if verbose: print 'Rec ct of index 0: %d' % sys.getrefcount(sl[0])

    for _ in xrange(10):
        foo = sl.astype(bool)

    if verbose: print 'Rec ct of index 0: %d' % sys.getrefcount(sl[0])

if __name__ == '__main__':
    foo()
    for i in xrange(1):
        if not i % 1000: print i
        foo(verbose=False)

On my machine this bleeds about 100 MB of memory that you don't get
back-- let me know if I've misinterpreted the results. I'll happily
create a ticket on the Trac page.

Thanks,
Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Memory leak found in ndarray (I think)?

2010-07-12 Thread Wes McKinney
On Mon, Jul 12, 2010 at 2:22 PM, Wes McKinney wesmck...@gmail.com wrote:
 This one was quite a bear to track down, starting from the of course
 very high level observation of why is my application leaking memory.
 I've reproduced it on Windows XP using NumPy 1.3.0 on Python 2.5 and
 1.4.1 on Python 2.6 (EPD). Basically it seems that calling
 .astype(bool) on an ndarray slice with object dtype is leaving a
 hanging reference count, should be pretty obvious to see:

 from datetime import datetime
 import numpy as np
 import sys

 def foo(verbose=True):
    arr = np.array([datetime.today() for _ in xrange(1000)])
    arr = arr.reshape((500, 2))
    sl = arr[:, 0]

    if verbose: print 'Rec ct of index 0: %d' % sys.getrefcount(sl[0])

    for _ in xrange(10):
        foo = sl.astype(bool)

    if verbose: print 'Rec ct of index 0: %d' % sys.getrefcount(sl[0])

 if __name__ == '__main__':
    foo()
    for i in xrange(1):
        if not i % 1000: print i
        foo(verbose=False)

 On my machine this bleeds about 100 MB of memory that you don't get
 back-- let me know if I've misinterpreted the results. I'll happily
 create a ticket on the Trac page.

 Thanks,
 Wes


Posted in Ticket #1542

http://projects.scipy.org/numpy/ticket/1542
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Memory leak found in ndarray (I think)?

2010-07-12 Thread Wes McKinney
On Mon, Jul 12, 2010 at 3:39 PM, Nathaniel Peterson
nathanielpeterso...@gmail.com wrote:
 This memory leak may be related: http://projects.scipy.org/numpy/ticket/1542
 It shows what appears to be a memory leak when calling astype('float')
 on an array of dtype 'object'.
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Did you mean to post a different link? That's the ticket I just created :)
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Wes McKinney
On Thu, Jul 8, 2010 at 1:35 PM, Rob Speer rsp...@mit.edu wrote:
 Forgive me if this is has already been addressed, but my question is
 what happens when we have more than one label (not as in a labeled
 axis but an observation label -- but not a tick because they're not
 unique!) per, say, row axis and heterogeneous dtypes.  This is really the
 problem that I would like to see addressed and from the BoF comments
 I'm not sure this use case is going to be covered.  I'm also not sure
 I expressed myself clearly enough or understood what's already
 available.  For me, this is the single most common use case and most
 of what we are talking about now is just convenient slicing but
 ignoring some basic and prominent concerns.  Please correct me if I'm
 wrong.  I need to play more with DataArray implementation but haven't
 had time yet.

 I often have data that looks like this (not really, but it gives the
 idea in a general way I think).

 city, month, year, region, precipitation, temperature
 Austin, January, 1980, South, 12.1, 65.4,
 Austin, February, 1980, South, 24.3, 55.4
 Austin, March, 1980, South, 3, 69.1
 
 Austin, December, 2009, 1, 62.1
 Boston, January, 1980, Northeast, 1.5, 19.2
 
 Boston,December, 2009, Northeast, 2.1, 23.5
 ...
 Memphis,January,1980, South, 2.1, 35.6
 ...
 Memphis,December,2009, South, 1.2, 33.5
 ...

 Your labels are unique if you look at them the right way. Here's how I
 would represent that in a datarray:
 * axis0 = 'city', ['Austin', 'Boston', ...]
 * axis1 = 'month', ['January', 'February', ...]
 * axis2 = 'year', [1980, 1981, ...]
 * axis3 = 'region', ['Northeast', 'South', ...]
 * axis4 = 'measurement', ['precipitation', 'temperature']

 and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2,
 axis3, axis4].

 Now I realize not everyone wants to represent their tabular data as a
 big tensor that they index every which way, and I think this is one
 thing that pandas is for.

 Oh, and the other problem with the 5-D datarray is that you'd probably
 want it to be sparse. This is another discussion worth having.

I have thought quite a bit about the sparsity problem as well. I took
a first crack at a sparse data structure for panel (3D) data called
LongPanel, so basically each row has two labels, and you can fairly
efficiently convert to the dense (3D) form. It's also capable of
constructing dummy variables for a fixed effects regression. There, of
course, per Skipper's question, you will nearly always have duplicate
labels-- I bet it's something we could generalize. It's also
very much related to the group-by procedures we've discussed.

 I want to eventually replace the labeling stuff in Divisi with
 datarray, but sparse matrices are largely the point of using Divisi.
 So how do we make a sparse datarray?

 One answer would be to have datarray be a wrapper that encapsulates
 any sufficiently matrix-like type. This is approximately what I did in
 the now-obsolete Divisi1. Nobody liked the fact that you had to wrap
 and unwrap your arrays to accomplish anything that we hadn't thought
 of in writing Divisi. I would not recommend this route.

 The other option, which is more like Divisi2. would be to provide the
 functionality of datarray using a mixin. Then a standard dense
 datarray could inherit from (np.ndarray, Datarray), while a sparse
 datarray could inherit from (sparse.csr_matrix, Datarray), for
 example.

 -- Rob
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-06 Thread Wes McKinney
On Tue, Jul 6, 2010 at 12:56 PM, Keith Goodman kwgood...@gmail.com wrote:
 On Tue, Jul 6, 2010 at 9:52 AM, Joshua Holbrook josh.holbr...@gmail.com 
 wrote:
 On Tue, Jul 6, 2010 at 8:42 AM, Skipper Seabold jsseab...@gmail.com wrote:
 On Tue, Jul 6, 2010 at 12:36 PM, Joshua Holbrook
 josh.holbr...@gmail.com wrote:
 I'm kinda-sorta still getting around to building/reading the sphinx
 docs for datarray. _ Like, I've gone through them before, but it was
 more cursory than I'd like. Honestly, I kinda let myself get caught up
 in trying to automate the process of getting them onto github pages.

 I have to admit that I didn't 100% understand the reasoning behind not
 allowing integer ticks (I blame jet lag--it's a nice scapegoat). I
 believe it originally had to do with what you meant if you typed, say,
 A[3:"london"]; Did you mean the underlying ndarray index 3, or the
 outer level tick 3? I think if you didn't allow integers, then you
 could simply wrap your 3 in a string: A["3":"London"] so it's
 probably not a deal-breaker, but I would imagine that using (a)
 separate method(s) for label-based indexing may make allowing
 integer-datatyped labels feasible.

 Thoughts?

 Would you mind bottom-posting/ posting in-line to make the thread
 easier to follow?


 --Josh

 On Tue, Jul 6, 2010 at 8:23 AM, Keith Goodman kwgood...@gmail.com wrote:
 On Tue, Jul 6, 2010 at 9:13 AM, Skipper Seabold jsseab...@gmail.com 
 wrote:
 On Tue, Jul 6, 2010 at 11:55 AM, Keith Goodman kwgood...@gmail.com 
 wrote:
 On Tue, Jul 6, 2010 at 7:47 AM, Joshua Holbrook 
 josh.holbr...@gmail.com wrote:
 I really really really want to work on this. I already forked datarray
 on github and did some research on What Other People Have Done (
 http://jesusabdullah.github.com/2010/07/02/datarray.html ). With any
 luck I'll contribute something actually useful. :)

 I like the figure!

 To do label indexing on a larry you need to use lix, so lar.lix[...]

 FYI, if you didn't see it, there are also usage docs in dataarray/doc
 that you can build with sphinx that show a lot of the thinking and
 examples (they spent time looking at pandas and larry).

 One question that was asked of Wes, that I'd propose to you as well
 Keith, is that if DataArray became part of NumPy, do you think you
 could use it to work on top of for larry?

 This is all very exciting. I did not know that DataArray had ticks so
 I never took a close look at it.

 After reading the sphinx doc, one question I had was how firm is the
 decision to not allow integer ticks? I use int ticks a lot.

 I think what Josh said is right.  However, we proposed having all of
 the new labeled axis access pushed to a .aix (or whatever) method, so
 as to avoid any confusion, as the original object can be accessed just
 as an ndarray.  I'm not sure where this leaves us vis-a-vis ints as
 ticks.

 Skipper
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 Sorry re: posting at-top. I guess habit surpassed observation of
 community norms for a second there. Whups!

 My opinion on the matter is that, as a matter of purity, labels
 should all have the string datatype. That said, I'd imagine that
 passing an int as an argument would be fine, due to python's
 loosey-goosey attitude towards datatypes. :) That, or, y'know,
 str(myint).

 Ideally (for me), the only requirement for ticks would be hashable and
 unique along any one axis. So, for example, datetime.date() could be a
 tick but a list could not be a tick (not hashable).
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Gmail really needs to get its act together and enable bottom-posting by
default. Definitely an annoyance.

There are many issues at play here so I wanted to give some of my
thoughts re: building pandas, larry, etc. on top of DataArray (or
whatever it is that makes its way into NumPy), can put this on the
wiki, too:

1. Giving semantic information to axes (not ticks, though)

I think this is very useful but wouldn't be immediately useful in
pandas except perhaps moving axis names elsewhere (which are currently
a part of the data-structures and always have the same name). I
wouldn't be immediately comfortable say, making a pandas DataFrame a
subclass of DataArray and making them implicitly interoperable. Going
back and forth e.g. from DataArray and DataFrame *should* be an easy
operation-- you could imagine using DataArray to serialize both pandas
and larry objects for example!

2. Container for axis metadata (Axis object in datarray, Index in pandas, ...)

I would be more than happy to offload the ordered set data structure
onto NumPy. In pandas, Index is that container-- it's an ndarray
subclass with a handful of methods and a reverse index (e.g. if you
have ['d', 'b', 'a', 'c'] you have a dict somewhere with {'d' : 0, 'b'
: 1, 
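
The reverse-index idea can be sketched in a few lines; the class and method
names below are illustrative only, not the actual pandas Index API:

import numpy as np

class LabelIndex(object):
    # sketch: ordered labels plus a reverse map from label to position
    def __init__(self, labels):
        self.labels = np.asarray(labels, dtype=object)
        self.reverse = dict((label, i) for i, label in enumerate(self.labels))

    def get_position(self, label):
        return self.reverse[label]

idx = LabelIndex(['d', 'b', 'a', 'c'])
assert idx.get_position('b') == 1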

Re: [Numpy-discussion] indexing question

2010-06-21 Thread Wes McKinney
On Mon, Jun 21, 2010 at 7:10 PM, Robert Kern robert.k...@gmail.com wrote:
 On Mon, Jun 21, 2010 at 17:42, Neal Becker ndbeck...@gmail.com wrote:
 Robert Kern wrote:

 On Mon, Jun 21, 2010 at 14:01, Neal Becker ndbeck...@gmail.com wrote:
 Can I find an efficient way to do this?

 I have a 2d array, A, 80 rows by 880 columns.

 I have a vector, B, of length 80, with scalar indexes.

 I assume you mean 880.

 I want a vector output C where
 C[i] = A[b[i],i] (i=0,879)

 C = A[b, np.arange(880)]


 Thanks!  Just what I needed.

 I wouldn't have guessed this.  Do we have a wiki to save useful examples
 like this?

 This kind of indexing is documented here:

  http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#integer

 Various examples are on the wiki here:

  http://www.scipy.org/Cookbook/Indexing

 --
 Robert Kern

 I have come to believe that the whole world is an enigma, a harmless
 enigma that is made terrible by our own mad attempt to interpret it as
 though it had an underlying truth.
  -- Umberto Eco
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


And to be completely pedantic one could use ndarray.take to do this
about 3-4x faster (but less intuitively):

A.take(b * 880 + np.arange(880))
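
A quick check that the two approaches agree, using random data with the
80x880 shape from the question:

import numpy as np

A = np.random.randn(80, 880)
b = np.random.randint(0, 80, size=880)

c_fancy = A[b, np.arange(880)]               # C[i] = A[b[i], i]
c_take = A.take(b * 880 + np.arange(880))    # same thing via flat (C-order) indices

assert np.allclose(c_fancy, c_take)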
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Tools / data structures for statistical analysis and related applications

2010-06-11 Thread Wes McKinney
On Fri, Jun 11, 2010 at 9:46 AM, Bruce Southey bsout...@gmail.com wrote:
 On 06/09/2010 03:40 PM, Wes McKinney wrote:
 Dear all,

 We've been having discussions on the pystatsmodels mailing list
 recently regarding data structures and other tools for statistics /
 other related data analysis applications.  I believe we're trying to
 answer a number of different, but related questions:

 1. What are the sets of functionality (and use cases) which would be
 desirable for the scientific (or statistical) Python programmer?
 Things like groupby
 (http://projects.scipy.org/numpy/browser/trunk/doc/neps/groupby_additions.rst)
 fall into this category.

 2. Do we really need to build custom data structures (larry, pandas,
 tabular, etc.) or are structured ndarrays enough? (My conclusion is
 that we do need to, but others might disagree). If so, how much
 performance are we willing to trade for functionality?

 3. What needs to happen for Python / NumPy / SciPy to really break
 into the statistical computing field? In other words, could a
 Python-based stack one day be a competitive alternative to R?

 These are just some ideas for collecting community input. Of course as
 we're all working in different problem domains, the needs of users
 will vary quite a bit across the board. We've started to collect some
 thoughts, links, etc. on the scipy.org wiki:

 http://scipy.org/StatisticalDataStructures

 A lot of what's there already is commentary and comparison on the
 functionality provided by pandas and la / larry (since Keith and I
 wrote most of the stuff there). But I think we're trying to identify
 more generally the things that are lacking in NumPy/SciPy and related
 libraries for particular applications. At minimum it should be good
 fodder for the SciPy conferences this year and afterward (I am
 submitting a paper on this subject based on my experiences).

 - Wes
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 If you need pure data storage then all you require is a timeseries,
 masked structured ndarray. That will handle time/dates, missing values
 and named variables. This is probably the basis of all statistical
 packages, databases and spreadsheets. But the real problem is the
 blas/lapack usage that prevents anything but a standard ndarray.

For storing data sets I can agree that a structured / masked ndarray
is sufficient. But I think a lot of people are primarily concerned
about data manipulations in memory (which can be currently quite
obtuse). If you are referring to scikits.timeseries-- it expects data
to be fixed frequency, which is too rigid an assumption for many
applications (like mine).


 The issue that I have with all these packages like tabular, la and
 pandas that extend ndarrays is the 'upstream'/'downstream' problem of
 open source development. The real problem with these extensions of numpy
 is that while you can have whatever storage you like, you either need to
 write your own functions or preprocess the storage into an acceptable
 form. So you have to rely on those extensions being updated with
 numpy/scipy since a 'fix' upstream can cause havoc downstream. I

In theory this could be a problem but of all packages to depend on in
the Python ecosystem, NumPy seems pretty safe. How many API breakages
have there been in ndarray in the last few years? Inherently this is a
risk of participating in open-source. After more than 2 years of
running a NumPy-SciPy based stack in production applications I feel
pretty comfortable. And besides, we write unit tests for a reason,
right?

 subscribe to what others have said elsewhere in the open source community
 in that it is very important to get your desired features upstream to
 the original project source - preferably numpy but scipy also counts.

From my experience developing pandas it's not clear to me what I've
done that _should_ make its way upstream into NumPy and / or SciPy.
You could imagine some form of high-level statistical data structure
making its way into scipy.stats but I'm not sure. If NumPy could
incorporate something like R's NA value without substantially
degrading performance then that would be a boon to the issue of
handling missing data (which MaskedArray does do for us-- but at
non-trivial performance loss). Data alignment routines, groupby (which
is already on the table), and NaN / missing data sensitive moving
window functions (mean, median, std, etc.) would be nice general
additions as well. Any other ideas?
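
To make the "data alignment" item concrete, here is a minimal sketch of
reindexing two labeled 1-D arrays onto the union of their labels, filling
gaps with NaN. This is illustrative only, not pandas' actual alignment code:

import numpy as np

def align(labels1, values1, labels2, values2):
    # sketch: put both value arrays on the union of the two label sets
    union = np.union1d(labels1, labels2)
    positions = dict((lab, i) for i, lab in enumerate(union))

    def reindex(labels, values):
        out = np.full(len(union), np.nan)
        for lab, val in zip(labels, values):
            out[positions[lab]] = val
        return out

    return union, reindex(labels1, values1), reindex(labels2, values2)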



 Bruce
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Tools / data structures for statistical analysis and related applications

2010-06-09 Thread Wes McKinney
Dear all,

We've been having discussions on the pystatsmodels mailing list
recently regarding data structures and other tools for statistics /
other related data analysis applications.  I believe we're trying to
answer a number of different, but related questions:

1. What are the sets of functionality (and use cases) which would be
desirable for the scientific (or statistical) Python programmer?
Things like groupby
(http://projects.scipy.org/numpy/browser/trunk/doc/neps/groupby_additions.rst)
fall into this category.

2. Do we really need to build custom data structures (larry, pandas,
tabular, etc.) or are structured ndarrays enough? (My conclusion is
that we do need to, but others might disagree). If so, how much
performance are we willing to trade for functionality?

3. What needs to happen for Python / NumPy / SciPy to really break
into the statistical computing field? In other words, could a
Python-based stack one day be a competitive alternative to R?

These are just some ideas for collecting community input. Of course as
we're all working in different problem domains, the needs of users
will vary quite a bit across the board. We've started to collect some
thoughts, links, etc. on the scipy.org wiki:

http://scipy.org/StatisticalDataStructures

A lot of what's there already is commentary and comparison on the
functionality provided by pandas and la / larry (since Keith and I
wrote most of the stuff there). But I think we're trying to identify
more generally the things that are lacking in NumPy/SciPy and related
libraries for particular applications. At minimum it should be good
fodder for the SciPy conferences this year and afterward (I am
submitting a paper on this subject based on my experiences).

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:
 On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
 On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu  
 wrote:
 I guess it's as fast as I'm going to get. I don't really see any
 other way. BTW, the lat/lons are integers)

 You could (in c or cython) try a brain-dead hashtable with no
 collision detection:

 for lat, long, data in dataset:
    bin = (lat ^ long) % num_bins
    hashtable[bin] = update_incremental_mean(hashtable[bin], data)

 you'll of course want to do some experiments to see if your data are
 sufficiently sparse and/or you can afford a large enough hashtable
 array that you won't get spurious hash collisions. Adding error-
 checking to ensure that there are no collisions would be pretty
 trivial (just keep a table of the lat/long for each hash value, which
 you'll need anyway, and check that different lat/long pairs don't get
 assigned the same bin).

 Zach



 -Mathew

 On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincuszachary.pin...@yale.edu
 wrote:
 Hi
 Can anyone think of a clever (non-lopping) solution to the
 following?

 A have a list of latitudes, a list of longitudes, and list of data
 values. All lists are the same length.

 I want to compute an average of data values for each lat/lon pair.
 e.g. if lat[1001] == lat[2001] and lon[1001] == lon[2001] then
 data[1001] = (data[1001] + data[2001])/2

 Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
    grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in
 grouped.items())

 Is that fast enough?

 If the lat lon can be converted to a 1d label as Wes suggested, then
 in a similar timing exercise ndimage was the fastest.
 http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

 And as you said your lats and lons are integers, you could simply do

 ll = lat*1000 + lon

 to get unique 'hashes' or '1d labels' for your latlon pairs, as a lat or
 lon will never exceed 360 (degrees).

 After that, either use the ndimage approach, or you could use
 histogramming with weighting by data values and divide by histogram
 without weighting, or just loop.

 Vincent


 (this was for python 2.4, also later I found np.bincount which
 requires that the labels are consecutive integers, but is as fast as
 ndimage)

 I don't know how it would compare to the new suggestions.

 Josef




 Zach
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I was curious about how fast ndimage was for this operation so here's
the complete function.

import collections

import numpy as np
import scipy.ndimage as ndi

N = 1

lat = np.random.randint(0, 360, N)
lon = np.random.randint(0, 360, N)
data = np.random.randn(N)

def group_mean(lat, lon, data):
    indexer = np.lexsort((lon, lat))
    lat = lat.take(indexer)
    lon = lon.take(indexer)
    sorted_data = data.take(indexer)

    keys = 1000 * lat + lon
    unique_keys = np.unique(keys)

    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
    decoder = keys.searchsorted(unique_keys)

    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

Appears to be about 13x faster (and could be made faster still) than
the naive version on my machine:

def group_mean_naive(lat, lon, data):
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
        grouped[(lt, ln)].append(da)

    averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())

    return averaged

I had to get the latest scipy trunk to not get an error from ndimage.mean
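
Following Josef's np.bincount remark, the same grouping can be sketched
without ndimage; labels must be non-negative integers, which holds here
since the lats and lons are integer degrees (illustrative only):

def group_mean_bincount(lat, lon, data):
    # sketch: means per (lat, lon) pair via np.unique + np.bincount
    keys = 1000 * lat + lon
    uniq, inv = np.unique(keys, return_inverse=True)
    means = np.bincount(inv, weights=data) / np.bincount(inv)
    return dict(zip(zip(uniq // 1000, uniq % 1000), means))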
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 thanks. I am also getting an error in ndi.mean
 Were you getting the error
 RuntimeError: data type not supported?

 -Mathew

 On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:
  On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
  On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu
   wrote:
  I guess it's as fast as I'm going to get. I don't really see any
  other way. BTW, the lat/lons are integers)
 
  You could (in c or cython) try a brain-dead hashtable with no
  collision detection:
 
  for lat, long, data in dataset:
     bin = (lat ^ long) % num_bins
     hashtable[bin] = update_incremental_mean(hashtable[bin], data)
 
  you'll of course want to do some experiments to see if your data are
  sufficiently sparse and/or you can afford a large enough hashtable
  array that you won't get spurious hash collisions. Adding error-
  checking to ensure that there are no collisions would be pretty
  trivial (just keep a table of the lat/long for each hash value, which
  you'll need anyway, and check that different lat/long pairs don't get
  assigned the same bin).
 
  Zach
 
 
 
  -Mathew
 
  On Tue, Jun 1, 2010 at 1:49 PM, Zachary
  Pincuszachary.pin...@yale.edu
  wrote:
  Hi
  Can anyone think of a clever (non-lopping) solution to the
  following?
 
  A have a list of latitudes, a list of longitudes, and list of data
  values. All lists are the same length.
 
  I want to compute an average  of data values for each lat/lon pair.
  e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take wa to long.
 
  As a start, are the equal lat/lon pairs exactly equal (i.e. either
  not floating-point, or floats that will always compare equal, that
  is,
  the floating-point bit-patterns will be guaranteed to be identical)
  or
  approximately equal to float tolerance?
 
  If you're in the approx-equal case, then look at the KD-tree in scipy
  for doing near-neighbors queries.
 
  If you're in the exact-equal case, you could consider hashing the
  lat/
  lon pairs or something. At least then the looping is O(N) and not
  O(N^2):
 
  import collections
  grouped = collections.defaultdict(list)
  for lt, ln, da in zip(lat, lon, data):
     grouped[(lt, ln)].append(da)
 
  averaged = dict((ltln, numpy.mean(da)) for ltln, da in
  grouped.items())
 
  Is that fast enough?
 
  If the lat lon can be converted to a 1d label as Wes suggested, then
  in a similar timing exercise ndimage was the fastest.
  http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
 
  And as you said your lats and lons are integers, you could simply do
 
  ll = lat*1000 + lon
 
  to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
  lon will never exceed 360 (degrees).
 
  After that, either use the ndimage approach, or you could use
  histogramming with weighting by data values and divide by histogram
  withouth weighting, or just loop.
 
  Vincent
 
 
  (this was for python 2.4, also later I found np.bincount which
  requires that the labels are consecutive integers, but is as fast as
  ndimage)
 
  I don't know how it would compare to the new suggestions.
 
  Josef
 
 
 
 
  Zach
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

 I was curious about how fast ndimage was for this operation so here's
 the complete function.

 import scipy.ndimage as ndi

 N = 1

 lat = np.random.randint(0, 360, N)
 lon = np.random.randint(0, 360, N)
 data = np.random.randn(N)

 def group_mean(lat, lon, data):
    indexer = np.lexsort((lon, lat))
    lat = lat.take(indexer)
    lon = lon.take(indexer)
    sorted_data = data.take(indexer)

    keys = 1000 * lat + lon
    unique_keys = np.unique(keys)

    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
    decoder = keys.searchsorted(unique_keys)

    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

 Appears to be about 13x faster (and could be made faster still) than
 the naive version on my machine:

 def group_mean_naive(lat, lon, data):
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
      grouped[(lt, ln)].append(da

Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 2:26 PM,  josef.p...@gmail.com wrote:
 On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 I'm on Windows, using a precompiled binary. I never built numpy/scipy on
 Windows.

 ndimage measurements has been recently rewritten. ndimage is very fast
 but (the old version) has insufficient type checking and may crash on
 wrong inputs.

 I managed to work with it in the past on Windows. Maybe you could try
 to check the dtypes of the arguments for ndi.mean. (Preferably in an
 interpreter session where you don't mind if it crashes)

 I don't remember the type restrictions, but there are/were several
 tickets for it.

 Josef



 On Wed, Jun 2, 2010 at 10:45 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates mat.yea...@gmail.com
 wrote:
  thanks. I am also getting an error in ndi.mean
  Were you getting the error
  RuntimeError: data type not supported?
 
  -Mathew
 
  On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com
  wrote:
 
  On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl
  wrote:
   On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
   On Tue, Jun 1, 2010 at 9:57 PM, Zachary
   Pincuszachary.pin...@yale.edu
    wrote:
   I guess it's as fast as I'm going to get. I don't really see any
   other way. BTW, the lat/lons are integers)
  
   You could (in c or cython) try a brain-dead hashtable with no
   collision detection:
  
   for lat, long, data in dataset:
      bin = (lat ^ long) % num_bins
      hashtable[bin] = update_incremental_mean(hashtable[bin], data)
  
   you'll of course want to do some experiments to see if your data
   are
   sufficiently sparse and/or you can afford a large enough hashtable
   array that you won't get spurious hash collisions. Adding error-
   checking to ensure that there are no collisions would be pretty
   trivial (just keep a table of the lat/long for each hash value,
   which
   you'll need anyway, and check that different lat/long pairs don't
   get
   assigned the same bin).
  
   Zach
  
  
  
   -Mathew
  
   On Tue, Jun 1, 2010 at 1:49 PM, Zachary
   Pincuszachary.pin...@yale.edu
   wrote:
   Hi
   Can anyone think of a clever (non-lopping) solution to the
   following?
  
   A have a list of latitudes, a list of longitudes, and list of
   data
   values. All lists are the same length.
  
   I want to compute an average  of data values for each lat/lon
   pair.
   e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
   data[1001] = (data[1001] + data[2001])/2
  
   Looping is going to take wa to long.
  
   As a start, are the equal lat/lon pairs exactly equal (i.e.
   either
   not floating-point, or floats that will always compare equal, that
   is,
   the floating-point bit-patterns will be guaranteed to be
   identical)
   or
   approximately equal to float tolerance?
  
   If you're in the approx-equal case, then look at the KD-tree in
   scipy
   for doing near-neighbors queries.
  
   If you're in the exact-equal case, you could consider hashing the
   lat/
   lon pairs or something. At least then the looping is O(N) and not
   O(N^2):
  
   import collections
   grouped = collections.defaultdict(list)
   for lt, ln, da in zip(lat, lon, data):
      grouped[(lt, ln)].append(da)
  
   averaged = dict((ltln, numpy.mean(da)) for ltln, da in
   grouped.items())
  
   Is that fast enough?
  
   If the lat lon can be converted to a 1d label as Wes suggested, then
   in a similar timing exercise ndimage was the fastest.
   http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
  
   And as you said your lats and lons are integers, you could simply do
  
   ll = lat*1000 + lon
  
   to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat
   or
   lon will never exceed 360 (degrees).
  
   After that, either use the ndimage approach, or you could use
   histogramming with weighting by data values and divide by histogram
   withouth weighting, or just loop.
  
   Vincent
  
  
   (this was for python 2.4, also later I found np.bincount which
   requires that the labels are consecutive integers, but is as fast as
   ndimage)
  
   I don't know how it would compare to the new suggestions.
  
   Josef
  
  
  
  
   Zach
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
