[Numpy-discussion] ANN: pandas 0.11.0 released!
hi all,

We've released pandas 0.11.0, a big release spanning 3 months of continuous development, led primarily by the intrepid Jeff Reback and y-p. The release brings many new features, performance and API improvements, bug fixes, and other goodies. Some highlights:

- New precision indexing fields loc, iloc, at, and iat, to reduce occasional ambiguity in the hitherto catch-all ix method.
- Expanded support for NumPy data types in DataFrame
- NumExpr integration to accelerate various operator evaluation
- New Cookbook and "10 minutes to pandas" pages in the documentation by Jeff Reback
- Improved DataFrame-to-CSV exporting performance
- Experimental rplot branch with faceted plots with matplotlib merged and open for community hacking

Source archives and Windows installers are on PyPI. Thanks to all who contributed to this release, especially Jeff and y-p.

What's new: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html
Installers: http://pypi.python.org/pypi/pandas

$ git log v0.10.1..v0.11.0 --pretty=format:%aN | sort | uniq -c | sort -rn
 308 y-p
 279 jreback
  85 Vytautas Jancauskas
  74 Wes McKinney
  25 Stephen Lin
  22 Andy Hayden
  19 Chang She
  13 Wouter Overmeire
   8 Spencer Lyon
   6 Phillip Cloud
   6 Nicholaus E. Halecky
   5 Thierry Moisan
   5 Skipper Seabold
   4 waitingkuo
   4 Loïc Estève
   4 Jeff Reback
   4 Garrett Drapala
   4 Alvaro Tejero-Cantero
   3 lexual
   3 Dražen Lučanin
   3 dieterv77
   3 dengemann
   3 Dan Birken
   3 Adam Greenhall
   2 Will Furnass
   2 Vytautas Jančauskas
   2 Robert Gieseke
   2 Peter Prettenhofer
   2 Jonathan Chambers
   2 Dieter Vandenbussche
   2 Damien Garaud
   2 Christopher Whelan
   2 Chapman Siu
   2 Brad Buran
   1 vytas
   1 Tim Akinbo
   1 Thomas Kluyver
   1 thauck
   1 stephenwlin
   1 K.-Michael Aye
   1 Karmel Allison
   1 Jeremy Wagner
   1 James Casbon
   1 Illia Polosukhin
   1 Dražen Lučanin
   1 davidjameshumphreys
   1 Dan Davison
   1 Chris Withers
   1 Christian Geier
   1 anomrake

Happy data hacking!

- Wes

What is it
==========
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational, time series, or any other kind of labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Links
=====
Release Notes: http://github.com/pydata/pandas/blob/master/RELEASE.rst
Documentation: http://pandas.pydata.org
Installers: http://pypi.python.org/pypi/pandas
Code Repository: http://github.com/pydata/pandas
Mailing List: http://groups.google.com/group/pydata

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
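For readers who haven't tried the new indexers, a minimal sketch of how they divide up the old ix behavior (the frame here is invented for illustration):

    import pandas as pd

    df = pd.DataFrame({'x': [1, 2, 3]}, index=[10, 20, 30])

    df.loc[20, 'x']   # strictly label-based selection
    df.iloc[1, 0]     # strictly integer-position-based selection
    df.at[20, 'x']    # fast scalar access by label
    df.iat[1, 0]      # fast scalar access by position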
[Numpy-discussion] ANN: pandas 0.10.1 is released
hi all,

We've released pandas 0.10.1, which includes many bug fixes from 0.10.0 (including a number of issues with the new file parser, e.g. reading multiple files in separate threads), various performance improvements, and major new PyTables/HDF5-based functionality contributed by Jeff Reback. I strongly recommend that all users upgrade.

Thanks to all who contributed to this release, especially Chang She, Jeff Reback, and Yoval P. As always source archives and Windows installers are on PyPI.

What's new: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html
Installers: http://pypi.python.org/pypi/pandas

$ git log v0.10.0..v0.10.1 --pretty=format:%aN | sort | uniq -c | sort -rn
  66 jreback
  59 Wes McKinney
  43 Chang She
  12 y-p
   5 Vincent Arel-Bundock
   4 Damien Garaud
   3 Christopher Whelan
   3 Andy Hayden
   2 Jay Parlar
   2 Dan Allan
   1 Thouis (Ray) Jones
   1 svaksha
   1 herrfz
   1 Garrett Drapala
   1 elpres
   1 Dieter Vandenbussche
   1 Anton I. Sipos

Happy data hacking!

- Wes

What is it
==========
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational, time series, or any other kind of labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Links
=====
Release Notes: http://github.com/pydata/pandas/blob/master/RELEASE.rst
Documentation: http://pandas.pydata.org
Installers: http://pypi.python.org/pypi/pandas
Code Repository: http://github.com/pydata/pandas
Mailing List: http://groups.google.com/group/pydata

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] ANN: pandas 0.10.0 released
hi all,

I'm super excited to announce the pandas 0.10.0 release. This is a major release including a new high-performance file reading engine with tons of new user-facing functionality as well, a bunch of work on the HDF5/PyTables integration layer, much-expanded Unicode support, a new option/configuration interface, integration with the Google Analytics API, and a wide array of other new features, bug fixes, and performance improvements. I strongly recommend that all users upgrade as soon as feasible. Many of the performance improvements over 0.9.x are quite substantial; see the vbenchmarks at the end of this e-mail.

As of this release, we are no longer supporting Python 2.5. Also, this is the first release to officially support Python 3.3.

Note: there are a number of minor, but necessary API changes that long-time pandas users should pay attention to in the What's New.

Thanks to all who contributed to this release, especially Chang She, Yoval P, and Jeff Reback (and everyone else listed in the commit log!). As always source archives and Windows installers are on PyPI.

What's new: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html
Installers: http://pypi.python.org/pypi/pandas

$ git log v0.9.1..v0.10.0 --pretty=format:%aN | sort | uniq -c | sort -rn
 246 Wes McKinney
 140 y-p
  99 Chang She
  45 jreback
  18 Abraham Flaxman
  17 Jeff Reback
  14 locojaydev
  11 Keith Hughitt
   5 Adam Obeng
   2 Dieter Vandenbussche
   1 zach powers
   1 Luke Lee
   1 Laurent Gautier
   1 Ken Van Haren
   1 Jay Bourque
   1 Donald Curtis
   1 Chris Mulligan
   1 alex arsenovic
   1 A. Flaxman

Happy data hacking!

- Wes

What is it
==========
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational, time series, or any other kind of labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Links
=====
Release Notes: http://github.com/pydata/pandas/blob/master/RELEASE.rst
Documentation: http://pandas.pydata.org
Installers: http://pypi.python.org/pypi/pandas
Code Repository: http://github.com/pydata/pandas
Mailing List: http://groups.google.com/group/pydata
Performance vs. v0.9.0
======================
Benchmarks from https://github.com/pydata/pandas/tree/master/vb_suite. Ratio < 1 means that v0.10.0 is faster.

                                         v0.10.0    v0.9.0   ratio
name
unstack_sparse_keyspace                   1.2813  144.1262  0.0089
groupby_frame_apply_overhead             20.1520  337.3330  0.0597
read_csv_comment2                        25.3097  363.2860  0.0697
groupby_frame_apply                      75.1554  504.1661  0.1491
frame_iteritems_cached                    0.0711    0.3919  0.1815
read_csv_thou_vb                         35.2690  191.9360  0.1838
concat_small_frames                      12.9019   55.3561  0.2331
join_dataframe_integer_2key               5.8184   21.5823  0.2696
series_value_counts_strings               5.3824   19.1262  0.2814
append_frame_single_homogenous            0.3413    0.9319  0.3662
read_csv_vb                              18.4084   46.9500  0.3921
read_csv_standard                        12.0651   29.9940  0.4023
panel_from_dict_all_different_indexes    73.6860  158.2949  0.4655
frame_constructor_ndarray                 0.0471    0.0958  0.4918
groupby_first                             3.8502    7.1988  0.5348
groupby_last                              3.6962    6.7792  0.5452
panel_from_dict_two_different_indexes    50.7428   86.4980  0.5866
append_frame_single_mixed                 1.2950    2.1930  0.5905
frame_get_numeric_data                    0.0695    0.1119  0.6212
replace_fillna                            4.6349    7.0540  0.6571
frame_to_csv                            281.9340  427.7921  0.6590
replace_replacena                         4.7154    7.1207  0.6622
frame_iteritems                           2.5862    3.7463  0.6903
series_align_int64_index                 29.7370   41.2791  0.7204
join_dataframe_integer_key                1.7980    2.4303  0.7398
groupby_multi_size                       31.0066   41.7001  0.7436
groupby_frame_singlekey_integer           2.3579    3.1649  0.7450
write_csv_standard                      326.8259  427.3241  0.7648
groupby_simple_compress_timing           41.2113   52.3993  0.7865
frame_fillna_inplace                     16.2843   20.0491  0.8122
reindex_fillna_backfill                   0.1364    0.1667  0.8181
groupby_multi_series_op                  15.2914   18.6651  0.8193
groupby_multi_cython                     17.2169   20.4420  0.8422
frame_fillna_many_columns_pad
Re: [Numpy-discussion] memory-efficient loadtxt
On Monday, October 1, 2012, Chris Barker wrote:

Paul, Nice to see someone working on these issues, but: I'm not sure what problem you are trying to solve -- accumulating in a list is pretty efficient anyway -- not a whole lot of overhead. But if you do want to improve that, it may be better to change the accumulating method, rather than doing the double-read thing. I've written, and posted here, code that provides an Accumulator that uses numpy internally, so not much memory overhead. In the end, it's not any faster than accumulating in a list and then converting to an array, but it does use less memory. I also have a Cython version that is not quite done (darn regular job getting in the way) that is both faster and more memory efficient. Also, frankly, just writing the array pre-allocation and re-sizing code into loadtxt would not be a whole lot of code either, and would be both fast and memory efficient. Let me know if you want any of my code to play with.

However, I got the impression that someone was working on a More Advanced (TM) C-based file reader, which will replace loadtxt;

yes -- I wonder what happened with that? Anyone? -CHB

this patch is intended as a useful thing to have while we're waiting for that to appear. The patch passes all tests in the test suite, and documentation for the kwarg has been added. I've modified all tests to include the seekable kwarg, but that was mostly to check that all tests are passed also with this kwarg. I guess it's a bit too late for 1.7.0 though? Should I make a pull request? I'm happy to take any and all suggestions before I do.

Cheers
Paul

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov

I've finally built a new, very fast C-based tokenizer/parser with type inference, NA-handling, etc. for pandas sporadically over the last month -- it's almost ready to ship. It's roughly an order of magnitude faster than loadtxt and uses very little temporary space. Should be easy to push upstream into NumPy to replace the innards of np.loadtxt if I can get a bit of help with the plumbing (it already yields structured arrays in addition to pandas DataFrames so there isn't a great deal that needs doing). Blog post with CPU and memory benchmarks to follow -- will post a link here.

- Wes

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
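As a concrete illustration of the pre-allocate-and-resize strategy Chris mentions, here is a rough sketch of one way it could look (my own code, not the accumulator he posted):

    import numpy as np

    class Accumulator:
        """Append scalars into a NumPy buffer that doubles when full."""
        def __init__(self, dtype=np.float64):
            self._buf = np.empty(1024, dtype=dtype)
            self._n = 0

        def append(self, value):
            if self._n == len(self._buf):
                # geometric growth keeps the amortized append cost O(1)
                self._buf = np.resize(self._buf, 2 * len(self._buf))
            self._buf[self._n] = value
            self._n += 1

        def toarray(self):
            return self._buf[:self._n].copy()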
Re: [Numpy-discussion] NumPy 1.7 release delays
On Tue, Jun 26, 2012 at 8:15 PM, Skipper Seabold jsseab...@gmail.com wrote:

On Tue, Jun 26, 2012 at 7:59 PM, Fernando Perez fperez@gmail.com wrote:

On Tue, Jun 26, 2012 at 1:10 PM, Travis Oliphant tra...@continuum.io wrote: One issue is the one that Sage identified about the array interface regression as noted by Jason. Any other regressions from 1.5.x need to be addressed as well. We'll have to decide on a case-by-case basis if there are issues that conflict with 1.6.x behavior.

One thing this discussion made me think about is that it would be great to identify a few key projects that: - use numpy heavily - have reasonably solid test suites and create a special build job that runs *those* test suites periodically. Not necessarily on every last numpy commit, but at least on a reasonable schedule. I think having that kind of information readily available, and with the ability to switch which numpy branch/commit those tests get run against, could be very valuable as an early warning system for numpy to know if an apparently inconsequential change has unexpected side effects downstream. In IPython we've really benefited greatly from our improved CI infrastructure, but that only goes as far as catching *our own* problems. This kind of downstream integration testing could be very useful.

+1. Was thinking the same thing. My uninformed opinion from the sidelines: For me, this begged the question of why projects would wait so long and upgrade from 1.5.x straight to 1.7.x; it sounded to me like an outreach problem. The whole point of having release candidates is so that downstream users (and especially big public downstream libraries) can test the release candidate and give feedback on any changes that affect them. This feedback step is especially crucial for a project without 100% test coverage (for new code and old)... Putting more restrictions on changes that can be made in releases doesn't seem to me to be the right fix, though, admittedly, numpy is a bit of a different beast than other projects. I would think you would want downstream projects not to wait 2 years to upgrade and skip a couple of minor releases.

Skipper

+1. We've begun running pandas's test suite internally against NumPy git master on Jenkins. It has already turned up bugs and behavior changes in a few short weeks. We should definitely do this on a grander scale (especially since pandas 0.8.0 is now littered with hacks around NumPy 1.6 datetime bugs; fortunately nothing was fatally broken, but it came close).

- Wes

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Created NumPy 1.7.x branch
On Thu, Jun 21, 2012 at 2:49 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Thu, Jun 21, 2012 at 5:25 PM, Travis Oliphant tra...@continuum.io wrote: I thought it was clear we were doing a 1.7 release before SciPy. It seems pretty urgent that we get something out sooner than later. I know there is never enough time to do all the things we want to do. There is time before the first Release candidate to make changes on the 1.7.x branch. If you want to make the changes on master, and just indicate the Pull requests, Ondrej can make sure they are added to the 1.7.x. branch by Monday. We can also delay the first Release Candidate by a few days to next Wednesday and then bump everything 3 days if that will help. There will be a follow-on 1.8 release before the end of the year --- so there is time to make changes for that release as well. The next release will not take a year to get out, so we shouldn't feel pressured to get *everything* in this release. What about http://projects.scipy.org/numpy/ticket/2108? Someone needs to at least answer the question of how much of datetime is unusable on Windows with the current code. If that's not a lot then perhaps this is not a blocker, but we did consider it one until now. pandas has become a heavy consumer of datetime64 recently, and we haven't had any issues using VS2003 and VS2008, but haven't tested heavily against NumPy compiled with mingw outside of the version shipped in Enthought Python Distribution (the test suite passes fine, last time I checked). Of the other tickets (http://projects.scipy.org/numpy/report/3) it would also be good to get an assessment of which ones are critical. Perhaps none of them are and the branch is in good shape for a release, but some of those segfaults would be nice to have fixed. Debian multi-arch support too, as discussed on this list recently. Ralf ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Enum/Factor NEP (now with code)
On Sun, Jun 17, 2012 at 6:10 AM, Nathaniel Smith n...@pobox.com wrote:

On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney wesmck...@gmail.com wrote: It looks like the levels can only be strings. This is too limited for my needs. Why not support all possible NumPy dtypes? In pandas world, the levels can be any unique Index object.

It seems like there are three obvious options, from most to least general:
1) Allow levels to be an arbitrary collection of hashable Python objects
2) Allow levels to be a homogeneous collection of objects of any arbitrary numpy dtype
3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)

I agree that (3) is a bit limiting. (1) is probably easier to implement than (2). (2) is the most general, since of course "arbitrary Python object" is a dtype. Is it useful to be able to restrict levels to be of homogeneous type? The main difference between dtypes and python types is that (most) dtype scalars can be unboxed -- is that substantively useful for levels?

What is the story for NA values (NaL?) in a factor array? I code them as -1 in the labels, though you could use INT32_MAX or something. This is very important in the context of groupby operations.

If we have a type restriction on levels (options (2) or (3) above), then how to handle out-of-bounds values is quite a problem, yeah. Once we have NA dtypes then I suppose we could use those, but we don't yet. It's tempting to just error out of any operation that encounters such values.

Nathaniel: my experience (see blog posting above for a bit more) is that khash really crushes PyDict for two reasons: you can use it with primitive types and avoid boxing, and secondly you can preallocate. Its memory footprint with large hashtables is also a fraction of PyDict's. The Python memory allocator is not problematic -- if you create millions of Python objects, expect the RAM usage of the Python process to balloon absurdly.

Right, I saw that posting -- it's clear that khash has a lot of advantages as internal temporary storage for a specific operation like groupby on unboxed types. But I can't tell whether those arguments still apply now that we're talking about a long-term storage representation for data that has to support a variety of operations (many of which would require boxing/unboxing, since the API is in Python), might or might not use boxed types, etc. Obviously this also depends on which of the three options above we go with -- unboxing doesn't even make sense for option (1). -n

I'm in favor of option #2 (a lite version of what I'm doing currently: I handle a few dtypes, namely PyObject, int64, datetime64, and float64), though you'd have to go the code-generation route for all the dtypes to keep yourself sane if you do that.

- Wes

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
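For concreteness, a small sketch of the labels-plus-levels representation with NA coded as -1, as Wes describes (the helper code and names here are mine):

    import numpy as np

    levels = np.array(['a', 'b', 'c'], dtype=object)
    labels = np.array([0, 2, -1, 1, 1], dtype=np.int32)  # -1 marks NA

    # materialize boxed values, mapping the -1 sentinel to None
    values = levels.take(np.where(labels < 0, 0, labels))
    values[labels < 0] = None
    # values -> array(['a', 'c', None, 'b', 'b'], dtype=object)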
Re: [Numpy-discussion] Enum/Factor NEP (now with code)
On Wed, Jun 13, 2012 at 2:12 PM, Nathaniel Smith n...@pobox.com wrote:

On Wed, Jun 13, 2012 at 5:44 PM, Bryan Van de Ven bry...@continuum.io wrote:

On 6/13/12 8:33 AM, Nathaniel Smith wrote: Hi Bryan, I skimmed over the diff: https://github.com/bryevdv/numpy/compare/master...enum It was a bit hard to read since it seems like about half the changes in that branch are datetime cleanups or something? I hope you'll separate those out -- it's much easier to review self-contained changes, and the more changes you roll together into a big lump, the more risk there is that they'll get lost all together.

I'm not quite sure what happened there, my git skills are not advanced by any measure. I think the datetime changes are a much smaller fraction than fifty percent, but I will see what I can do to separate them out in the near future.

Looking again, it looks like a lot of it is actually because when I asked github to show me the diff between your branch and master, it showed me the diff between your branch and *your repository's* version of master. And your branch is actually based off a newer version of 'master' than you have in your repository. So, as far as git and github are concerned, all those changes that are included in your-branch's-base-master but not in your-repo's-master are new stuff that you did on your branch. Solution is just to do "git push <your github remote name> master".

From the updated NEP I actually understand the use case for open types now, so that's good :-). But I don't think they're actually workable, so that's bad :-(. The use case, as I understand it, is for when you want to extend the levels set on the fly as you read through a file. The problem with this is that it produces a non-deterministic level ordering, where level 0 is whatever was seen first in the file, level 1 is whatever was seen second, etc. E.g., say I have a CSV file I read in:

subject,initial_skill,skill_after_training
1,LOW,HIGH
2,LOW,LOW
3,HIGH,HIGH
...

With the scheme described in the NEP, my initial_skill dtype will have levels ["LOW", "HIGH"], and my skill_after_training dtype will have levels ["HIGH", "LOW"], which means that their storage will be incompatible, comparisons won't work (or will have to go through some

I imagine users using the same open dtype object in both fields of the structure dtype used to read in the file, if both fields of the file contain the same categories. If they don't contain the same categories, they are incomparable in any case. I believe many users have this simpler use case where each field is a separate category, and they want to read them all individually, separately on the fly. For these simple cases, it would just work. For your case example there would definitely be a documentation, examples, tutorials, education issue, to avoid the gotcha you describe.

Yes, of course we *could* write the code to implement these open dtypes, and then write the documentation, examples, tutorials, etc. to help people work around their limitations. Or, we could just implement np.fromfile properly, which would require no workarounds and take less code to boot.

nasty convert-to-string-and-back path), etc. Another situation where this will occur is if you have multiple data files in the same format; whether or not you're able to compare the data from them will depend on the order the data happens to occur in each file.
The solution is that whenever we automagically create a set of levels from some data, and the user hasn't specified any order, we should pick an order deterministically by sorting the levels. (This is also what R does: levels(factor(c("a", "b"))) -> "a", "b"; levels(factor(c("b", "a"))) -> "a", "b".)

A solution is to create the dtype object when reading in the first file, and to reuse that same dtype object when reading in subsequent files. Perhaps it's not ideal, but it does enable the work to be done.

So would a proper implementation of np.fromfile that normalized the level ordering.

Can you explain why you're using khash instead of PyDict? It seems to add a *lot* of complexity -- like it seems like you're using about as many lines of code just marshalling data into and out of the khash as I used for my old npenum.pyx prototype (not even counting all the extra work required to ...), and AFAICT my prototype has about the same amount of functionality as this. (Of course that's not entirely fair, because I was working in Cython... but why not work in Cython?) And you'll need to expose a Python dict interface sooner or later anyway, I'd think?

I suppose I agree with the sentiment that the core of NumPy really ought to be less dependent on the Python C API, not more. I also think the khash API is pretty dead simple and straightforward, and the fact that it is contained in a single header is attractive. It's also quite performant in time and space. But if others disagree strongly, all of it's
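A minimal illustration of the deterministic sorted-level ordering proposed above, using np.unique (my example, not from the NEP):

    import numpy as np

    # np.unique sorts the levels, so the codes do not depend on the
    # order in which values happen to appear in a given file
    col = np.array(["LOW", "HIGH", "LOW", "HIGH"])
    levels, codes = np.unique(col, return_inverse=True)
    # levels -> array(['HIGH', 'LOW'], ...); codes -> array([1, 0, 1, 0])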
Re: [Numpy-discussion] Enum/Factor NEP (now with code)
On Wed, Jun 13, 2012 at 5:19 PM, Bryan Van de Ven bry...@continuum.io wrote: On 6/13/12 1:54 PM, Wes McKinney wrote: OK, I need to spend some time on this as it will directly impact me. Random thoughts here. It looks like the levels can only be strings. This is too limited for my needs. Why not support all possible NumPy dtypes? In pandas world, the levels can be any unique Index object (note, I'm going to change the name of the Factor class to Categorical before 0.8.0 final per discussion with Nathaniel): The current for-discussion prototype currently only supports strings. I had mentioned integral levels in the NEP but wanted to get more feedback first. It looks like you are using intervals as levels in things like qcut? This would add some complexity. I can think of a couple of possible approaches I will have to try a few of them out to see what would make the most sense. The API for constructing an enum/factor/categorical array from fixed levels and an array of labels seems somewhat weak to me. A very common scenario is to need to construct a factor from an array of integers with an associated array of levels: In [13]: labels Out[13]: array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7, 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8, 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1, 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8, 7, 9, 7, 3, 3, 0, 4, 4]) In [14]: levels Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) In [15]: Factor(labels, levels) Out[15]: Factor: array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7, 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8, 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1, 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8, 7, 9, 7, 3, 3, 0, 4, 4]) Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) I originally had a very similar interface in the NEP. I was persuaded by Mark that this would be redundant: In [10]: levels = np.factor(['a', 'b', 'c']) # or levels = np.factor_array(['a', 'b', 'c', 'a', 'b']).dtype In [11]: np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], levels) Out[11]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], dtype='factor({'c': 2, 'a': 0, 'b': 1})') This should also spell even more closely to your example as: labels.astype(levels) but I have not done much with casting yet, so this currently complains. However, would this satisfy your needs (modulo the separate question about more general integral or object levels)? What is the story for NA values (NaL?) in a factor array? I code them as -1 in the labels, though you could use INT32_MAX or something. This is very important in the context of groupby operations. I am just using INT32_MIN at the moment. Are the levels ordered (Nathaniel brought this up already looks like)? It doesn't look like it. That is also necessary. You also need to be They currently compare based on their value: In [20]: arr = np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], np.factor({'c':0, 'b':1, 'a':2})) In [21]: arr Out[21]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], dtype='factor({'c': 0, 'a': 2, 'b': 1})') In [22]: arr.sort() In [23]: arr Out[23]: array(['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'a'], dtype='factor({'c': 0, 'a': 2, 'b': 1})') able to sort the levels (which is a relabeling, I have lots of code in use for this). 
In the context of groupby in pandas, when processing a key (array of values) to a factor to be used for aggregating some data, you have the option of returning an object that has the levels as observed in the data or sorting. Sorting can obviously be very expensive depending on the number of groups in the data (http://wesmckinney.com/blog/?p=437). Example:

from pandas import DataFrame
from pandas.util.testing import rands
import numpy as np

df = DataFrame({'key' : [rands(10) for _ in xrange(100000)] * 10,
                'data' : np.random.randn(1000000)})

In [32]: timeit df.groupby('key').sum()
1 loops, best of 3: 374 ms per loop

In [33]: timeit df.groupby('key', sort=False).sum()
10 loops, best of 3: 185 ms per loop

The factorization time for the `key` column dominates the runtime; the factor is computed once then reused if you keep the GroupBy object around:

In [36]: timeit grouped.sum()
100 loops, best of 3: 6.05 ms per loop

Just some numbers for comparison. Factorization times:

In [41]: lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
In [42]: levels = np.factor(lets)
In [43]: data = [lets[int(x)] for x in np.random.randn(1000000)]
In [44]: %timeit np.array(data, levels)
10 loops, best of 3: 137 ms per loop

And retrieving group indices/summing:

In [8]: %timeit arr=='a'
1000 loops, best of 3: 1.52 ms per
Re: [Numpy-discussion] Status of np.bincount
On Thu, May 3, 2012 at 12:51 PM, Tony Yu tsy...@gmail.com wrote:

On Thu, May 3, 2012 at 9:57 AM, Robert Kern robert.k...@gmail.com wrote:

On Thu, May 3, 2012 at 2:50 PM, Robert Elsner ml...@re-factory.de wrote: Am 03.05.2012 15:45, schrieb Robert Kern:

On Thu, May 3, 2012 at 2:24 PM, Robert Elsner ml...@re-factory.de wrote: Hello Everybody, is there any news on the status of np.bincount with respect to big numbers? It seems I have just been bitten by #225. Is there an efficient way around? I found the np.histogram function painfully slow. Below a simple script that demonstrates bincount failing with a memory error on big numbers:

import numpy as np
x = np.array((30e9,)).astype(int)
np.bincount(x)

Any good idea how to work around it? My arrays contain somewhat 50M entries in the range from 0 to 30e9. And I would like to have them bincounted...

You need a sparse data structure, then. Are you sure you even have duplicates? Anyways, I won't work out all of the details, but let me sketch something that might get you your answers. First, sort your array. Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1,len(x)) to find the indices where each sorted value changes over to the next. The np.diff() of that should give you the size of each. Use np.unique to get the sorted unique values to match up with those sizes. Fixing all of the off-by-one errors and dealing with the boundary conditions correctly is left as an exercise for the reader.

?? I suspect that this mail was meant to end up in the thread about sparse array data?

No, I am responding to you.

Hi Robert (Elsner), Just to expand a bit on Robert Kern's explanation: Your problem is only partly related to Ticket #225. Even if that is fixed, you won't be able to call `bincount` with an array containing the value `30e9` unless you implement something using sparse arrays, because `bincount` wants to return an array that's `30e9 + 1` in length, which isn't going to happen. -Tony

hi Robert, I suggest you try the value_counts instance method on pandas.Series:

In [9]: ints = np.random.randint(0, 30e9, size=100000)
In [10]: all_ints = Series(ints.repeat(500))
In [11]: all_ints.value_counts()
Out[11]:
16420382874    500
7147863689     500
4019588415     500
17462388002    500
11956087699    500
1498988500     500
3811318398     500
6333517765     500
16077665866    500
17559759721    500
5898309082     500
25213150655    500
17877388690    500
3122117900     500
6242860212     500
...
6344689036     500
16817048573    500
16361777055    500
4376828961     500
15910505187    500
12051499627    500
23857610954    500
24557975709    500
28135006018    500
1661624653     500
6747702840     500
24601775145    500
7290769930     500
9417075109     500
12071596222    500
Length: 100000

This method uses a C hash table and takes about 1 second to compute the bin counts for 50mm entries and 100k unique values.

- Wes

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
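For reference, a sketch of the sort/diff recipe Robert Kern outlines, with the boundary conditions filled in (variable names are mine; big_values is a placeholder for the actual data):

    import numpy as np

    x = np.sort(big_values)  # e.g. 50M ints in the range [0, 30e9)
    # positions where the sorted value changes, bracketed by 0 and len(x)
    change = np.flatnonzero(np.not_equal(x[:-1], x[1:])) + 1
    edges = np.concatenate(([0], change, [len(x)]))
    counts = np.diff(edges)   # count of each distinct value
    values = x[edges[:-1]]    # the distinct values, same as np.unique(x)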
Re: [Numpy-discussion] Issue Tracking
On Wed, May 2, 2012 at 9:48 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Tue, May 1, 2012 at 11:47 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Wed, May 2, 2012 at 1:48 AM, Pauli Virtanen p...@iki.fi wrote: 01.05.2012 21:34, Ralf Gommers kirjoitti: [clip] At this point it's probably good to look again at the problems we want to solve: 1. responsive user interface (must absolutely have) Now that it comes too late: with some luck, I've possibly hit on what was ailing the Tracs (max_diff_bytes configured too large). Let's see if things work better from now on... That's amazing - not only does it not give errors anymore, it's also an order of magnitude faster. So maybe we could just stick with trac. Performance was really the sticking point. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion FWIW I'm pretty strongly in favor of GHI for NumPy/SciPy (I am going to get involved in NumPy dev eventually, promise). While warty in some of the places already mentioned, I have found it to be very low-friction and low-annoyance in my own dev process (nearing 1000 issues closed in the last year in pandas). But there are fewer cooks in the kitchen with pandas so perhaps this experience wouldn't be identical with NumPy. The biggest benefit I've seen is community involvement that you really wouldn't see if I were using a Trac or something else hosted elsewhere. Users are on GitHub and it for some reason gives people a feeling of engagement in the open source process that I don't see anywhere else. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] datetime dtype possible regression
On Sat, Apr 28, 2012 at 11:18 AM, Charles R Harris charlesr.har...@gmail.com wrote:

On Sat, Apr 28, 2012 at 9:13 AM, Wes McKinney wesmck...@gmail.com wrote:

On Fri, Apr 27, 2012 at 4:57 PM, Robert Kern robert.k...@gmail.com wrote:

On Fri, Apr 27, 2012 at 21:52, Travis Vaught tra...@vaught.net wrote: With NumPy 1.6.1 (from EPD 7.2-2) I get this behavior:

In [1]: import numpy as np

In [2]: schema = np.dtype({'names':['symbol', 'date', 'open', 'high', 'low',
   ...: 'close', 'volume', 'adjclose'],
   ...: 'formats':['S8', 'M8', float, float, float, float,
   ...: float, float]})

In [3]: data = [("AAPL", "2012-04-12", 600.0, 605.0, 598.0, 602.0, 5000, 602.0),]

In [4]: recdata = np.array(data, dtype=schema)

In [5]: recdata
Out[5]:
array([ ('AAPL', datetime.datetime(2012, 4, 12, 0, 0), 600.0, 605.0, 598.0, 602.0, 5000.0, 602.0)],
      dtype=[('symbol', '|S8'), ('date', ('M8[us]', {})), ('open', 'f8'), ('high', 'f8'), ('low', 'f8'), ('close', 'f8'), ('volume', 'f8'), ('adjclose', 'f8')])

With numpy-1.7.0.dev_3cb783e I get this:

import numpy as np
schema = np.dtype({'names':['symbol','data','open','high','low','close','volume','adjclose'],
                   'formats':['S8','M8',float,float,float,float,float,float]})
data = [("AAPL", "2012-04-12", 600.0, 605.0, 598.0, 602.0, 5000, 602.0),]
recdata = np.array(data, dtype=schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Cannot create a NumPy datetime other than NaT with generic units

np.version.version
'1.7.0.dev-3cb783e'

Any hints about a regression I can check for? Or perhaps I missed an api change for specifying datetime dtypes?

Judging from the error message, it looks like an intentional API change. -- Robert Kern

Maybe this should be raised as a bug (where do we report NumPy bugs these days, still Trac?). As I'm moving to datetime64 in pandas, if NumPy 1.6.1 data has unpickling issues on NumPy 1.7+ it's going to be very problematic.

I was wondering what datetime you were using since the version in 1.6 had issues. Have you tested with both? Chuck

Could you define "issues"? I haven't had a chance to make the library compatible with both 1.6.1 and 1.7.0 yet (like Travis I'm using NumPy 1.6.1 from EPD); it's important though, as pandas will be the first widely used library I know of that will make heavy use of datetime64.

- Wes

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] datetime dtype possible regression
On Fri, Apr 27, 2012 at 4:57 PM, Robert Kern robert.k...@gmail.com wrote:

On Fri, Apr 27, 2012 at 21:52, Travis Vaught tra...@vaught.net wrote: With NumPy 1.6.1 (from EPD 7.2-2) I get this behavior:

In [1]: import numpy as np

In [2]: schema = np.dtype({'names':['symbol', 'date', 'open', 'high', 'low',
   ...: 'close', 'volume', 'adjclose'],
   ...: 'formats':['S8', 'M8', float, float, float, float,
   ...: float, float]})

In [3]: data = [("AAPL", "2012-04-12", 600.0, 605.0, 598.0, 602.0, 5000, 602.0),]

In [4]: recdata = np.array(data, dtype=schema)

In [5]: recdata
Out[5]:
array([ ('AAPL', datetime.datetime(2012, 4, 12, 0, 0), 600.0, 605.0, 598.0, 602.0, 5000.0, 602.0)],
      dtype=[('symbol', '|S8'), ('date', ('M8[us]', {})), ('open', 'f8'), ('high', 'f8'), ('low', 'f8'), ('close', 'f8'), ('volume', 'f8'), ('adjclose', 'f8')])

With numpy-1.7.0.dev_3cb783e I get this:

import numpy as np
schema = np.dtype({'names':['symbol','data','open','high','low','close','volume','adjclose'],
                   'formats':['S8','M8',float,float,float,float,float,float]})
data = [("AAPL", "2012-04-12", 600.0, 605.0, 598.0, 602.0, 5000, 602.0),]
recdata = np.array(data, dtype=schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Cannot create a NumPy datetime other than NaT with generic units

np.version.version
'1.7.0.dev-3cb783e'

Any hints about a regression I can check for? Or perhaps I missed an api change for specifying datetime dtypes?

Judging from the error message, it looks like an intentional API change. -- Robert Kern

Maybe this should be raised as a bug (where do we report NumPy bugs these days, still Trac?). As I'm moving to datetime64 in pandas, if NumPy 1.6.1 data has unpickling issues on NumPy 1.7+ it's going to be very problematic.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
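An aside on the error itself (my own note, not from the thread): on NumPy 1.7 the failure can be avoided by giving the datetime64 field explicit units instead of the generic 'M8', e.g.:

    import numpy as np

    # 'M8[D]' (day resolution) instead of bare 'M8'; any concrete unit works
    schema = np.dtype({'names': ['symbol', 'date', 'close'],
                       'formats': ['S8', 'M8[D]', float]})
    data = [("AAPL", "2012-04-12", 602.0)]
    recdata = np.array(data, dtype=schema)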
[Numpy-discussion] Improving NumPy's indexing / subsetting / fancy indexing implementation
dear all,

I've routinely found that:

1) ndarray.take is up to 1 order of magnitude faster than fancy indexing
2) Hand-coded Cython boolean indexing is many times faster than ndarray indexing
3) putmask is significantly faster than ndarray indexing

For example, I stumbled on this tonight: straightforward cython code:

def row_bool_subset(ndarray[float64_t, ndim=2] values,
                    ndarray[uint8_t, cast=True] mask):
    cdef:
        Py_ssize_t i, j, n, k, pos = 0
        ndarray[float64_t, ndim=2] out

    n, k = (<object> values).shape
    assert(n == len(mask))

    out = np.empty((mask.sum(), k), dtype=np.float64)

    for i in range(n):
        if mask[i]:
            for j in range(k):
                out[pos, j] = values[i, j]
            pos += 1

    return out

In [1]: values = randn(1000000, 4)

In [2]: mask = np.ones(1000000, dtype=bool)

In [3]: import pandas._sandbox as sbx

In [4]: result = sbx.row_bool_subset(values, mask)

In [5]: timeit result = sbx.row_bool_subset(values, mask)
100 loops, best of 3: 9.58 ms per loop

pure NumPy:

In [6]: timeit values[mask]
10 loops, best of 3: 81.6 ms per loop

Here's the kind of take performance problems that I routinely experience:

In [12]: values = randn(1000000, 4)

In [13]: values.shape
Out[13]: (1000000, 4)

In [14]: indexer = np.random.permutation(1000000)[:500000]

In [15]: timeit values[indexer]
10 loops, best of 3: 70.7 ms per loop

In [16]: timeit values.take(indexer, axis=0)
100 loops, best of 3: 13.3 ms per loop

When I can spare the time in the future I will personally work on these indexing routines in the C codebase, but I suspect that I'm not the only one adversely affected by these kinds of performance problems, and it might be worth thinking about a community effort to split up the work of retooling these methods to be more performant. We could use a tool like my vbench project (github.com/wesm/vbench) to make a list of the performance benchmarks of interest (like the ones above). Unfortunately I am too time constrained, at least for the next 6 months, to devote to a complete rewrite of the code in question. Possibly sometime in 2013 if no one has gotten to it yet, but this seems like someplace we should be concerned about as the language performance wars continue to rage.

- Wes

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
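A self-contained way to reproduce the take-vs-fancy-indexing comparison (my sketch; the shapes follow the examples above):

    import numpy as np

    values = np.random.randn(1000000, 4)
    indexer = np.random.permutation(1000000)[:500000]

    out1 = values[indexer]               # fancy indexing
    out2 = values.take(indexer, axis=0)  # same result, typically much faster
    assert (out1 == out2).all()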
Re: [Numpy-discussion] draft enum NEP
On Fri, Mar 9, 2012 at 5:48 PM, David Gowers (kampu) 00a...@gmail.com wrote:

Hi, On Sat, Mar 10, 2012 at 3:25 AM, Bryan Van de Ven bry...@continuum.io wrote: Hi all, I have started working on a NEP for adding an enumerated type to NumPy. It is on my GitHub: https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst It is still very rough, and incomplete in places. But I would like to get feedback sooner rather than later in order to refine it. In particular there are a few questions inline in the document that I would like input on. Any comments, suggestions, questions, concerns, etc. are very welcome.

t = np.dtype('enum', map=(n,v))

^ Is this supposed to be indicating 'this is an enum with values ranging between n and v'? It could be a bit more clear. Is it possible to partially define an enum? That is, give the maximum and minimum values, and only some of the enumeration value:name mappings? For example, an enum where 0 means 'n/a', +n means 'Type A Object #(n-1)' and -n means 'Type B Object #(abs(n) - 1)'. I just want to map the non-scalar values, while having a way to avoid treating valid scalar values (eg +64) as out-of-range. Example of what I mean:

t = np.dtype('enum[N_A:0]', range = (-127, 127))

(defined values being printed as a string, undefined being printed as a number.)

David

I'll have to think about this (a little brain dump here). I have many use cases in pandas where this would be useful, which are basically direct translations of R's factor data type. Note that R always coerces the levels (the unique values), AFAICT, to string type. However, mapping back to a well-dtyped array is important, too. So the temptation might be to do something like this:

ndarray: dtype storage type (uint32 or something)
mapping: khash with type PyObject* -> uint32

Now, one problem with this is that you want the mapping + dtype to be invertible (otherwise you're left doing some type inference). The way that I implement the mapping is to restrict the labeling to be from 0 to N - 1, which makes things easier. If we decide that having an explicit value mapping...

The nice thing about this is that the same set of core algorithms can be used to fix numpy.unique. For example, you would like to be able to do:

enum_arr = np.enum(arr)

(this seems like a reasonable API to me) and that is a direct equivalent of R's factor function. You need to be able to pass an explicit ordering when calling the enum/factor function. If not specified, you should have an option to either sort or not -- for example, suppose you convert an array of 1 million integers to enum but you don't particularly care about the uniques (which could be very large, up to the size of the array) being ordered (no need to pay N log N for large N). One nice thing about khash is that it can be serialized fairly easily.

Have you looked much at how I use enum-like ideas in pandas? It would be great if I could offload some of this data algorithmic work to NumPy. We will want the enum data type to integrate with text file readers -- if you factorize as you go you can drastically reduce the memory usage of a structured array (or pandas DataFrame) columns with long-ish strings and relatively few unique values.

- Wes

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
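To make the sort-or-not distinction concrete, a rough pure-Python sketch of factorizing without sorting, with a dict standing in for the khash table (code and names are mine):

    import numpy as np

    def factorize_no_sort(values):
        mapping = {}  # value -> label; stand-in for a khash table
        labels = np.empty(len(values), dtype=np.int32)
        for i, v in enumerate(values):
            if v not in mapping:
                mapping[v] = len(mapping)  # label by order of first appearance
            labels[i] = mapping[v]
        # uniques come out in first-appearance order; no N log N sort by value
        uniques = np.array(sorted(mapping, key=mapping.get), dtype=object)
        return labels, uniques

    # factorize_no_sort(['b', 'a', 'b']) -> labels [0, 1, 0], uniques ['b', 'a']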
Re: [Numpy-discussion] Migrating issues to GitHub
On Thu, Feb 16, 2012 at 4:32 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Thu, Feb 16, 2012 at 10:20 PM, Thouis (Ray) Jones tho...@gmail.com wrote: On Thu, Feb 16, 2012 at 19:25, Ralf Gommers ralf.gomm...@googlemail.com wrote: In another thread Jira was proposed as an alternative to Trac. Can you point out some of its strengths and weaknesses, and tell us why you decided to move away from it? Jira reminded me of Java. OK, you convinced me:) Ralf ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Having just created a NumPy ticket (http://projects.scipy.org/numpy/ticket/2065) I feel pretty strongly about moving the issue tracker to GitHub--the lack of attachments is easy to work around. I think it will help a lot for building more community engagement with the development process. My experience using it with pandas has been very positive-- I have churned through around 700 issues over the last 12 months and I've never felt like it's gotten in my way (except for the occasional CSS/JS bugs in the web UI). - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Fri, Feb 24, 2012 at 9:07 AM, Erin Sheldon erin.shel...@gmail.com wrote: Excerpts from Travis Oliphant's message of Thu Feb 23 15:08:52 -0500 2012: This is actually on my short-list as well --- it just didn't make it to the list. In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list. Integration with np.loadtxt is a high-priority. I think loadtxt is now the 3rd or 4th text-reading interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But, we do need to make it faster with less memory overhead for simple cases like Wes describes. I'm willing to adapt my code if it is wanted, but at the same time I don't want to step on this person's toes. Should I proceed? -e -- Erin Scott Sheldon Brookhaven National Laboratory ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion That may work-- I haven't taken a look at the code but it is probably a good starting point. We could create a new repo on the pydata GitHub org (http://github.com/pydata) and use that as our point of collaboration. I will hopefully be able to put some serious energy into this this spring. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Possible roadmap addendum: building better text file readers
dear all,

I haven't read all 180 e-mails, but I didn't see this on Travis's initial list. All of the existing flat file reading solutions I have seen are not suitable for many applications, and they compare very unfavorably to tools present in other languages, like R. Here are some of the main issues I see:

- Memory usage: creating millions of Python objects when reading a large file results in horrendously bad memory utilization, which the Python interpreter is loath to return to the operating system. Any solution using the CSV module (like pandas's parsers-- which are a lot faster than anything else I know of in Python) suffers from this problem because the data come out boxed in tuples of PyObjects. Try loading a 1,000,000 x 20 CSV file into a structured array using np.genfromtxt or into a DataFrame using pandas.read_csv and you will immediately see the problem. R, by contrast, uses very little memory.

- Performance: post-processing of Python objects results in poor performance. Also, for the actual parsing, anything regular-expression based (like the loadtable effort over the summer, all apologies to those who worked on it) is doomed to failure. I think having a tool with a high degree of compatibility and intelligence for parsing unruly small files does make sense though, but it's not appropriate for large, well-behaved files.

- Need to factorize: as soon as there is an enum dtype in NumPy, we will want to enable the file parsers for structured arrays and DataFrame to be able to factorize / convert to enum certain columns (for example, all string columns) during the parsing process, and not afterward. This is very important for enabling fast groupby on large datasets and reducing unnecessary memory usage up front (imagine a column with a million values, with only 10 unique values occurring). This would be trivial to implement using a C hash table implementation like khash.h

To be clear: I'm going to do this eventually whether or not it happens in NumPy because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure.

It seems clear to me that this work needs to be done at the lowest level possible, probably all in C (or C++?) or maybe Cython plus C utilities.

If anyone wants to get involved in this particular problem right now, let me know!

best,
Wes

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
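A back-of-envelope on the factorize-during-parse savings (my arithmetic; the per-object sizes are rough assumptions for 64-bit CPython):

    # 1,000,000-row string column with only 10 distinct values
    n, n_unique = 10**6, 10
    ptr_bytes = 8    # one object pointer per cell (assumed)
    str_bytes = 60   # rough size of a small str object (assumed)

    boxed = n * (ptr_bytes + str_bytes)   # every cell its own PyObject: ~68 MB
    coded = n * 1 + n_unique * str_bytes  # uint8 codes + 10 shared strings: ~1 MB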
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:08 PM, Travis Oliphant tra...@continuum.io wrote: This is actually on my short-list as well --- it just didn't make it to the list. In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list. Integration with np.loadtxt is a high-priority. I think loadtxt is now the 3rd or 4th text-reading interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But, we do need to make it faster with less memory overhead for simple cases like Wes describes. -Travis

Yeah, what I envision is just an infrastructural parsing engine to replace the pure Python guts of np.loadtxt, np.genfromtxt, and the csv module + Cython guts of pandas.read_{csv, table, excel}. It needs to be somewhat adaptable to some of the domain-specific decisions of structured arrays vs. DataFrames -- like I use Python objects for strings, but one consequence of this is that I can intern strings (only one PyObject per unique string value occurring) where structured arrays cannot, so you get much better performance and memory usage that way. That's soon to change, though, I gather, at which point I'll almost definitely (!) move to pointer arrays instead of dtype=object arrays.

- Wes

On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote: Hi, on 23.02.2012 20:32, Wes McKinney wrote: [clip] To be clear: I'm going to do this eventually whether or not it happens in NumPy because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure.

If you do this, one useful aim could be to design the code such that it can be used in loadtxt, at least as a fast path for common cases. I'd really like to avoid increasing the number of APIs for text file loading. -- Pauli Virtanen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:19 PM, Warren Weckesser warren.weckes...@enthought.com wrote:

On Thu, Feb 23, 2012 at 2:08 PM, Travis Oliphant tra...@continuum.io wrote: This is actually on my short-list as well --- it just didn't make it to the list. In fact, we have someone starting work on it this week. It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list. Integration with np.loadtxt is a high-priority. I think loadtxt is now the 3rd or 4th text-reading interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But, we do need to make it faster with less memory overhead for simple cases like Wes describes. -Travis

I have a proof of concept CSV reader written in C (with a Cython wrapper). I'll put it on github this weekend. Warren

Sweet, between this, the Continuum folks, and me and my guys I think we can come up with something good that suits all our needs. We should set up some realistic performance test cases that we can monitor via vbench (wesm/vbench) while we work on the project.

- W

On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote: Hi, on 23.02.2012 20:32, Wes McKinney wrote: [clip] To be clear: I'm going to do this eventually whether or not it happens in NumPy because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure.

If you do this, one useful aim could be to design the code such that it can be used in loadtxt, at least as a fast path for common cases. I'd really like to avoid increasing the number of APIs for text file loading. -- Pauli Virtanen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon erin.shel...@gmail.com wrote:

Wes - I designed the recfile package to fill this need. It might be a start. Some features:

- the ability to efficiently read any subset of the data without loading the whole file.
- reads directly into a recarray, so no overheads.
- object-oriented interface, mimicking recarray slicing.
- also supports writing

Currently it is fixed-width fields only. It is C++, but wouldn't be hard to convert it to C if that is a requirement. Also, it works for binary or ascii. http://code.google.com/p/recfile/ the trunk is pretty far past the most recent release.

Erin Scott Sheldon

Can you relicense as BSD-compatible?

Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:

dear all, I haven't read all 180 e-mails, but I didn't see this on Travis's initial list. All of the existing flat file reading solutions I have seen are not suitable for many applications, and they compare very unfavorably to tools present in other languages, like R. Here are some of the main issues I see:

- Memory usage: creating millions of Python objects when reading a large file results in horrendously bad memory utilization, which the Python interpreter is loath to return to the operating system. Any solution using the CSV module (like pandas's parsers-- which are a lot faster than anything else I know of in Python) suffers from this problem because the data come out boxed in tuples of PyObjects. Try loading a 1,000,000 x 20 CSV file into a structured array using np.genfromtxt or into a DataFrame using pandas.read_csv and you will immediately see the problem. R, by contrast, uses very little memory.

- Performance: post-processing of Python objects results in poor performance. Also, for the actual parsing, anything regular-expression based (like the loadtable effort over the summer, all apologies to those who worked on it) is doomed to failure. I think having a tool with a high degree of compatibility and intelligence for parsing unruly small files does make sense though, but it's not appropriate for large, well-behaved files.

- Need to factorize: as soon as there is an enum dtype in NumPy, we will want to enable the file parsers for structured arrays and DataFrame to be able to factorize / convert to enum certain columns (for example, all string columns) during the parsing process, and not afterward. This is very important for enabling fast groupby on large datasets and reducing unnecessary memory usage up front (imagine a column with a million values, with only 10 unique values occurring). This would be trivial to implement using a C hash table implementation like khash.h

To be clear: I'm going to do this eventually whether or not it happens in NumPy because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure.

It seems clear to me that this work needs to be done at the lowest level possible, probably all in C (or C++?) or maybe Cython plus C utilities. If anyone wants to get involved in this particular problem right now, let me know!

best,
Wes

-- Erin Scott Sheldon Brookhaven National Laboratory

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:31 PM, Éric Depagne e...@depagne.org wrote:

On Thursday, February 23, 2012 at 21:24:28, Wes McKinney wrote: [clip]

That would indeed be great. Reading large files is a real pain whatever Python method is used. BTW, could you tell us what you mean by large files? cheers, Éric.

Reasonably wide CSV files with hundreds of thousands to millions of rows. I have a separate interest in JSON handling, but that is a different kind of problem, and probably just a matter of forking ultrajson and having it not produce Python-object-based data structures. - Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:55 PM, Erin Sheldon erin.shel...@gmail.com wrote:

Excerpts from Wes McKinney's message of Thu Feb 23 15:45:18 -0500 2012: Reasonably wide CSV files with hundreds of thousands to millions of rows. I have a separate interest in JSON handling but that is a different kind of problem, and probably just a matter of forking ultrajson and having it not produce Python-object-based data structures.

As a benchmark, recfile can read an uncached file with 350,000 lines and 32 columns in about 5 seconds. File size ~220M. -e -- Erin Scott Sheldon Brookhaven National Laboratory

That's pretty good. That's almost certainly faster than pandas's csv-module+Cython approach (though I haven't run your code to get a read on how much my hardware makes a difference), but that's not shocking at all:

In [1]: df = DataFrame(np.random.randn(350000, 32))
In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
Wall time: 7.04 s

I have to think that skipping the process of creating 11.2 mm Python string objects, and then individually converting each of them to float, accounts for most of the difference. Note for reference (I'm skipping the first row, which has the column labels from above):

In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv', dtype=None, delimiter=',', skip_header=1)
CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
Wall time: 24.67 s

In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv', delimiter=',', skiprows=1)
CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
Wall time: 11.32 s

In this last case, for example, around 500 MB of RAM is taken up for an array that should only be about 80-90MB. If you're a data scientist working in Python, this is _not good_. -W

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 4:20 PM, Erin Sheldon erin.shel...@gmail.com wrote:

Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012: [clip]

It might be good to compare on recarrays, which are a bit more complex. Can you try one of these .dat files? http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/ The dtype is [('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'), ('err', 'f8'), ('scinv', 'f8', 27)] -- Erin Scott Sheldon Brookhaven National Laboratory

Forgot this one that is also widely used:

In [28]: %time recs = matplotlib.mlab.csv2rec('/home/wesm/tmp/foo.csv', skiprows=1)
CPU times: user 65.16 s, sys: 0.30 s, total: 65.46 s
Wall time: 65.55 s

OK, with one of those dat files and the dtype I get:

In [18]: %time arr = np.genfromtxt('/home/wesm/Downloads/scat-05-000.dat', dtype=dtype, skip_header=0, delimiter=' ')
CPU times: user 17.52 s, sys: 0.14 s, total: 17.66 s
Wall time: 17.67 s

The difference is not so stark in this case. I don't produce structured arrays, though:

In [26]: %time arr = read_table('/home/wesm/Downloads/scat-05-000.dat', header=None, sep=' ')
CPU times: user 10.15 s, sys: 0.10 s, total: 10.25 s
Wall time: 10.26 s

- Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Index Array Performance
On Tue, Feb 14, 2012 at 4:03 AM, Francesc Alted franc...@continuum.io wrote:

On Feb 14, 2012, at 1:50 AM, Wes McKinney wrote: [clip] But:

In [40]: timeit hist[i, j]
1 loops, best of 3: 32 us per loop

So that's roughly 7-8x slower than a simple Cython method, so I sincerely hope it could be brought down to the sub-10-microsecond level with a little bit of work.

I vaguely remember this has shown up before. My hunch is that indexing in NumPy is so powerful that it has to check for a lot of different values for indices (integers, tuples, lists, arrays…), and that all of these checks are what is taking the time. Your Cython wrapper just assumed that the indices were integers, so this is probably the reason why it is that much faster. This is not to say that indexing in NumPy could not be accelerated, but it won't be trivial, IMO. -- Francesc Alted

Given that __getitem__ and __setitem__ receive a 2-tuple of 1-dimensional integer arrays, it should be pretty simple (dare I say trivial? :) ) to optimize for this use case? The abysmal performance of __getitem__ and __setitem__ with 1d integer arrays is pretty high on my list of annoyances with NumPy (especially when take and put are so much faster), so you guys may see a pull request from me whenever I can spare the time to hack on it (assuming you don't beat me to it)!

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
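[A small illustration of the take/put gap Wes mentions; the speedups are machine- and NumPy-version dependent, so treat the thread's numbers as indicative only.]

import numpy as np

arr = np.random.randn(1000000)
idx = np.random.randint(0, arr.size, size=1000000)

# Same result; arr.take(idx) skips much of the generic fancy-indexing
# dispatch overhead that arr[idx] pays.
assert np.array_equal(arr[idx], arr.take(idx))

# put is the analogous fast path for writes; like arr[idx] = vals,
# the last write wins when idx contains duplicates.
vals = np.zeros(idx.size)
arr.put(idx, vals)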
Re: [Numpy-discussion] Index Array Performance
On Mon, Feb 13, 2012 at 6:23 PM, Marcel Oliver m.oli...@jacobs-university.de wrote:

Hi, I have a short piece of code where the use of an index array feels right, but incurs a severe performance penalty: it's about an order of magnitude slower than all other operations with arrays of that size. It comes up in a piece of code which is doing a large number of on-the-fly histograms via hist[i,j] += 1, where i is an array with the bin index to be incremented and j is simply enumerating the histograms. I attach a full short sample code below which shows how it's being used in context, and corresponding timeit output from the critical code section. Questions: - Is this a matter of principle, or due to an inefficient implementation? - Is there an equivalent way of doing it which is fast? Regards, Marcel

=

#! /usr/bin/env python
# Plot the bifurcation diagram of the logistic map
from pylab import *

Nx = 800
Ny = 600
I = 5
rmin = 2.5
rmax = 4.0
ymin = 0.0
ymax = 1.0

rr = linspace(rmin, rmax, Nx)
x = 0.5*ones(rr.shape)
hist = zeros((Ny+1, Nx), dtype=int)
j = arange(Nx)
dy = ymax/Ny

def f(x):
    return rr*x*(1.0-x)

for n in xrange(1000):
    x = f(x)

for n in xrange(I):
    x = f(x)
    i = array(x/dy, dtype=int)
    hist[i,j] += 1

figure()
imshow(hist, cmap='binary', origin='lower', interpolation='nearest',
       extent=(rmin, rmax, ymin, ymax), norm=matplotlib.colors.LogNorm())
xlabel('$r$')
ylabel('$x$')
title('Attractor of the logistic map $x_{n+1} = r \, x_n (1-x_n)$')
show()

In [4]: timeit y=f(x)
1 loops, best of 3: 19.4 us per loop
In [5]: timeit i = array(x/dy, dtype=int)
1 loops, best of 3: 22 us per loop
In [6]: timeit img[i,j] += 1
1 loops, best of 3: 119 us per loop

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

This suggests to me that fancy indexing could be quite a bit faster in this case:

In [40]: timeit hist[i,j] += 1
1 loops, best of 3: 58.2 us per loop
In [39]: timeit hist.put(np.ravel_multi_index((i, j), hist.shape), 1)
1 loops, best of 3: 20.6 us per loop

I wrote a simple Cython method

def fancy_inc(ndarray[int64_t, ndim=2] values,
              ndarray[int64_t] iarr,
              ndarray[int64_t] jarr, int64_t inc):
    cdef:
        Py_ssize_t i, n = len(iarr)
    for i in range(n):
        values[iarr[i], jarr[i]] += inc

that is even faster:

In [8]: timeit sbx.fancy_inc(hist, i, j, 1)
10 loops, best of 3: 4.85 us per loop

About 10% faster still if bounds checking and wraparound are disabled. Kind of a bummer-- perhaps this should go high on the NumPy 2.0 TODO list? - Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
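[There is also a pure-NumPy route to the histogram update that sidesteps the slow fancy-indexing path: flatten the index pairs, as in the put trick above, and count them with bincount. A sketch, not from the thread; note it also stays correct when (i, j) pairs repeat, which hist[i,j] += 1 does not.]

import numpy as np

def fancy_inc_bincount(hist, i, j):
    # hist[i, j] += 1 for paired index arrays, without a Python loop:
    # flatten the 2-D indices, tally them with bincount, and add the
    # counts back in one vectorized step.
    flat = np.ravel_multi_index((i, j), hist.shape)
    hist += np.bincount(flat, minlength=hist.size).reshape(hist.shape)

hist = np.zeros((601, 800), dtype=int)
i = np.random.randint(0, 601, size=800)
j = np.arange(800)
fancy_inc_bincount(hist, i, j)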
Re: [Numpy-discussion] Index Array Performance
On Mon, Feb 13, 2012 at 7:30 PM, Nathaniel Smith n...@pobox.com wrote:

How would you fix it? I shouldn't speculate without profiling, but I'll be naughty. Presumably the problem is that Python turns that into something like hist[i,j] = hist[i,j] + 1, which means there's no way for numpy to avoid creating a temporary array. So maybe this could be fixed by adding a fused __inplace_add__ protocol to the language (and similarly for all the other inplace operators), but that seems really unlikely. Fundamentally this is just the sort of optimization opportunity you miss when you don't have a compiler with a global view; Fortran or C++ expression templates will win every time. Maybe PyPy will fix it someday. Perhaps it would help to make np.add(hist, 1, out=hist, where=(i,j)) work? - N

Nope, don't buy it:

In [33]: timeit arr.__iadd__(1)
1000 loops, best of 3: 1.13 ms per loop
In [37]: timeit arr[:] += 1
1000 loops, best of 3: 1.13 ms per loop

- Wes

On Feb 14, 2012 12:18 AM, Wes McKinney wesmck...@gmail.com wrote: [clip]

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Index Array Performance
On Mon, Feb 13, 2012 at 7:46 PM, Wes McKinney wesmck...@gmail.com wrote:

On Mon, Feb 13, 2012 at 7:30 PM, Nathaniel Smith n...@pobox.com wrote: [clip]

Nope, don't buy it:

In [33]: timeit arr.__iadd__(1)
1000 loops, best of 3: 1.13 ms per loop
In [37]: timeit arr[:] += 1
1000 loops, best of 3: 1.13 ms per loop

- Wes

Actually, apologies, I'm being silly (had too much coffee or something). Python may be doing something nefarious with the hist[i,j] += 1: it does a get, then an add, then a set, which is probably the problem.

On Feb 14, 2012 12:18 AM, Wes McKinney wesmck...@gmail.com wrote: [clip]

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Index Array Performance
On Mon, Feb 13, 2012 at 7:48 PM, Wes McKinney wesmck...@gmail.com wrote:

On Mon, Feb 13, 2012 at 7:46 PM, Wes McKinney wesmck...@gmail.com wrote: [clip]

Actually, apologies, I'm being silly (had too much coffee or something). Python may be doing something nefarious with the hist[i,j] += 1: it does a get, then an add, then a set, which is probably the problem.

On Feb 14, 2012 12:18 AM, Wes McKinney wesmck...@gmail.com wrote: [clip]

But:

In [40]: timeit hist[i, j]
1 loops, best of 3: 32 us per loop

So that's roughly 7-8x slower than a simple Cython method, so I sincerely hope it could be brought down to the sub-10-microsecond level with a little bit of work
Re: [Numpy-discussion] fast method to to count a particular value in a large matrix
On Sun, Feb 5, 2012 at 7:02 PM, David Verelst david.vere...@gmail.com wrote:

Just out of curiosity, what speed-up factor did you achieve? Regards, David

On 04/02/12 22:20, Naresh wrote: Warren Weckesser warren.weckesser at enthought.com writes:

On Sat, Feb 4, 2012 at 2:35 PM, Benjamin Root ben.root at ou.edu wrote:

On Saturday, February 4, 2012, Naresh Pai npai at uark.edu wrote: I am somewhat new to Python (been coding with Matlab mostly). I am trying to simplify (and expedite) a piece of code that is currently a bottleneck in a larger code. I have a large array (7000 rows x 4500 columns) titled, say, abc, and I am trying to find a fast method to count the number of instances of each unique value within it. All unique values are stored in a variable, say, unique_elem. My current code is as follows:

import numpy as np

# allocate space for storing element count
elem_count = np.zeros((len(unique_elem), 1))

# loop through and count number of unique_elem
for i in range(len(unique_elem)):
    elem_count[i] = np.sum(reduce(np.logical_or,
                                  (abc == x for x in [unique_elem[i]])))

This loop is the bottleneck because I have about 850 unique elements and it takes about 9-10 minutes. Can you suggest a faster way to do this? Thank you, Naresh

np.unique() can return indices and reverse indices. It would be trivial to histogram the reverse indices using np.histogram().

Instead of histogram(), you can use bincount() on the inverse indices:

u, inv = np.unique(abc, return_inverse=True)
n = np.bincount(inv)

u will be an array of the unique elements, and n will be an array of the corresponding number of occurrences. Warren

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

The histogram() solution works perfectly since unique_elem is ordered. I appreciate everyone's help.

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

np.histogram works pretty well. I'm getting speeds something like 1300 ms on float64 data. A hash-table-based solution is faster (no big surprise here), about 800 ms, so in the ballpark of 40% faster. Whenever I get motivated enough I'm going to make a pull request on NumPy with something like khash.h and start fixing all the O(N log N) algorithms floating around that ought to be O(N). NumPy should really have a match function similar to R's, and a lot of other things. - Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
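[Warren's unique/bincount suggestion as a self-contained example; the array here is a small random stand-in for the 7000 x 4500 one in the question.]

import numpy as np

abc = np.random.randint(0, 850, size=(70, 45))

# One pass over the data: the unique values, plus their counts via
# the inverse indices.
u, inv = np.unique(abc, return_inverse=True)
n = np.bincount(inv)

# u[k] occurs n[k] times in abc -- no per-value Python loop needed.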
Re: [Numpy-discussion] Numpy 'groupby'
On Wed, Jan 11, 2012 at 7:05 AM, Neal Becker ndbeck...@gmail.com wrote:

Michael Hull wrote: Hi everyone, First off, thanks for all your hard work on numpy, it's a really great help! I was wondering if there was a standard 'groupby' in numpy, similar to that in itertools. I know it's not hard to write with np.diff, but I have found myself writing it on more than a couple of occasions, and wondered if there was a 'standardised' version I was missing out on? Thanks, Mike

I've played with groupby in pandas.

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

I agree (unsurprisingly) that pandas is your best bet: http://pandas.sourceforge.net/groupby.html I've had it on my TODO list to extend the pandas groupby engine (which has grown fairly sophisticated) to work with generic ndarrays and record arrays: https://github.com/wesm/pandas/issues/123 It shouldn't actually be that hard for most simple cases. I could imagine the results of a groupby being somewhat difficult to interpret without axis labeling/indexing, though. cheers, Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
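[For reference, a minimal example of the pandas groupby Wes points to; the column names and data are made up.]

import numpy as np
from pandas import DataFrame

df = DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                'value': np.arange(5.0)})

# Split-apply-combine: sum 'value' within each group of 'key'.
result = df.groupby('key')['value'].sum()
# result: a -> 6.0, b -> 4.0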
Re: [Numpy-discussion] PyInt and Numpy's int64 conversion
On Wed, Jan 4, 2012 at 5:22 AM, xantares 09 xantare...@hotmail.com wrote:

From: wesmck...@gmail.com Date: Sat, 24 Dec 2011 19:51:06 -0500 To: numpy-discussion@scipy.org Subject: Re: [Numpy-discussion] PyInt and Numpy's int64 conversion

On Sat, Dec 24, 2011 at 3:11 AM, xantares 09 xantare...@hotmail.com wrote:

From: wesmck...@gmail.com Date: Fri, 23 Dec 2011 12:31:45 -0500 To: numpy-discussion@scipy.org Subject: Re: [Numpy-discussion] PyInt and Numpy's int64 conversion

On Fri, Dec 23, 2011 at 4:37 AM, xantares 09 xantare...@hotmail.com wrote:

Hi, I'm using Numpy from the C Python API side while tweaking my SWIG interface to work with numpy array types. I want to convert a numpy array of integers (whose elements are numpy's 'int64') to C integers. The problem is that this int64 type is not compatible with the standard Python integer type: I cannot use PyInt_Check and PyInt_AsUnsignedLongMask to check and convert from int64: basically PyInt_Check returns false. I checked the numpy config header and npy_int64 does have a size of 8 bytes, which should be the same as int on my x86_64. What is the correct way to do that? I checked for an Int64_Check function and didn't find any in numpy headers. Regards, x.

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

hello, I think you'll want to use the C macro PyArray_IsIntegerScalar; e.g., in pandas I have the following function exposed to my Cython code:

PANDAS_INLINE int
is_integer_object(PyObject* obj) {
    return PyArray_IsIntegerScalar(obj);
}

Last time I checked, that macro detects Python int, long, and all of the NumPy integer hierarchy (int8, 16, 32, 64). If you ONLY want to check for int64 I am not 100% sure of the best way. - Wes

Hi, Thank you for your reply! That's the thing: I want to check/convert every type of integer, numpy's int64 and also standard Python ints. Is there a way to use only the Python API (and avoid depending on numpy's PyArray_* functions)? Regards. x.

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

No. All of the PyTypeObject objects for the NumPy array scalars are explicitly part of the NumPy C API, so you have no choice but to depend on that (to get the best performance). If you want to ONLY check for int64 at the C API level, I did a bit of digging and the relevant type definitions are in https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/npy_common.h so you'll want to do:

int is_int64(PyObject* obj) {
    return PyObject_TypeCheck(obj, &PyInt64ArrType_Type);
}

and that will *only* detect np.int64. - Wes

Ok, many thanks! One last thing: do you happen to know how to actually convert an np.int64 to a C int? - x.

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Not sure off-hand. You'll have to look at the NumPy scalar API in the C code.

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
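[At the Python level, as opposed to the C API discussed above, the same kind of check can be spelled with isinstance, since the fixed-width scalar types all inherit from np.integer. A sketch with hypothetical helper names:]

import numpy as np

def is_integer_object(obj):
    # Python-level analogue of PyArray_IsIntegerScalar: True for
    # Python ints and the whole NumPy integer hierarchy.
    # (On Python 2, add `long` to the tuple as well.)
    return isinstance(obj, (int, np.integer))

def as_plain_int(obj):
    # Plain-int conversion; the Python-side stand-in for pulling a
    # C integer out of an np.int64 scalar.
    return int(obj)

assert is_integer_object(np.int64(7)) and is_integer_object(7)
assert as_plain_int(np.int64(7)) == 7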
Re: [Numpy-discussion] Enum type
On Tue, Jan 3, 2012 at 12:46 PM, Ognen Duzlevski og...@enthought.com wrote:

Hello, I am playing with adding an enum dtype to numpy (to get my feet wet in numpy, really). I have looked at https://github.com/martinling/numpy_quaternion and I feel comfortable with my understanding of adding a simple type to numpy in technical terms. I am mostly a C programmer and have programmed in Python, but not at the level where my code could be considered pretty or maybe even pythonic. I know enums from C and have browsed around a few Python enum implementations online. Most of them use hash tables or lists to associate names with numbers - these approaches just feel heavy to me. What would be a proper numpy approach to this? I am looking mostly for direction and advice, as I would like to do the work myself :-) Any input appreciated :-) Ognen

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

You should use a hash table internally, in my opinion. I've started using khash from klib (https://github.com/attractivechaos/klib), which has excellent memory usage (more than 50% less than a Python dict with large hash tables) and good performance characteristics. With the enum dtype you can avoid reference counting for primitive types; I'm not sure about object dtype. If enum arrays are mutable this will be very tricky. - Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Enum type
On Tue, Jan 3, 2012 at 1:06 PM, Jim Vickroy jim.vick...@noaa.gov wrote:

On 1/3/2012 10:46 AM, Ognen Duzlevski wrote: [clip]

Does enumerate (http://docs.python.org/library/functions.html#enumerate) work for you?

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

That's not exactly what he means. The R lingo for this concept is factor, or a bit more commonly categorical variable: http://stat.ethz.ch/R-manual/R-patched/library/base/html/factor.html FWIW, R's factor type is implemented using hash tables. I do the same in pandas. - Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
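[The factor idea in NumPy terms, for anyone unfamiliar with the R concept: integer codes plus a table of levels, reconstructable by indexing. A sketch using np.unique; a hash-table version would avoid the sort.]

import numpy as np

labels = np.array(['red', 'blue', 'red', 'green', 'blue'], dtype=object)

# levels: the distinct values; codes: each element's index into levels.
levels, codes = np.unique(labels, return_inverse=True)
# levels -> ['blue' 'green' 'red'], codes -> [2 0 2 1 0]

# Indexing the levels by the codes round-trips the original labels.
assert (levels[codes] == labels).all()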
Re: [Numpy-discussion] GSOC
On Thu, Dec 29, 2011 at 4:36 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote:

On Thu, Dec 29, 2011 at 9:50 PM, Charles R Harris charlesr.har...@gmail.com wrote: Hi All, I thought I'd raise this topic just to get some ideas out there. At the moment I see two areas that I'd like to see addressed. Documentation editor. This would involve looking at the generated documentation and its organization/coverage, as well as such things as style, and maybe reviewing stuff on the documentation site. This would be more technical writing than coding. Test coverage. There are a lot of areas of numpy that are not well tested, as well as some tests that are still doc tests and should probably be updated. This is a substantial amount of work and would require some familiarity with numpy as well as a willingness to ping developers for clarification of some topics. Thoughts?

First thought: very useful, but probably not GSOC topics by themselves. For a very good student, I'd think topics like implementing NA bit masks or improved user-defined dtypes would be interesting. In SciPy there's also a lot to do, and that's probably a better project for students who prefer to work in Python. Thanks for bringing this up. Last year we missed the boat; it would be good to get one or more slots this year. Ralf

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Along with test coverage, have any of you considered any systematic monitoring of NumPy performance? With all of the extensive refactoring / work on the internals, it would be useful to keep an eye on things in case of any performance regressions. I mention this because I started a little prototype project (http://github.com/wesm/vbench) for doing exactly that for my own development purposes-- it's already proved extremely useful. Anyway, just a thought. I'm sure a motivated student could spend a whole summer writing unit tests for NumPy and nothing else. - Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] PyInt and Numpy's int64 conversion
On Sat, Dec 24, 2011 at 3:11 AM, xantares 09 xantare...@hotmail.com wrote:

From: wesmck...@gmail.com Date: Fri, 23 Dec 2011 12:31:45 -0500 To: numpy-discussion@scipy.org Subject: Re: [Numpy-discussion] PyInt and Numpy's int64 conversion

On Fri, Dec 23, 2011 at 4:37 AM, xantares 09 xantare...@hotmail.com wrote: [clip]

hello, I think you'll want to use the C macro PyArray_IsIntegerScalar; e.g., in pandas I have the following function exposed to my Cython code:

PANDAS_INLINE int
is_integer_object(PyObject* obj) {
    return PyArray_IsIntegerScalar(obj);
}

Last time I checked, that macro detects Python int, long, and all of the NumPy integer hierarchy (int8, 16, 32, 64). If you ONLY want to check for int64 I am not 100% sure of the best way. - Wes

Hi, Thank you for your reply! That's the thing: I want to check/convert every type of integer, numpy's int64 and also standard Python ints. Is there a way to use only the Python API (and avoid depending on numpy's PyArray_* functions)? Regards. x.

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

No. All of the PyTypeObject objects for the NumPy array scalars are explicitly part of the NumPy C API, so you have no choice but to depend on that (to get the best performance). If you want to ONLY check for int64 at the C API level, I did a bit of digging and the relevant type definitions are in https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/npy_common.h so you'll want to do:

int is_int64(PyObject* obj) {
    return PyObject_TypeCheck(obj, &PyInt64ArrType_Type);
}

and that will *only* detect np.int64. - Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] PyInt and Numpy's int64 conversion
On Fri, Dec 23, 2011 at 4:37 AM, xantares 09 xantare...@hotmail.com wrote:

Hi, I'm using Numpy from the C Python API side while tweaking my SWIG interface to work with numpy array types. I want to convert a numpy array of integers (whose elements are numpy's 'int64') to C integers. The problem is that this int64 type is not compatible with the standard Python integer type: I cannot use PyInt_Check and PyInt_AsUnsignedLongMask to check and convert from int64: basically PyInt_Check returns false. I checked the numpy config header and npy_int64 does have a size of 8 bytes, which should be the same as int on my x86_64. What is the correct way to do that? I checked for an Int64_Check function and didn't find any in numpy headers. Regards, x.

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

hello, I think you'll want to use the C macro PyArray_IsIntegerScalar; e.g., in pandas I have the following function exposed to my Cython code:

PANDAS_INLINE int
is_integer_object(PyObject* obj) {
    return PyArray_IsIntegerScalar(obj);
}

Last time I checked, that macro detects Python int, long, and all of the NumPy integer hierarchy (int8, 16, 32, 64). If you ONLY want to check for int64 I am not 100% sure of the best way. - Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fast Reading of ASCII files
On Mon, Dec 12, 2011 at 12:34 PM, Warren Weckesser warren.weckes...@enthought.com wrote:

On Mon, Dec 12, 2011 at 10:22 AM, Chris.Barker chris.bar...@noaa.gov wrote:

On 12/11/11 8:40 AM, Ralf Gommers wrote: On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker chris.bar...@noaa.gov * If we have a good, fast ascii (or unicode?) to array reader, hopefully it could be leveraged for use in the more complex cases, so that rather than genfromtxt() being written from scratch, it would be a wrapper around the lower-level reader.

You seem to be contradicting yourself here. The more complex cases are Wes' 10% and why genfromtxt is so hairy internally. There's always a trade-off between speed and handling complex corner cases. You want both.

I don't think the version in my mind is contradictory (not quite). What I'm imagining is that a good, fast ASCII-to-numpy-array reader could read a whole table in at once (the common, easy, fast case), but it could also be used to read snippets of a file in at a time, which could be leveraged to handle many of the more complex cases. I suppose there will always be cases where the user needs to write their own converter from string to dtype, and there is simply no way to leverage what I'm imagining to support that. Hmm, maybe there is -- for instance, if a record consisted of mostly standard, easy-to-parse numbers, but one field was some weird text that needed custom parsing, we could read it as a dtype with a string for that one weird field, and that could be converted in a post-processing step. Maybe that wouldn't be any faster or easier, but it could be done... Anyway, whether you can leverage it for the full-featured version or not, I do think there is call for a good, fast, 90%-case text file parser. Would anyone like to join/form a small working group to work on this? Wes, I'd like to see your Cython version -- maybe a starting point? -Chris

I'm also working on a faster text file reader, so count me in. I've been experimenting in both C and Cython. I'll put it on github as soon as I can. Warren

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Cool, Warren, I look forward to seeing it. I'm hopeful we can craft a performant tool that will meet the needs of many projects (NumPy, pandas, etc.)...

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy 1.7.0 release?
On Tue, Dec 6, 2011 at 4:11 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote:

On Mon, Dec 5, 2011 at 8:43 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote: Hi all, It's been a little over 6 months since the release of 1.6.0 and the NA debate has quieted down, so I'd like to ask your opinion on the timing of 1.7.0. It looks to me like we have a healthy amount of bug fixes and small improvements, plus three larger chunks of work: - datetime - NA - Bento support. My impression is that both datetime and NA are releasable, but should be labeled tech preview or something similar, because they may still see significant changes. Please correct me if I'm wrong. There's still some maintenance work to do and pull requests to merge, but a beta release by Christmas should be feasible.

To be a bit more detailed here, these are the most significant pull requests / patches that I think can be merged with a limited amount of work: meshgrid enhancements: http://projects.scipy.org/numpy/ticket/966 sample_from function: https://github.com/numpy/numpy/pull/151 loadtable function: https://github.com/numpy/numpy/pull/143 Other maintenance things: - un-deprecate putmask - clean up the causes of the "DType strings 'O4' and 'O8' are deprecated..." warnings - fix failing einsum and polyfit tests - update release notes. Cheers, Ralf

What do you all think? Cheers, Ralf

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

This isn't the place for this discussion, but we should start talking about building a *high performance* flat file loading solution with good column type inference and sensible defaults, etc. It's clear that loadtable is aiming for highest compatibility-- for example, I can read a 2800x30 file in 50 ms with the read_table / read_csv functions I wrote myself recently in Cython (compared with loadtable taking 1s as quoted in the pull request), but I don't handle European decimal formats and lots of other sources of unruliness. I personally don't believe in sacrificing an order of magnitude of performance in the 90% case for the 10% case-- so maybe it makes sense to have two functions around: a superfast custom CSV reader for well-behaved data, and a slower but highly flexible function like loadtable to fall back on. I think R has two functions read.csv and read.csv2, where read.csv2 is capable of dealing with things like European decimal format. - Wes

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
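[A tiny sketch of the two-function strategy, using the readers NumPy already ships: the stricter, faster path first, the flexible one as a fallback. The split is illustrative, not a finished design.]

import numpy as np

def load_table(path, delimiter=','):
    # Fast path: loadtxt assumes clean, fully numeric rows.
    try:
        return np.loadtxt(path, delimiter=delimiter)
    except ValueError:
        # Fallback: genfromtxt tolerates missing and unruly fields.
        return np.genfromtxt(path, delimiter=delimiter)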
Re: [Numpy-discussion] consensus (was: NA masks in the next numpy release?)
On Fri, Oct 28, 2011 at 9:32 PM, Charles R Harris charlesr.har...@gmail.com wrote:

On Fri, Oct 28, 2011 at 6:45 PM, Wes McKinney wesmck...@gmail.com wrote:

On Fri, Oct 28, 2011 at 7:53 PM, Benjamin Root ben.r...@ou.edu wrote: [clip]
Re: [Numpy-discussion] consensus (was: NA masks in the next numpy release?)
On Fri, Oct 28, 2011 at 7:53 PM, Benjamin Root ben.r...@ou.edu wrote:

On Friday, October 28, 2011, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Fri, Oct 28, 2011 at 4:21 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Sat, Oct 29, 2011 at 12:37 AM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Fri, Oct 28, 2011 at 3:14 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Fri, Oct 28, 2011 at 3:56 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Fri, Oct 28, 2011 at 2:43 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Fri, Oct 28, 2011 at 2:41 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Fri, Oct 28, 2011 at 3:16 PM, Nathaniel Smith n...@pobox.com wrote: On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant oliph...@enthought.com wrote:

I think Nathaniel and Matthew provided very specific feedback that was helpful in understanding other perspectives of a difficult problem. In particular, I really wanted bit-patterns implemented. However, I also understand that Mark did quite a bit of work and altered his original designs quite a bit in response to community feedback. I wasn't a major part of the pull request discussion, nor did I merge the changes, but I support Charles if he reviewed the code and felt like it was the right thing to do. I likely would have done the same thing rather than let Mark Wiebe's work languish.

My connectivity is spotty this week, so I'll stay out of the technical discussion for now, but I want to share a story. Maybe a year ago now, Jonathan Taylor and I were debating what the best API for describing statistical models would be -- whether we wanted something like R's formulas (which I supported), or another approach based on sympy (his idea). To summarize, I thought his API was confusing, pointlessly complicated, and didn't actually solve the problem; he thought R-style formulas were superficially simpler but hopelessly confused and inconsistent underneath. Now, obviously, I was right and he was wrong. Well, obvious to me, anyway... ;-) But it wasn't like I could just wave a wand and make his arguments go away, no [clip]

I should point out that the implementation hasn't - as far as I can see - changed the discussion. The discussion was about the API. Implementations are useful for agreed APIs because they can point out where the API does not make sense or cannot be implemented. In this case, the API Mark said he was going to implement - he did implement - at least as far as I can see. Again, I'm happy to be corrected.

In saying that we are insisting on our way, you are saying, implicitly, 'I am not going to negotiate'.

That is only your interpretation. The observation that Mark compromised quite a bit while you didn't seems largely correct to me.

The problem here stems from our inability to work towards agreement, rather than standing on set positions. I set out what changes I think would make the current implementation OK. Can we please, please have a discussion about those points instead of trying to argue about who has given more ground?

That commitment would of course be good. However, even if that were possible before writing code and everyone agreed that the ideas of you and Nathaniel should be implemented in full, it's still not clear that either of you would be willing to write any code. Agreement without code still doesn't help us very much.

I'm going to return to Nathaniel's point - it is a highly valuable thing to set ourselves the target of resolving substantial discussions by consensus. The route you are endorsing here is 'implementor wins'. We don't need to do it that way. We're a mature, sensible bunch of adults who can talk out the issues until we agree they are ready for implementation, and then implement. That's all Nathaniel is saying. I think he's obviously right, and I'm sad that it isn't as clear to y'all as it is to me. Best, Matthew

Everyone, can we please not do this?! I've had enough of adults doing finger pointing back over the summer during the whole debt ceiling debate. I think we can all agree that we are better than the US Congress? Forget about rudeness or decision processes. I will start by saying that I am willing to separate ignore and absent, but only on the write side of things. On read, I want a single way to identify the missing values. I also want only a single way to perform calculations (either skip or propagate). An indicator of success would be that people stop using NaNs and magic numbers (-, anyone?) and we could even deprecate nansum(), or at least strongly suggest in its docs to use NA.

Well, I haven't completely made up my mind yet, will have to do some more
Re: [Numpy-discussion] NA masks in the next numpy release?
On Mon, Oct 24, 2011 at 10:54 AM, Charles R Harris charlesr.har...@gmail.com wrote:

On Mon, Oct 24, 2011 at 8:40 AM, Charles R Harris charlesr.har...@gmail.com wrote:

On Sun, Oct 23, 2011 at 11:23 PM, Wes McKinney wesmck...@gmail.com wrote:

On Sun, Oct 23, 2011 at 8:07 PM, Eric Firing efir...@hawaii.edu wrote: [clip]

I personally am a bit apprehensive as I am worried about the masked array abstraction leaking through to users of pandas, something which I simply will not accept (why I decided against using numpy.ma early on, that + performance problems). Basically if having an understanding of masked arrays is a prerequisite for using pandas, the whole thing is DOA to me as it undermines the usability arguments I've been making about switching to Python (from R) for data analysis and statistical computing.

The missing data functionality looks far more like R than numpy.ma. For instance:

In [8]: a = arange(5, maskna=1)
In [9]: a[2] = np.NA
In [10]: a.mean()
Out[10]: NA(dtype='float64')
In [11]: a.mean(skipna=1)
Out[11]: 2.0
In [12]: a = arange(5)
In [13]: b = a.view(maskna=1)
In [14]: a.mean()
Out[14]: 2.0
In [15]: b[2] = np.NA
In [16]: b.mean()
Out[16]: NA(dtype='float64')
In [17]: b.mean(skipna=1)
Out[17]: 2.0

Chuck

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

I don't really agree with you. Some sample R code:

arr <- rnorm(10)
arr[5:8] <- NA
arr
 [1]  0.6451460 -1.1285552  0.6869828  0.4018868         NA         NA
 [7]         NA         NA  0.3322803 -1.9201257

In your examples you had to pass maskna=True-- I suppose that my only recourse would be to make sure that every array inside a DataFrame, for example, has maskna=True set. I'll have to look in more detail and see if it's feasible/desirable. There's a memory cost to pay, but you can't get the functionality for free. I may just end up sticking with NaN as it's worked pretty well so far the last few years-- it's an impure solution but one with reasonably good performance characteristics in the places that matter.

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
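[For comparison, the NaN-based approach Wes says he may stick with, in the same shape as Chuck's maskna examples above. A sketch; note NaN only exists for floating-point dtypes.]

import numpy as np

arr = np.arange(5.0)
arr[2] = np.nan              # NaN as the missing-value marker

arr.mean()                   # nan -- propagates, like a.mean() above
arr[~np.isnan(arr)].mean()   # 2.0 -- skips, like a.mean(skipna=1)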
Re: [Numpy-discussion] NA masks in the next numpy release?
On Sun, Oct 23, 2011 at 8:07 PM, Eric Firing efir...@hawaii.edu wrote: On 10/23/2011 12:34 PM, Nathaniel Smith wrote: like. And in this case I do think we can come up with an API that will make everyone happy, but that Mark's current API probably can't be incrementally evolved to become that API.) No one could object to coming up with an API that makes everyone happy, provided that it actually gets coded up, tested, and is found to be fast and maintainable. When you say the API probably can't be evolved, do you mean that the underlying implementation also has to be redone? And if so, who will do it, and when? Eric ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I personally am a bit apprehensive as I am worried about the masked array abstraction leaking through to users of pandas, something which I simply will not accept (why I decided against using numpy.ma early on, that + performance problems). Basically if having an understanding of masked arrays is a prerequisite for using pandas, the whole thing is DOA to me as it undermines the usability arguments I've been making about switching to Python (from R) for data analysis and statistical computing. Performance is also a concern, but based on prior discussions it seems a great deal can be done there. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Preventing an ndarray subclass from returning new subclass instances for std(), etc
On Sun, Sep 18, 2011 at 4:25 PM, Pierre GM pgmdevl...@gmail.com wrote: On Sep 18, 2011, at 21:25 , Stéfan van der Walt wrote: On Sun, Sep 18, 2011 at 9:48 AM, Keith Hughitt keith.hugh...@gmail.com wrote: Interesting. It works as expected when called as a method: In [10]: x = np.ma.array([[1,2,3]]) In [11]: x.std() Out[11]: 0.81649658092772603 I'm guessing that np.ma.array implements its own std function, but I haven't checked. It does. np.std tries first to call the .std method of the input argument. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I thought maybe you can intercept 0-dim objects and return self.item() in __array_finalize__, but no dice. This is really weird:

import numpy as np

class example(np.ndarray):
    def __new__(cls, arr):
        return np.array(arr).view(cls)
    def __array_finalize__(self, obj):
        print 'foo'
        if self.ndim == 0:
            return self.item()

In [6]: foo = example(np.arange(10)) foo In [7]: foo.std() foo foo foo foo foo foo foo Out[7]: example(2.8722813232690143) ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
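For anyone hitting the same thing: a blunt but reliable workaround (a sketch, not the only option) is to delegate the reduction to the plain ndarray implementation, so the 0-d result comes back as a scalar rather than a subclass instance. The class name here is made up for illustration; __array_finalize__ can't help because its return value is ignored by NumPy.

import numpy as np

class Example(np.ndarray):
    def __new__(cls, arr):
        return np.asarray(arr).view(cls)

    def std(self, *args, **kwargs):
        # Viewing as np.ndarray first means the base-class reduction runs
        # and returns a plain scalar, not Example(...)
        return self.view(np.ndarray).std(*args, **kwargs)

foo = Example(np.arange(10))
print(foo.std())   # 2.8722813232690143, a plain float

The same delegation works for mean(), var(), and friends.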
Re: [Numpy-discussion] Bug in numpy std, etc. with other data structures?
On Sat, Sep 17, 2011 at 4:48 PM, Skipper Seabold jsseab...@gmail.com wrote: Just ran into this. Any objections for having numpy.std and other functions in core/fromnumeric.py call asanyarray before trying to use the array's method? Other data structures like pandas and larry define their own std method, for instance, and this doesn't allow them to pass through. I'm inclined to say that the issue is with numpy, though maybe the data structures shouldn't shadow numpy array methods while altering the signature. I dunno. df = pandas.DataFrame(np.random.random((10,5))) np.std(df,axis=0) snip TypeError: std() got an unexpected keyword argument 'dtype' np.std(np.asanyarray(df),axis=0) array([ 0.30883352, 0.3133324 , 0.26517361, 0.26389029, 0.20022444]) Though I don't think this would work with larry yet. Pull request: https://github.com/numpy/numpy/pull/160 Skipper ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Note I've no real intention of making DataFrame fully ndarray-like-- but it's nice to be able to type: df.std(axis=0) df.std(axis=1) np.sqrt(df) etc. which works the same as ndarray. I suppose the __array__/__array_wrap__ interface is there largely as a convenience. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
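A minimal sketch of what the proposed fix amounts to (not the actual diff in the pull request): coerce the input with asanyarray before relying on the method, so array-likes that merely define __array__ (such as a DataFrame) get converted, while true ndarray subclasses pass through untouched.

import numpy as np

def std(a, axis=None, dtype=None, out=None, ddof=0):
    # asanyarray converts list-likes and __array__-defining objects to
    # ndarray, but leaves ndarray subclasses (e.g. np.ma arrays) alone
    a = np.asanyarray(a)
    return a.std(axis=axis, dtype=dtype, out=out, ddof=ddof)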
Re: [Numpy-discussion] Bug in numpy std, etc. with other data structures?
On Sat, Sep 17, 2011 at 8:36 PM, josef.p...@gmail.com wrote: On Sat, Sep 17, 2011 at 5:12 PM, Wes McKinney wesmck...@gmail.com wrote: On Sat, Sep 17, 2011 at 4:48 PM, Skipper Seabold jsseab...@gmail.com wrote: Just ran into this. Any objections for having numpy.std and other functions in core/fromnumeric.py call asanyarray before trying to use the array's method? Other data structures like pandas and larry define their own std method, for instance, and this doesn't allow them to pass through. I'm inclined to say that the issue is with numpy, though maybe the data structures shouldn't shadow numpy array methods while altering the signature. I dunno. df = pandas.DataFrame(np.random.random((10,5))) np.std(df,axis=0) snip TypeError: std() got an unexpected keyword argument 'dtype' np.std(np.asanyarray(df),axis=0) array([ 0.30883352, 0.3133324 , 0.26517361, 0.26389029, 0.20022444]) Though I don't think this would work with larry yet. Pull request: https://github.com/numpy/numpy/pull/160 Skipper ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Note I've no real intention of making DataFrame fully ndarray-like-- but it's nice to be able to type: df.std(axis=0) df.std(axis=1) np.sqrt(df) etc. which works the same as ndarray. I suppose the __array__/__array_wrap__ interface is there largely as a convenience. I'm a bit worried about the different ddof defaults in cases like this. Essentially we will not be able to rely on ddof=0 anymore. Different defaults on axis are easy to catch, but having the same function call return sometimes ddof=0 and sometimes ddof=1 might make for some fun debugging. Josef ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Can we lobby for default ddof=1 in NumPy 2.0? Breaking with a convention like this doesn't make much sense to me. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
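To make Josef's hazard concrete, here is the kind of silent discrepancy at stake (the printed values are what NumPy produces):

import numpy as np

x = np.arange(10.0)
print(np.std(x))          # 2.8722813232690143 (population std, ddof=0, NumPy's default)
print(np.std(x, ddof=1))  # 3.0276503540974917 (sample std, ddof=1, what R's sd() returns)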
Re: [Numpy-discussion] Bug in numpy std, etc. with other data structures?
On Sat, Sep 17, 2011 at 10:50 PM, Bruce Southey bsout...@gmail.com wrote: On Sat, Sep 17, 2011 at 4:12 PM, Wes McKinney wesmck...@gmail.com wrote: On Sat, Sep 17, 2011 at 4:48 PM, Skipper Seabold jsseab...@gmail.com wrote: Just ran into this. Any objections for having numpy.std and other functions in core/fromnumeric.py call asanyarray before trying to use the array's method? Other data structures like pandas and larry define their own std method, for instance, and this doesn't allow them to pass through. I'm inclined to say that the issue is with numpy, though maybe the data structures shouldn't shadow numpy array methods while altering the signature. I dunno. df = pandas.DataFrame(np.random.random((10,5))) np.std(df,axis=0) snip TypeError: std() got an unexpected keyword argument 'dtype' np.std(np.asanyarray(df),axis=0) array([ 0.30883352, 0.3133324 , 0.26517361, 0.26389029, 0.20022444]) Though I don't think this would work with larry yet. Pull request: https://github.com/numpy/numpy/pull/160 Skipper ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion numpy.std() does accept array-like input, which obviously means that np.std([1,2,3,5]) works, making the asanyarray call a total waste of cpu time. Clearly pandas is not array-like input (as Wes points out below) so an error is correct. Doing this type of 'fix' will have unintended consequences when other non-numpy objects are incorrectly passed to numpy functions. Rather you should determine why 'array-like' failed here IF you think a pandas object is either array-like or a numpy object. No, the reason it is failing is because np.std takes the EAFP/duck-typing approach:

try:
    std = a.std
except AttributeError:
    return _wrapit(a, 'std', axis, dtype, out, ddof)
return std(axis, dtype, out, ddof)

Indeed DataFrame has an std method but it doesn't have the same function signature as ndarray.std. Note I've no real intention of making DataFrame fully ndarray-like-- but it's nice to be able to type: df.std(axis=0) df.std(axis=1) np.sqrt(df) etc. which works the same as ndarray. I suppose the __array__/__array_wrap__ interface is there largely as a convenience. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I consider that the only way for pandas or any other numpy derivative to overcome this is to get into numpy/scipy. After all, Travis opened the discussion for Numpy 3, which you could still address. Bruce PS Good luck on the ddof thing given the past discussions on it! ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NA masks for NumPy are ready to test
On Wed, Aug 24, 2011 at 8:19 PM, Mark Wiebe mwwi...@gmail.com wrote: On Fri, Aug 19, 2011 at 11:37 AM, Bruce Southey bsout...@gmail.com wrote: Hi, snip 2) Can the 'skipna' flag be added to the methods? a.sum(skipna=True) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'skipna' is an invalid keyword argument for this function np.sum(a,skipna=True) nan I've added this now, as well. I think that finishes up the changes you suggested in this email which felt right to me. Cheers, Mark snip ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Sorry I haven't had a chance to have a tinker yet. My initial observations: - I haven't decided whether this is a problem: In [50]: arr = np.arange(100) In [51]: arr[5:10] = np.NA --- ValueError Traceback (most recent call last) /home/wesm/<ipython-input-51-7e07a94409e9> in <module>() ----> 1 arr[5:10] = np.NA ValueError: Cannot set NumPy array values to NA values without first enabling NA support in the array I assume when you flip the maskna switch that a mask is created? - Performance with skipna is a bit disappointing: In [52]: arr = np.random.randn(1e6) In [54]: arr.flags.maskna = True In [56]: arr[::2] = np.NA In [58]: timeit arr.sum(skipna=True) 100 loops, best of 3: 7.31 ms per loop this goes down to 2.12 ms if there are no NAs present. but: In [59]: import bottleneck as bn In [60]: arr = np.random.randn(1e6) In [61]: arr[::2] = np.nan In [62]: timeit bn.nansum(arr) 1000 loops, best of 3: 1.17 ms per loop do you have a sense if this gap can be closed? I assume you've been, as you should, focused on a correct implementation as opposed to squeezing out performance. best, Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
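For comparison, the NaN-sentinel route that bottleneck optimizes looks roughly like this in pure NumPy (a sketch; timings will of course differ by machine):

import numpy as np

arr = np.random.randn(int(1e6))
arr[::2] = np.nan

# explicit boolean mask, at the cost of a temporary copy:
total = arr[~np.isnan(arr)].sum()
# or the library helper, which treats NaN as zero:
total2 = np.nansum(arr)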
Re: [Numpy-discussion] Questionable reduceat behavior
On Sun, Aug 14, 2011 at 11:58 AM, Wes McKinney wesmck...@gmail.com wrote: On Sat, Aug 13, 2011 at 8:06 PM, Mark Wiebe mwwi...@gmail.com wrote: Looks like this is the second-oldest open bug in the bug tracker. http://projects.scipy.org/numpy/ticket/236 For what it's worth, I'm in favour of changing this behavior to be more consistent as proposed in that ticket. -Mark On Thu, Aug 11, 2011 at 11:25 AM, Wes McKinney wesmck...@gmail.com wrote: I'm a little perplexed why reduceat was made to behave like this: In [26]: arr = np.ones((10, 4), dtype=bool) In [27]: arr Out[27]: array([[ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True]], dtype=bool) In [30]: np.add.reduceat(arr, [0, 3, 3, 7, 9], axis=0) Out[30]: array([[3, 3, 3, 3], [1, 1, 1, 1], [4, 4, 4, 4], [2, 2, 2, 2], [1, 1, 1, 1]]) this does not seem intuitively correct. Since we have: In [33]: arr[3:3].sum(0) Out[33]: array([0, 0, 0, 0]) I would expect array([[3, 3, 3, 3], [0, 0, 0, 0], [4, 4, 4, 4], [2, 2, 2, 2], [1, 1, 1, 1]]) Obviously I can RTFM and see why it does this (if ``indices[i] >= indices[i + 1]``, the i-th generalized row is simply ``a[indices[i]]``), but it doesn't make much sense to me, and I need to work around it. Suggestions? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Well, I certainly hope it doesn't get forgotten about for another 5 years. I think having more consistent behavior would be better rather than conforming to a seemingly arbitrary decision made ages ago in Numeric. - Wes just a manual hack for now where I needed it... https://github.com/wesm/pandas/blob/master/pandas/core/frame.py#L2155 ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Questionable reduceat behavior
On Sat, Aug 13, 2011 at 8:06 PM, Mark Wiebe mwwi...@gmail.com wrote: Looks like this is the second-oldest open bug in the bug tracker. http://projects.scipy.org/numpy/ticket/236 For what it's worth, I'm in favour of changing this behavior to be more consistent as proposed in that ticket. -Mark On Thu, Aug 11, 2011 at 11:25 AM, Wes McKinney wesmck...@gmail.com wrote: I'm a little perplexed why reduceat was made to behave like this: In [26]: arr = np.ones((10, 4), dtype=bool) In [27]: arr Out[27]: array([[ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True]], dtype=bool) In [30]: np.add.reduceat(arr, [0, 3, 3, 7, 9], axis=0) Out[30]: array([[3, 3, 3, 3], [1, 1, 1, 1], [4, 4, 4, 4], [2, 2, 2, 2], [1, 1, 1, 1]]) this does not seem intuitively correct. Since we have: In [33]: arr[3:3].sum(0) Out[33]: array([0, 0, 0, 0]) I would expect array([[3, 3, 3, 3], [0, 0, 0, 0], [4, 4, 4, 4], [2, 2, 2, 2], [1, 1, 1, 1]]) Obviously I can RTFM and see why it does this (if ``indices[i] >= indices[i + 1]``, the i-th generalized row is simply ``a[indices[i]]``), but it doesn't make much sense to me, and I need to work around it. Suggestions? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Well, I certainly hope it doesn't get forgotten about for another 5 years. I think having more consistent behavior would be better rather than conforming to a seemingly arbitrary decision made ages ago in Numeric. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Questionable reduceat behavior
I'm a little perplexed why reduceat was made to behave like this: In [26]: arr = np.ones((10, 4), dtype=bool) In [27]: arr Out[27]: array([[ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True], [ True, True, True, True]], dtype=bool) In [30]: np.add.reduceat(arr, [0, 3, 3, 7, 9], axis=0) Out[30]: array([[3, 3, 3, 3], [1, 1, 1, 1], [4, 4, 4, 4], [2, 2, 2, 2], [1, 1, 1, 1]]) this does not seem intuitively correct. Since we have: In [33]: arr[3:3].sum(0) Out[33]: array([0, 0, 0, 0]) I would expect array([[3, 3, 3, 3], [0, 0, 0, 0], [4, 4, 4, 4], [2, 2, 2, 2], [1, 1, 1, 1]]) Obviously I can RTFM and see why it does this (if ``indices[i] >= indices[i + 1]``, the i-th generalized row is simply ``a[indices[i]]``), but it doesn't make much sense to me, and I need to work around it. Suggestions? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
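A workaround sketch for the empty-segment behaviour (assuming the indices are non-decreasing, as in the example above): run reduceat as usual, then post-correct the rows whose segment length is zero.

import numpy as np

arr = np.ones((10, 4), dtype=bool)
idx = np.array([0, 3, 3, 7, 9])

out = np.add.reduceat(arr, idx, axis=0)
# reduceat returns a[idx[i]] for empty segments; an empty sum should be 0
lengths = np.diff(np.append(idx, arr.shape[0]))
out[lengths == 0] = 0
# out is now [[3,3,3,3], [0,0,0,0], [4,4,4,4], [2,2,2,2], [1,1,1,1]]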
Re: [Numpy-discussion] Warning: invalid value encountered in divide
On Mon, Aug 8, 2011 at 7:01 PM, Neal Becker ndbeck...@gmail.com wrote: Warning: invalid value encountered in divide No traceback. How can I get more info on this? Can this warning be converted to an exception so I can get a trace? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Try calling np.seterr(divide='raise') or np.seterr(all='raise') ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
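Worth noting: the message "invalid value encountered in divide" is governed by the 'invalid' flag (0/0 and similar), while 'divide' covers finite/0, so raising on both is the safe bet. A small sketch:

import numpy as np

np.seterr(divide='raise', invalid='raise')   # global, as suggested above

# or scoped to a block, which is usually preferable:
with np.errstate(invalid='raise'):
    np.zeros(3) / np.zeros(3)   # raises FloatingPointError with a traceback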
Re: [Numpy-discussion] inconsistent semantics for double-slicing
On Wed, Jul 27, 2011 at 5:36 PM, Alex Flint alex.fl...@gmail.com wrote: When applying two different slicing operations in succession (e.g. select a sub-range, then select using a binary mask) it seems that numpy arrays can be inconsistent with respect to assignment: For example, in this case an array is modified: In [6]: A = np.arange(5) In [8]: A[:][A>2] = 0 In [10]: A Out[10]: array([0, 1, 2, 0, 0]) Whereas here the original array remains unchanged In [11]: A = np.arange(5) In [12]: A[[0,1,2,3,4]][A>2] = 0 In [13]: A Out[13]: array([0, 1, 2, 3, 4]) This arose in a less contrived situation in which I was trying to copy a small image into a large image, modulo a mask on the small image. Is this meant to be like this? Alex ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion When you do this: A[[0,1,2,3,4]][A>2] = 0 what is happening is: A.__getitem__([0,1,2,3,4]).__setitem__(A > 2, 0) Whenever you do getitem with fancy indexing (i.e. A[[0,1,2,3,4]]), it produces a new object. In the first case, slicing A[:] produces a view on the same data. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
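For the image use case, the upshot is to keep the region selection a basic slice (a view) and do the masked selection inside a single __setitem__, so the write reaches the original array. A sketch with made-up names and shapes:

import numpy as np

large = np.zeros((100, 100))
small = np.random.rand(10, 10)
mask = small > 0.5

# the slice is a view, so this assignment writes into `large`:
large[20:30, 40:50][mask] = small[mask]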
Re: [Numpy-discussion] named ndarray axes
On Wed, Jul 13, 2011 at 7:15 PM, Craig Yoshioka crai...@me.com wrote: Yup exactly. To enable this sort of tracking I needed to explicitly reverse-engineer the effects of indexing on axes. I figure overriding indexing catches most cases that modify axes, but other holes need to be plugged as well... ie: transpose, swapaxes. This probably means most C functions that change array axes (np.mean(axis=), etc.) need to be covered as well. That sucks. BTW, it sounds like you're trying to track very similar data. I am trying to load structural biology data formats, and I try to preserve as much of the metadata as possible, ie: I am encoding unit cell length/angle information as vectors, etc. Here is my implementation:

def __getitem__(self,idxs):
    idxs,trans,keep = idx_axes(self,idxs)
    array = self.view(np.ndarray).transpose(trans)
    array = array[idxs]
    if isinstance(array,ndarray):
        array = array.view(self.__class__)
        array.axes = self.axes.transpose(keep)
    return array

def idx_axes(array,idxs):
    # explicitly expand ellipsis
    expanded_idxs = idx_expanded(array.ndim,idxs)
    # determine how the axes will be rearranged as a result of axes-based
    # indexing and the creation of newaxes
    remapped_axes = idx_axes_remapped(array.ndim,array.axes,expanded_idxs)
    # determine numpy compatible transpose, before newaxes are created
    transposed_axes = idx_axes_transposed(remapped_axes)
    # determine numpy compatible indexes with axes-based indexing removed
    filtered_idxs = idx_filtered(expanded_idxs)
    # determine which axes will be kept after numpy indexing
    kept_axes = idx_axes_kept(remapped_axes,filtered_idxs)
    return filtered_idxs,transposed_axes,kept_axes

def idx_expanded(ndim,idxs):
    ''' explicitly expands ellipsis taking into account newaxes '''
    if not isinstance(idxs,tuple):
        return idx_expanded(ndim,(idxs,))
    # how many dimensions we will end up having
    ndim = ndim + idxs.count(newaxis)
    filler = slice(None)
    def skip_ellipsis(idxs):
        return tuple([filler if isinstance(x,type(Ellipsis)) else x for x in idxs])
    def fill_ellipsis(ndim,l,r):
        return (filler,)*(ndim-len(l)-len(r))
    # expand first ellipsis, treat all other ellipsis as slices
    if Ellipsis in idxs:
        idx = idxs.index(Ellipsis)
        llist = idxs[:idx]
        rlist = skip_ellipsis(idxs[idx+1:])
        cfill = fill_ellipsis(ndim,llist,rlist)
        idxs = llist + cfill + rlist
    return idxs

def idx_filtered(idxs):
    ''' replace indexes on axes with normal numpy indexes '''
    def axisAsIdx(idx):
        if isinstance(idx,str):
            return slice(None)
        elif isinstance(idx,slice):
            if isinstance(idx.start,str):
                return idx.stop
        return idx
    return tuple([axisAsIdx(x) for x in idxs])

def idx_axes_remapped(ndim,axes,idxs):
    ''' if given a set of array indexes that contain labeled axes, return a
    tuple that maps axes in the source array to the axes as they will end up
    in the destination array. Must take into account the spaces created by
    newaxis indexes. '''
    # how many dims are we expecting?
    ndim = ndim + idxs.count(newaxis)
    # new unique object for marking unassigned locations in mapping
    unassigned = object()
    # by default all locations are unassigned
    mapping = [unassigned] * ndim
    # find labels in indexes and set the dims for those locations
    for dim,label in enumerate(idxs):
        if label == newaxis:
            mapping[dim] = label
        elif isinstance(label,str):
            mapping[dim] = axes.dimForLabel(label)
        elif isinstance(label,slice):
            if isinstance(label.start,str):
                mapping[dim] = axes.dimForLabel(label.start)
    # find unassigned dims, in order
    unmapped = [d for d in range(ndim) if d not in set(mapping)]
    # fill in remaining unassigned locations with dims
    for dst,src in enumerate(mapping):
        if src == unassigned:
            mapping[dst] = unmapped.pop(0)
    return tuple(mapping)

def idx_axes_transposed(mapping):
    ''' stripping out newaxes in mapping yields a tuple compatible with
    transpose '''
    return tuple([x for x in mapping if x != newaxis])

def idx_axes_kept(mapping,idxs):
    ''' remove axes from mapping that will not survive the indexing (ie: ints) '''
    kept = []
    first_list = True
    for dst,src in enumerate(mapping):
        if dst < len(idxs):
            idx = idxs[dst]
            if isinstance(idx,int):
                continue
            elif isinstance(idx,list):
                if not first_list:
                    continue
                first_list = False
        kept += [src]
    return tuple(kept)

On Jul 13, 2011, at 3:13 PM, Sam Quinan wrote: I'm currently working on interfacing ndarrays with a
Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney wesmck...@gmail.com wrote: On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith n...@pobox.com wrote: On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root ben.r...@ou.edu wrote: On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith n...@pobox.com wrote: This is a situation where I would just... use an array and a mask, rather than a masked array. Then lots of things -- changing fill values, temporarily masking/unmasking things, etc. -- come for free, just from knowing how arrays and boolean indexing work? With a masked array, it is for free. Why re-invent the wheel? It has already been done for me. But it's not for free at all. It's an additional concept that has to be maintained, documented, and learned (with the last cost, which is multiplied by the number of users, being by far the greatest). It's not reinventing the wheel, it's saying hey, I have wheels and axles, but what I really need the library to provide is a wheel+axle assembly! You're communicating my argument better than I am. Do we really get much advantage by building all these complex operations in? I worry that we're trying to anticipate and write code for every situation that users find themselves in, instead of just giving them some simple, orthogonal tools. This is the danger, which is why I advocate retaining the MaskedArray type that would provide the high-level intelligent operations, meanwhile having in the core the basic data structures for pairing a mask with an array, and to recognize a special np.NA value that would act upon the mask rather than the underlying data. Users would get very basic functionality, while the MaskedArray would continue to provide the interface that we are used to. The interface as described is quite different... in particular, all aggregate operations would change their behavior. As a corollary, I worry that learning and keeping track of how masked arrays work is more hassle than just ignoring them and writing the necessary code by hand as needed. Certainly I can imagine that *if the mask is a property of the data* then it's useful to have tools to keep it aligned with the data through indexing and such. But some of these other things are quicker to reimplement than to look up the docs for, and the reimplementation is easier to read, at least for me... What you are advocating is similar to the tried-n-true coding practice of Matlab users of using NaNs. You will hear from Matlab programmers about how it is the greatest idea since sliced bread (and I was one of them). Then I was introduced to Numpy, and while I do sometimes still use the NaN approach, I realized that the masked array is a better way. Hey, no need to go around calling people Matlab programmers, you might hurt someone's feelings. But seriously, my argument is that every abstraction and new concept has a cost, and I'm dubious that the full masked array abstraction carries its weight and justifies this cost, because it's highly redundant with existing abstractions. That has nothing to do with how tried-and-true anything is. +1. I think I will personally only be happy if masked array can be implemented while incurring near-zero cost from the end user perspective. If what we end up with is a faster implementation of numpy.ma in C I'm probably going to keep on using NaN... That's why I'm entirely insistent that whatever design be dogfooded on non-expert users.
If it's very much harder / trickier / nuanced than R, you will have failed. This sounds unduly pessimistic to me. It's one thing to suggest different approaches, another to cry doom and threaten to go eat worms. And all before the code is written, benchmarks run, or trial made of the usefulness of the approach. Let us see how things look as they get worked out. Mark has a good track record for innovative tools and I'm rather curious myself to see what the result is. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I hope you're right. So far it seems that anyone who has spent real time with R (e.g. myself, Nathaniel) has expressed serious concerns about the masked approach. And we got into this discussion at the Data Array summit in Austin last month because we're trying to make Python more competitive with R vis-à-vis statistical and financial applications. I'm just trying to be (R)ealistic =P Remember that I very earnestly am doing everything I can these days to make scientific Python more successful in finance and statistics. One big difference with R's approach is that we care more about performance than the R community does. So maybe having special NA values will be prohibitive
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 1:05 PM, Nathaniel Smith n...@pobox.com wrote: On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett matthew.br...@gmail.com wrote: So far I see the difference between 1) and 2) being that you cannot unmask. So, if you didn't even know you could unmask data, then it would not matter that 1) was being implemented by masks? I guess that is a difference, but I'm trying to get at something more fundamental -- not just what operations are allowed, but what operations people *expect* to be allowed. It seems like some of us have been talking past each other a lot, where someone says but changing masks is the single most important feature! and then someone else says what are you talking about that doesn't even make sense. To clarify, you're proposing for: a = np.sum(np.array([np.NA, np.NA])) 1) -> np.NA 2) -> 0.0 Yes -- and in R you actually do get NA, while in numpy.ma you actually do get 0. I don't think this is a coincidence; I think it's because they're designed as coherent systems that are trying to solve different problems. (Well, numpy.ma's hardmask idea seems inspired by the missing-data concept rather than the temporary-mask concept, but aside from that it seems pretty consistent in implementing option 2.) Agree. My basic observation about numpy.ma is that it's a finely crafted solution for a different set of problems than the ones I have. I just don't want the same thing to happen here so I'm stuck writing code (like I am now) that looks like:

mask = y.mask
the_sum = y.sum(axis)
the_count = mask.sum(axis)
the_sum[the_count == 0] = nan

Here's another possible difference -- in (1), intuitively, missingness is a property of the data, so the logical place to put information about whether you can expect missing values is in the dtype, and to enable missing values you need to make a new array with a new dtype. (If we use a mask-based implementation, then np.asarray(nomissing_array, dtype=yesmissing_type) would still be able to skip making a copy of the data -- I'm talking ONLY about the interface here, not whether missing data has a different storage format from non-missing data.) In (2), the whole point is to use different masks with the same data, so I'd argue masking should be a property of the array object rather than the dtype, and the interface should logically allow masks to be created, modified, and destroyed in place. They're both internally consistent, but I think we might have to make a decision and stick to it. I agree it's good to separate the API from the implementation. I think the implementation is also important because I care about memory and possibly speed. But, that is a separate problem from the API... Yes, absolutely memory and speed are important. But a really fast solution to the wrong problem isn't so useful either :-). -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
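The two aggregation behaviours under discussion, in runnable form (np.NA never shipped; NaN stands in for a missing float here, and the all-NaN nansum result shown is that of recent NumPy):

import numpy as np

a = np.array([np.nan, np.nan])   # two "missing" values
print(a.sum())        # nan -- propagating, the R-style answer (option 1)
print(np.nansum(a))   # 0.0 -- the empty sum, the skipping answer (option 2)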
Re: [Numpy-discussion] Concepts for masked/missing data
On Sat, Jun 25, 2011 at 3:51 PM, Nathaniel Smith n...@pobox.com wrote: On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root ben.r...@ou.edu wrote: On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith n...@pobox.com wrote: I guess that is a difference, but I'm trying to get at something more fundamental -- not just what operations are allowed, but what operations people *expect* to be allowed. That is quite a trickier problem. It can be. I think of it as the difference between design and coding. They overlap less than one might expect... Here's another possible difference -- in (1), intuitively, missingness is a property of the data, so the logical place to put information about whether you can expect missing values is in the dtype, and to enable missing values you need to make a new array with a new dtype. (If we use a mask-based implementation, then np.asarray(nomissing_array, dtype=yesmissing_type) would still be able to skip making a copy of the data -- I'm talking ONLY about the interface here, not whether missing data has a different storage format from non-missing data.) In (2), the whole point is to use different masks with the same data, so I'd argue masking should be a property of the array object rather than the dtype, and the interface should logically allow masks to be created, modified, and destroyed in place. I can agree with this distinction. However, if missingness is an intrinsic property of the data, then shouldn't users be implementing their own dtype tailored to the data they are using? In other words, how far does the core of NumPy need to go to address this issue? And how far would be too much? Yes, that's exactly my question: whether our goal is to implement missingness in numpy or not! They're both internally consistent, but I think we might have to make a decision and stick to it. Of course. I think that Mark is having a very inspired idea of giving the R audience what they want (np.NA), while simultaneously making the use of masked arrays even easier (which I can certainly appreciate). I don't know. I think we could build a really top-notch implementation of missingness. I also think we could build a really top-notch implementation of masking. But my suggestions for how to improve the current design are totally different depending on which of those is the goal, and neither the R audience (like me) nor the masked array audience (like you) seems really happy with the current design. And I don't know what the goal is -- maybe it's something else and the current design hits it perfectly? Maybe we want a top-notch implementation of *both* missingness and masking, and those should be two different things that can be combined, so that some of the unmasked values inside a masked array can be NA? I don't know. I will put out a little disclaimer. I once had to use S+ for a class. To be honest, it was the worst programming experience in my life. This experience may be coloring my perception of R's approach to handling missing data. There's a lot of things that R does wrong (not their fault; language design is an extremely difficult and specialized skill, that statisticians are not exactly trained in), but it did make a few excellent choices at the beginning. One was to steal the execution model from Scheme, which, uh, isn't really relevant here. The other was to steal the basic data types and standard library that the Bell Labs statisticians had pounded into shape over many years. 
I use Python now because using R for everything would drive me crazy, but despite its many flaws, it still does some things so well that it's become *the* language used for basically all statistical research. I'm only talking about stealing those things :-). -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion +1. Everyone knows R ain't perfect. I think it's an atrociously bad programming language but it can be unbelievably good at statistics, as evidenced by its success. Brings to mind Andy Gelman's blog last fall: http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html As someone in a statistics department I've frequently been disheartened when I see how easy many statistical things are in R and how much more difficult they are in Python. This is partially the result of poor interfaces for statistical modeling, partially due to data structures (e.g. the integrated-ness of data.frame throughout R) and things like handling of missing data of which there's currently no equivalent. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
On Fri, Jun 24, 2011 at 12:33 PM, Mark Wiebe mwwi...@gmail.com wrote: On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith n...@pobox.com wrote: On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe mwwi...@gmail.com wrote: On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith n...@pobox.com wrote: It should also be possible to accomplish a general solution at the dtype level. We could have a 'dtype factory' used like: np.zeros(10, dtype=np.maybe(float)) where np.maybe(x) returns a new dtype whose storage size is x.itemsize + 1, where the extra byte is used to store missingness information. (There might be some annoying alignment issues to deal with.) Then for each ufunc we define a handler for the maybe dtype (or add a special-case to the ufunc dispatch machinery) that checks the missingness value and then dispatches to the ordinary ufunc handler for the wrapped dtype. The 'dtype factory' idea builds on the way I've structured datetime as a parameterized type, but the thing that kills it for me is the alignment problems of 'x.itemsize + 1'. Having the mask in a separate memory block is a lot better than having to store 16 bytes for an 8-byte int to preserve the alignment. Hmm. I'm not convinced that this is the best approach either, but let me play devil's advocate. The disadvantage of this approach is that masked arrays would effectively have a 100% memory overhead over regular arrays, as opposed to the shadow mask approach where the memory overhead is 12.5%--100% depending on the size of objects being stored. Probably the most common case is arrays of doubles, in which case it's 100% versus 12.5%. So that sucks. But on the other hand, we gain: -- simpler implementation: no need to be checking and tracking the mask buffer everywhere. The needed infrastructure is already built in. I don't believe this is true. The dtype mechanism would need a lot of work to build that needed infrastructure first. The analysis I've done so far indicates the masked approach will give a simpler/cleaner implementation. -- simpler conceptually: we already have the dtype concept, it's very powerful and we use it for all sorts of things; using it here too plays to our strengths. We already know what a numpy scalar is and how it works. Everyone already understands how assigning a value to an element of an array works, how it interacts with broadcasting, etc., etc., and in this model, that's all a missing value is -- just another value. From Python, this aspect of things would be virtually identical between the two mechanisms. The dtype approach would require more coding and overhead where you have to create copies of your data to convert it into the parameterized NA[int32] dtype, versus with the masked approach where you say x.flags.hasmask = True or something like that without copying the data. -- it composes better with existing functionality: for example, someone mentioned the distinction between a missing field inside a record versus a missing record. In this model, that would just be the difference between dtype([('x', maybe(float))]) and maybe(dtype([('x', float)])). Indeed, the difference between an NA[:x:f4, :y:f4] versus :x:NA[f4], :y:NA[f4] can't be expressed the way I've designed the mask functionality. (Note, this struct dtype string representation isn't actually supported in NumPy.) Optimization is important and all, but so is simplicity and robustness. That's why we're using Python in the first place :-).
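A toy model of the 'maybe' dtype idea, using a structured dtype to make the value-plus-flag layout concrete. This is illustration only; the real proposal requires ufunc dispatch and the alignment work discussed above, and the names here are invented:

import numpy as np

def maybe(base):
    # pair each value with a validity byte: x.itemsize + 1 storage,
    # plus whatever padding alignment forces
    return np.dtype([('value', base), ('isna', np.bool_)])

a = np.zeros(10, dtype=maybe(np.float64))
a['value'] = np.arange(10.0)
a['isna'][3] = True

def maybe_sum(a):
    # propagating semantics: any NA makes the result NA (NaN stands in)
    return np.nan if a['isna'].any() else a['value'].sum()

print(maybe_sum(a))   # nan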
If we think that the memory overhead for floating point types is too high, it would be easy to add a special case where maybe(float) used a distinguished NaN instead of a separate boolean. The extra complexity would be isolated to the 'maybe' dtype's inner loop functions, and transparent to the Python level. (Implementing a similar optimization for the masking approach would be really nasty.) This would change the overhead comparison to 0% versus 12.5% in favor of the dtype approach. Yeah, there would be no such optimization for the masked approach. If someone really wants this, they are not precluded from also implementing their own nafloat dtype which operates independently of the masking mechanism. -Mark -- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I don't have enough time to engage in this discussion as I'd like but I'll give my input. I've spent a very amount of time in pandas trying to craft a sensible and performant missing-data-handling solution giving existing tools which does not get in the user's way and which also works for non-floating point data. The result works but isn't completely 100% satisfactory, and I
Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root ben.r...@ou.edu wrote: ... Again, there are pros and cons either way and I see them very orthogonal and complementary. That may be true, but I imagine only one of them will be implemented. @Mark - I don't have a clear idea whether you consider the nafloat64 option to be still in play as the first thing to be implemented (before array.mask). If it is, what kind of thing would persuade you either way? Mark can speak for himself, but I think things are tending towards masks. They have the advantage of one implementation for all data types, current and future, and they are more flexible since the masked data can be actual valid data that you just choose to ignore for experimental reasons. What might be helpful is a routine to import/export R files, but that shouldn't be too difficult to implement. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Perhaps we should make a wiki page someplace summarizing pros and cons of the various implementation approaches? I worry very seriously about adding API functions relating to masks rather than having special NA values which propagate in algorithms. The question is: will Joe Blow, former R user, have to understand what the mask is and how to work with it? If the answer is yes we have a problem. If it can be completely hidden as an implementation detail, that's great. In R NAs are just sort of inherent-- they propagate; you deal with them when you have to via the na.rm flag in functions or is.na. The other problem I can think of with masks is the extra memory footprint, though maybe this is no cause for concern. -W ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
On Fri, Jun 24, 2011 at 8:02 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Fri, Jun 24, 2011 at 5:22 PM, Wes McKinney wesmck...@gmail.com wrote: On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root ben.r...@ou.edu wrote: ... Again, there are pros and cons either way and I see them very orthogonal and complementary. That may be true, but I imagine only one of them will be implemented. @Mark - I don't have a clear idea whether you consider the nafloat64 option to be still in play as the first thing to be implemented (before array.mask). If it is, what kind of thing would persuade you either way? Mark can speak for himself, but I think things are tending towards masks. They have the advantage of one implementation for all data types, current and future, and they are more flexible since the masked data can be actual valid data that you just choose to ignore for experimental reasons. What might be helpful is a routine to import/export R files, but that shouldn't be too difficult to implement. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Perhaps we should make a wiki page someplace summarizing pros and cons of the various implementation approaches? I worry very seriously about adding API functions relating to masks rather than having special NA values which propagate in algorithms. The question is: will Joe Blow, former R user, have to understand what the mask is and how to work with it? If the answer is yes we have a problem. If it can be completely hidden as an implementation detail, that's great. In R NAs are just sort of inherent-- they propagate; you deal with them when you have to via the na.rm flag in functions or is.na. Well, I think both of those can be pretty transparent. Could you illustrate some typical R usage, to wit: 1) setting a value to NA 2) checking a value for NA Other things are problematic, like checking for integer overflow. For safety that would be desirable, for speed not. I think that is a separate question however. In any case, if we do check such things we should be able to set the corresponding mask value in the loop, and I suppose that is the sort of thing you want. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I think anyone making decisions about this needs to have a pretty good understanding of what R does.
So here are some examples, but you guys really need to spend some time with R if you have not already: arr <- rnorm(20) arr [1] 1.341960278 0.757033314 -0.910468762 -0.475811935 -0.007973053 [6] 1.618201117 -0.965747088 0.386811224 0.229158237 0.987050613 [11] 1.293453170 -2.432399045 -0.247593481 -0.639769586 -0.464996583 [16] 0.720181047 0.846607030 0.486173088 -0.911247626 0.370326788 arr[5:10] = NA arr [1] 1.3419603 0.7570333 -0.9104688 -0.4758119 NA NA [7] NA NA NA NA 1.2934532 -2.4323990 [13] -0.2475935 -0.6397696 -0.4649966 0.7201810 0.8466070 0.4861731 [19] -0.9112476 0.3703268 is.na(arr) [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE mean(arr) [1] NA mean(arr, na.rm=T) [1] -0.01903945 arr + rnorm(20) [1] 2.081580297 0.505050028 -0.696287035 -1.280323279 NA [6] NA NA NA NA NA [11] 2.166078369 -1.445271291 0.764894624 0.795890929 0.549621207 [16] 0.005215596 -0.170001426 0.712335355 -0.919671745 -0.617099818 and obviously this is OK too: arr <- rep('wes', 10) arr[5:7] <- NA is.na(arr) [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE note, NA gets excluded from categorical variables (factors): as.factor(arr) [1] wes wes wes wes <NA> <NA> <NA> wes wes wes Levels: wes e.g. groupby with NA: tapply(rnorm(10), arr, mean) wes -0.5271853 ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
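For a side-by-side feel, roughly the same session with numpy.ma (a sketch; note the asymmetry the thread keeps circling back to: ma reductions skip masked values by default, where R's default is to propagate NA):

import numpy as np

arr = np.ma.array(np.random.randn(20))
arr[5:10] = np.ma.masked            # ~ arr[5:10] <- NA
print(np.ma.getmaskarray(arr))      # ~ is.na(arr)
print(arr.mean())                   # ~ mean(arr, na.rm=T): masked values skipped
res = arr + np.random.randn(20)     # masks propagate through arithmetic
print(res.mask[5:10])               # all True, like the NA block in arr + rnorm(20)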
Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith n...@pobox.com wrote: On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root ben.r...@ou.edu wrote: On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith n...@pobox.com wrote: This is a situation where I would just... use an array and a mask, rather than a masked array. Then lots of things -- changing fill values, temporarily masking/unmasking things, etc. -- come for free, just from knowing how arrays and boolean indexing work? With a masked array, it is for free. Why re-invent the wheel? It has already been done for me. But it's not for free at all. It's an additional concept that has to be maintained, documented, and learned (with the last cost, which is multiplied by the number of users, being by far the greatest). It's not reinventing the wheel, it's saying hey, I have wheels and axles, but what I really need the library to provide is a wheel+axle assembly! You're communicating my argument better than I am. Do we really get much advantage by building all these complex operations in? I worry that we're trying to anticipate and write code for every situation that users find themselves in, instead of just giving them some simple, orthogonal tools. This is the danger, which is why I advocate retaining the MaskedArray type that would provide the high-level intelligent operations, meanwhile having in the core the basic data structures for pairing a mask with an array, and to recognize a special np.NA value that would act upon the mask rather than the underlying data. Users would get very basic functionality, while the MaskedArray would continue to provide the interface that we are used to. The interface as described is quite different... in particular, all aggregate operations would change their behavior. As a corollary, I worry that learning and keeping track of how masked arrays work is more hassle than just ignoring them and writing the necessary code by hand as needed. Certainly I can imagine that *if the mask is a property of the data* then it's useful to have tools to keep it aligned with the data through indexing and such. But some of these other things are quicker to reimplement than to look up the docs for, and the reimplementation is easier to read, at least for me... What you are advocating is similar to the tried-n-true coding practice of Matlab users of using NaNs. You will hear from Matlab programmers about how it is the greatest idea since sliced bread (and I was one of them). Then I was introduced to Numpy, and while I do sometimes still use the NaN approach, I realized that the masked array is a better way. Hey, no need to go around calling people Matlab programmers, you might hurt someone's feelings. But seriously, my argument is that every abstraction and new concept has a cost, and I'm dubious that the full masked array abstraction carries its weight and justifies this cost, because it's highly redundant with existing abstractions. That has nothing to do with how tried-and-true anything is. +1. I think I will personally only be happy if masked array can be implemented while incurring near-zero cost from the end user perspective. If what we end up with is a faster implementation of numpy.ma in C I'm probably going to keep on using NaN... That's why I'm entirely insistent that whatever design be dogfooded on non-expert users. If it's very much harder / trickier / nuanced than R, you will have failed. As for documentation, on hard/soft masks, just look at the docs for the MaskedArray constructor: [...snipped...] Thanks!
-- Nathaniel ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fixing up datetime
On Wed, Jun 8, 2011 at 8:53 PM, Mark Wiebe mwwi...@gmail.com wrote: On Wed, Jun 8, 2011 at 4:57 AM, Wes McKinney wesmck...@gmail.com wrote: snip So in summary, w.r.t. time series data and datetime, the only things I care about from a datetime / pandas point of view: - Ability to easily define custom timedeltas Can you elaborate on this a bit? I'm guessing you're not referring to the timedelta64 in NumPy, which is simply an integer with an associated unit. I guess what I am thinking of may not need to be available at the NumPy level. As an example, suppose you connected a timedelta (DateOffset, in pandas parlance) to a set of holidays, so when you say: date + BusinessDay(1, calendar='US') then you have some custom logic which knows how to describe the result of that. Some things along these lines have already been discussed-- but this is the basic way that the date offsets work in pandas, i.e. subclasses of DateOffset which implement custom logic. Probably too high level for NumPy. - Generate datetime objects, or some equivalent, which can be used to back pandas data structures Do you mean mechanisms to generate sequences of datetime's? I'm fixing up arange to work reasonably with datetimes at the moment. Yes, that will be very nice. - (possible now??) Ability to have a set of frequency-naive dates (possibly not in order). This should work just as well as if you have an arbitrary set of integers specifying the locations of the sample points. Cheers, Mark Cool (!). This last point actually matters. Suppose you wanted to get the worst 5-performing days in the S&P 500 index: In [7]: spx.index Out[7]: <class 'pandas.core.daterange.DateRange'> offset: <1 BusinessDay>, tzinfo: None [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] length: 2963 # but this is OK In [8]: spx.order()[:5] Out[8]: 2008-10-15 00:00:00 -0.0903497960942 2008-12-01 00:00:00 -0.0892952780505 2008-09-29 00:00:00 -0.0878970494885 2008-10-09 00:00:00 -0.0761670761671 2008-11-20 00:00:00 -0.0671229140321 - W ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
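For reference, the business-day machinery that eventually landed in NumPy 1.7 covers the plain-calendar part of this (weekend rolling and a holiday list, though not pandas-style pluggable offset logic). A small sketch:

import numpy as np

d = np.datetime64('2011-10-21')                      # a Friday
print(np.busday_offset(d, 1, roll='forward'))        # 2011-10-24, the Monday
print(np.busday_offset(d, 1, roll='forward',
                       holidays=['2011-10-24']))     # 2011-10-25, skipping it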
Re: [Numpy-discussion] fixing up datetime
On Wed, Jun 8, 2011 at 7:36 AM, Chris Barker chris.bar...@noaa.gov wrote: On 6/7/11 4:53 PM, Pierre GM wrote: Anyhow, each time you read 'frequency' in scikits.timeseries, think 'unit'. or maybe precision -- when I think of unit, I think of something that can be represented as a floating point value -- but here, with integers, it's the precision that can be represented. Just a thought. Well, it can be argued that the epoch is 0... yes, but that really should be transparent to the user -- what epoch is chosen should influence as little as possible (e.g. only the range of values representable) Mmh. How would you define a quarter unit ? [3M] ? But then, what if you want your year to start in December, say (we often use DJF/MAM/JJA/SON as a way to decompose a year in four 'hydrological' seasons, for example) And the federal fiscal year is Oct - Sept, so the first quarter is (Oct, Nov, Dec) -- clearly that needs to be flexible. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Your guys' discussion is a bit overwhelming for me in my currently jet-lagged state ( =) ) but I thought I would comment on a couple things, especially now with the input of another financial Python user (great!). Note that I use scikits.timeseries very little for a few reasons (a bit OT, but...): - Fundamental need to be able to work with multiple time series, especially performing operations involving cross-sectional data - I think it's a bit hard for lay people to use (read: ex-MATLAB/R users). This is just my opinion, but a few years ago I thought about using it and concluded that teaching people how to properly use it (a precision tool, indeed!) was going to cause me grief. - The data alignment problem, best explained in code: In [8]: ts Out[8]: 2000-01-05 00:00:00 0.0503706684002 2000-01-12 00:00:00 -1.7660004939 2000-01-19 00:00:00 1.11716758554 2000-01-26 00:00:00 -0.171029995265 2000-02-02 00:00:00 -0.99876580126 2000-02-09 00:00:00 -0.262729046405 In [9]: ts.index Out[9]: <class 'pandas.core.daterange.DateRange'> offset: <1 Week: kwds={'weekday': 2}, weekday=2>, tzinfo: None [2000-01-05 00:00:00, ..., 2000-02-09 00:00:00] length: 6 In [10]: ts2 = ts[:4] In [11]: ts2.index Out[11]: <class 'pandas.core.daterange.DateRange'> offset: <1 Week: kwds={'weekday': 2}, weekday=2>, tzinfo: None [2000-01-05 00:00:00, ..., 2000-01-26 00:00:00] length: 4 In [12]: ts + ts2 Out[12]: 2000-01-05 00:00:00 0.1007413368 2000-01-12 00:00:00 -3.5320009878 2000-01-19 00:00:00 2.23433517109 2000-01-26 00:00:00 -0.34205999053 2000-02-02 00:00:00 NaN 2000-02-09 00:00:00 NaN Either ts or ts2 could be completely DateRange-naive (e.g. they have no way of knowing that they are fixed-frequency), or even out of order, and stuff like this will work no problem. I view the fixed frequency issue as sort of an afterthought-- if you need it, it's there for you (the DateRange class is a valid Index--label vector--for pandas objects, and provides an API for defining custom time deltas).
Which leads me to: - Inability to derive custom offsets: I can do: In [14]: ts.shift(2, offset=2 * datetools.BDay()) Out[14]: 2000-01-11 00:00:00 0.0503706684002 2000-01-18 00:00:00 -1.7660004939 2000-01-25 00:00:00 1.11716758554 2000-02-01 00:00:00 -0.171029995265 2000-02-08 00:00:00 -0.99876580126 2000-02-15 00:00:00 -0.262729046405 or even generate, say, 5-minutely or 10-minutely date ranges thusly: In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00', offset=datetools.Minute(5)) Out[16]: <class 'pandas.core.daterange.DateRange'> offset: <5 Minutes>, tzinfo: None [2011-06-08 05:00:00, ..., 2011-06-08 12:00:00] length: 85 I'm currently working on high perf reduceat-based resampling methods (e.g. converting secondly data to 5-minutely data). So in summary, w.r.t. time series data and datetime, the only things I care about from a datetime / pandas point of view: - Ability to easily define custom timedeltas - Generate datetime objects, or some equivalent, which can be used to back pandas data structures - (possible now??) Ability to have a set of frequency-naive dates (possibly not in order). This last point actually matters. Suppose you wanted to get the worst 5-performing days in the S&P 500 index: In [7]: spx.index Out[7]: <class 'pandas.core.daterange.DateRange'> offset: <1 BusinessDay>, tzinfo: None [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] length: 2963 # but this is OK In [8]: spx.order()[:5] Out[8]: 2008-10-15 00:00:00 -0.0903497960942 2008-12-01 00:00:00 -0.0892952780505 2008-09-29 00:00:00 -0.0878970494885
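A sketch of the reduceat-based resampling mentioned above (assuming sorted integer timestamps; the variable names are made up, and pandas' eventual implementation may differ):

import numpy as np

ts = np.arange(3600)                 # one hour of secondly timestamps
vals = np.random.randn(ts.size)
edges = np.arange(0, 3600, 300)      # 5-minute bucket starts
starts = np.searchsorted(ts, edges)  # index of first sample in each bucket
five_min_sums = np.add.reduceat(vals, starts)
# caveat: an empty bucket would trip the reduceat quirk discussed in the
# "Questionable reduceat behavior" thread above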
Re: [Numpy-discussion] fixing up datetime
On Wed, Jun 8, 2011 at 6:37 AM, Dave Hirschfeld dave.hirschf...@gmail.com wrote: Wes McKinney wesmckinn at gmail.com writes: - Fundamental need to be able to work with multiple time series, especially performing operations involving cross-sectional data - I think it's a bit hard for lay people to use (read: ex-MATLAB/R users). This is just my opinion, but a few years ago I thought about using it and concluded that teaching people how to properly use it (a precision tool, indeed!) was going to cause me grief. - The data alignment problem, best explained in code: - Inability to derive custom offsets: I can do: In [14]: ts.shift(2, offset=2 * datetools.BDay()) Out[14]: 2000-01-11 00:00:00 0.0503706684002 2000-01-18 00:00:00 -1.7660004939 2000-01-25 00:00:00 1.11716758554 2000-02-01 00:00:00 -0.171029995265 2000-02-08 00:00:00 -0.99876580126 2000-02-15 00:00:00 -0.262729046405 or even generate, say, 5-minutely or 10-minutely date ranges thusly: In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00', offset=datetools.Minute(5)) Out[16]: <class 'pandas.core.daterange.DateRange'> offset: <5 Minutes>, tzinfo: None [2011-06-08 05:00:00, ..., 2011-06-08 12:00:00] length: 85 It would be nice to have a step argument in the date_array function. The following works though: In [96]: integers = r_[ts.Date('T', '01-Aug-2011 05:00').value: ts.Date('T', '06-Aug-2011 12:01').value: 5] In [97]: ts.date_array(integers, freq='T') Out[97]: DateArray([01-Aug-2011 05:00, 01-Aug-2011 05:05, 01-Aug-2011 05:10, ..., 06-Aug-2011 11:45, 06-Aug-2011 11:50, 06-Aug-2011 11:55, ..., 06-Aug-2011 12:00], freq='T') - (possible now??) Ability to have a set of frequency-naive dates (possibly not in order). This last point actually matters. Suppose you wanted to get the worst 5-performing days in the S&P 500 index: In [7]: spx.index Out[7]: <class 'pandas.core.daterange.DateRange'> offset: <1 BusinessDay>, tzinfo: None [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] length: 2963 # but this is OK In [8]: spx.order()[:5] Out[8]: 2008-10-15 00:00:00 -0.0903497960942 2008-12-01 00:00:00 -0.0892952780505 2008-09-29 00:00:00 -0.0878970494885 2008-10-09 00:00:00 -0.0761670761671 2008-11-20 00:00:00 -0.0671229140321 - W Seems like you're looking for the tssort function: In [90]: series = ts.time_series(randn(365),start_date='01-Jan-2011',freq='D') In [91]:

def tssort(series):
    indices = series.argsort()
    return ts.time_series(series._series[indices],
                          series.dates[indices],
                          autosort=False)

In [92]: tssort(series)[0:5] Out[92]: timeseries([-2.96602612 -2.81612524 -2.61558511 -2.59522921 -2.4899447 ], dates = [26-Aug-2011 18-Apr-2011 27-Aug-2011 21-Aug-2011 19-Nov-2011], freq = D) Pandas seems to offer an even higher level api than scikits.timeseries. I view them as mostly complementary but I haven't (yet) had much experience with pandas... -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I admit that my partial ignorance about scikits.timeseries (due to seldom use) makes my statements a bit unfair-- but you're right in that it's a high-level versus low-level thing. Just about everything is possible but may require a certain level of experience / sophistication / deep understanding (as you have). ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fixing up datetime
On Mon, Jun 6, 2011 at 8:16 AM, Mark Dickinson mdickin...@enthought.com wrote: On Thu, Jun 2, 2011 at 5:42 PM, Mark Wiebe mwwi...@gmail.com wrote: Leap years are easy compared with leap seconds. Leap seconds involve a hardcoded table of particular leap-seconds that are added or subtracted, and are specified roughly 6 months in advance of when they happen by the International Earth Rotation and Reference Systems Service (IERS). The POSIX time_t doesn't store leap seconds, so if you subtract two time_t values you may get the wrong answer by up to 34 seconds (or 24, I'm not totally clear on the initial 10 second jump that happened somewhere). Times in the future would be hairy, too. If leap seconds are supported, how many seconds are there in the timedelta: 2015-01-01 00:00:00 - 2010-01-01 00:00:00 ? Is it acceptable for the result of a subtraction like this to change when the leap second table is updated (e.g., to reflect a newly added leap second on June 30th, 2012)? Mark ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Tangential to the technical details being discussed, but my $0.02 on datetime generally: One thing I tried to do in pandas is enable the implementation of custom time deltas (date offsets in pandas parlance). For example, let's say you add 5 weekdays (or business days) to a particular date. Now, there are some arbitrary choices you have to make, e.g. what do you do when the date being added to is not a business day? If you look in https://github.com/wesm/pandas/blob/master/pandas/core/datetools.py you can see some examples. Things are a bit of a mess in places but the key ideas are there. One nice thing about this framework is that you can create subclasses of DateOffset that connect to some external source of information, e.g. trading exchange information-- so if you want to shift by trading days for that exchange-- e.g. include holidays and other closures, you can do that relatively seamlessly. The downside is that it's fairly slow, generally speaking. I'm hopeful that the datetime64 dtype will enable scikits.timeseries and pandas to consolidate much of the datetime / frequency code. scikits.timeseries has a ton of great stuff for generating dates with all the standard fixed frequencies. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
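To make the DateOffset idea concrete, here is a rough sketch of a custom offset that consults an externally supplied holiday set. The class name and structure are invented for illustration; the real classes live in pandas/core/datetools.py and differ in detail:

from datetime import datetime, timedelta

class ExchangeDay(object):
    # hypothetical custom offset: skips weekends and supplied holidays
    def __init__(self, n=1, holidays=()):
        self.n = n
        self.holidays = set(holidays)

    def _is_open(self, dt):
        return dt.weekday() < 5 and dt.date() not in self.holidays

    def apply(self, dt):
        # step one calendar day at a time, counting only open days
        step = timedelta(days=1 if self.n >= 0 else -1)
        remaining = abs(self.n)
        while remaining > 0:
            dt = dt + step
            if self._is_open(dt):
                remaining -= 1
        return dt

offset = ExchangeDay(5, holidays=[datetime(2011, 7, 4).date()])
print(offset.apply(datetime(2011, 6, 30)))  # 2011-07-08, skipping July 4th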
Re: [Numpy-discussion] fixing up datetime
On Wed, Jun 1, 2011 at 10:29 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Jun 1, 2011 at 3:16 PM, Mark Wiebe mwwi...@gmail.com wrote: On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris charlesr.har...@gmail.com wrote: snip Just a quick comment, as this really needs more thought, but time is a bag of worms. Certainly a bag of worms, I agree. Trying to represent some standard -- say seconds at the solar system barycenter to account for general relativity -- is something that I think is too complicated and specialized to put into numpy. Good support for units and delta times is very useful, This part works fairly well now, except for some questions like what should datetime(2011-01-30, D) + timedelta(1, M) produce. Maybe 2011-02-28, or 2011-03-02? but parsing dates and times I've implemented an almost-ISO 8601 date-time parser. I had to deviate a bit to support years outside the little 10000-year window we use. I think supporting more formats could be handled by writing a function which takes its date format and outputs ISO 8601 format to feed numpy. Heh, 10000 years is way outside the window in which the Gregorian calendar is useful anyway. Now, strict Julian days start at the creation of the world so that's a real standard, sort of like degrees Kelvin ;) and handling timezones, daylight savings, The only timezone to handle is local, and it looks like standard C has a rich enough API for this to be ok. leap seconds, This one can be mitigated by referencing TAI or GPS time, I think. POSIX time looks like a can of worms though. business days, I think Enthought has some interest in a decent business day functionality. Some feedback from people doing time series and financial modeling would help clarify this though. This link may provide some context: http://www.financialcalendar.com/Data/Holidays/Formats Well, that can change. Likewise, who wants to track IERS for the leap seconds? I think the important thing is that the low level support needed to implement such things is available. A package for common use cases that serves as an example and that can be extended/modified by others would also be helpful. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Being involved in financial data analysis (e.g. via pandas) this stuff will affect me very directly. I'll get back to you soon with some comments. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
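As a footnote on the business-day question: NumPy did eventually grow this functionality (np.busday_offset and np.busday_count landed with the datetime64 rework, in the 1.7 era). A small example of the behavior being discussed, with a made-up holiday list:

import numpy as np

# five business days after Wednesday 2011-06-08, treating 2011-06-13 as a holiday
print(np.busday_offset('2011-06-08', 5, holidays=['2011-06-13']))
# -> 2011-06-16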
Re: [Numpy-discussion] labeled axes
On Tue, May 24, 2011 at 6:39 PM, Craig Yoshioka crai...@me.com wrote: Hi all, I've read some discussions about adding labeled axes, and even ticks, to numpy arrays (such as in Luis' dataarray). I have recently found that the ability to label axes would be very helpful to me, but I'd like to keep the implementation as lightweight as possible. The reason I would find this useful is because I am writing a ndarray subclass that loads image/volume file formats into numpy arrays. Some of these files might have multiple images/volumes, I'll call them channels, and also may have an additional dimension for vectors associated with each pixel/voxel, like color. The max dims of the array would then be 5. Example: data = ndarray([1023,128,128,128,3]) might mean (channels,z,y,x,rgb) for one array. Now I want to keep as much of the fancy indexing capabilities of numpy as I can, but I am finding it difficult to track the removal of axes that can occur from indexing. For example data[2,2] would return an array of shape (128,128,3), or the third slice through the third volume in the dataset, but the returned array has lost the meaning associated with its axes, so saving it back out would require manual relabeling of the axes. I'd like to be able to track the axes as metadata and retain all the fancy numpy indexing. There are two ways I could accomplish this with minimal code on the python side: One would be if indexing of the array always returned an array of the same dimensionality, that is data[2,2] returned an array of shape (1,1,128,128,3). I could then delete the degenerate axes labels from the metadata, and return the compressed array, resulting in the same output:

class Data(np.ndarray):
    def __getitem__(self, indices):
        data = np.ndarray.__getitem__(self, indices, donotcompress=True)  # as an example
        data.axeslabels = [label for label, dim in zip(self.axeslabels, data.shape) if dim > 1]
        return data.compress()
    def __getslice__(self, s1, s2, step):  # trivial case
        ...

Another approach would be if there is some function in the numpy internals that I could use to get the needed information before calling the ndarray's __getitem__ function:

class Data(np.ndarray):
    def __getitem__(self, indices):
        unique = np.uniqueIndicesPerDimension(indices)
        data = np.ndarray.__getitem__(self, indices)
        data.axeslabels = [label for label, dim in zip(self.axeslabels, unique) if dim > 1]
        return data

Finally, I could implement my own parser for the passed indices to figure this out myself. This would be bad since I'd have to recreate a lot of the same code that must go on inside numpy, and it would be slower, error-prone, etc.:

class Data(np.ndarray):
    def __getitem__(self, indices):
        indices = self.uniqueDimensionIndices(indices)
        data = np.ndarray.__getitem__(self, indices)
        data.axeslabels = [label for label, dim in zip(self.axeslabels, indices) if dim > 1]
        return data
    def uniqueDimensionIndices(self, indices):
        if isinstance(indices, int):
            indices = (indices,)
        if isinstance(indices, tuple):
            ...
        elif isinstance(indices, list):
            ...

Is there anything in the numpy internals already that would allow me to do #1 or #2? I don't think #3 is a very good option. Thanks! ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I would recommend joining or at least following the datarray project-- I believe it will do what you want, and we are working actively to build out this functionality.
Here are some links: Datarray Github: https://github.com/fperez/datarray docs: http://fperez.github.com/datarray-doc/ Podcast we did about the recent datarray meeting: http://inscight.org/2011/05/18/episode_13/ Other projects to consider using: larry and pandas-- these support data alignment, which you may not care about. In pandas I'm only concerned with data with ndim <= 3, a bit specific to statistics/econometrics/finance applications. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
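For what it's worth, a bare-bones version of Craig's option #3 is not much code if only integer and slice indices need to be supported. A sketch under exactly that assumption (no Ellipsis, newaxis, or fancy indexing; all names invented):

import numpy as np

class Labeled(np.ndarray):
    def __new__(cls, data, labels):
        obj = np.asarray(data).view(cls)
        obj.labels = list(labels)
        return obj

    def __array_finalize__(self, obj):
        self.labels = getattr(obj, 'labels', None)

    def __getitem__(self, indices):
        result = np.ndarray.__getitem__(self, indices)
        if not isinstance(indices, tuple):
            indices = (indices,)
        if isinstance(result, Labeled) and self.labels is not None:
            # an integer index removes its axis; a slice keeps it
            kept = [lab for lab, idx in zip(self.labels, indices)
                    if not isinstance(idx, int)]
            kept += self.labels[len(indices):]  # untouched trailing axes
            result.labels = kept
        return result

data = Labeled(np.zeros((1023, 128, 128, 128, 3)),
               ['channels', 'z', 'y', 'x', 'rgb'])
print(data[2, 2].labels)  # ['y', 'x', 'rgb']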
Re: [Numpy-discussion] datarray -- a couple of questions
On Sun, May 22, 2011 at 10:49 AM, Ben Walsh ben_w_...@yahoo.co.uk wrote: Hi I've been looking at various labelled-array packages, including Pandas (https://github.com/wesm/pandas) and datarray (https://github.com/fperez/datarray). I was quite interested in the design discussion for datarray, and I had a couple of questions: 1) Is this mailing list the correct place to ask about this? 2) Pandas supports mixed data types (eg. one column of floats, another column of strings). Does the datarray prototype support this? Is it simply a case of using a numpy.ndarray with dtype='object'? Cheers Ben ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion hi Ben, others can comment, but I think the answer to 1) is yes (if you have more specific usage questions about pandas feel free to ask on pystatsmodels, too). 2) datarray is currently conceived as a homogeneously typed object. So you could use dtype=object but occasionally have computational snafus. pandas tries to avoid these problems by segregating the floating point and non-floating point data. In recent discussions I do not think datarray is intended to be a step-in replacement for something like pandas, which is geared toward solving a more domain-specific set of problems (e.g. R-like data manipulations, time series / finance work, etc.). Rather, a future version of pandas will use datarray as its base layer. So if you're looking for an (R) data.frame-like container, pandas.{DataFrame, DataMatrix} is probably your best bet for now. larry is also worth looking at, but like datarray it also does not do mixed-type data. cheers, Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
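A quick illustration of the trade-off described above, using a made-up two-column table: an object ndarray keeps the type mix but gives up fast numeric operations, while a DataFrame keeps the float column as a real float array:

import numpy as np
from pandas import DataFrame

arr = np.array([[1.5, 'a'], [2.5, 'b']], dtype=object)
print(arr.dtype)       # object -- arithmetic falls back to Python-object speed

df = DataFrame({'x': [1.5, 2.5], 'y': ['a', 'b']})
print(df['x'].dtype)   # float64 -- numeric ops on this column stay fast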
[Numpy-discussion] Bug? NumPy types circumvent some overloaded methods in ndarray subclasses
This strikes me as a bug-- haven't checked NumPy 1.6 yet but this happens in 1.5.1. Here's a toy example:

import numpy as np

class MyNdarray(np.ndarray):
    def __new__(cls, data):
        subarr = np.array(data, dtype=np.float64).view(cls)
        return subarr

    def __radd__(self, other):
        print 'hello'
        return (self.view(np.ndarray) + other).view(MyNdarray)

    def __add__(self, other):
        print 'hello'
        return (self.view(np.ndarray) + other).view(MyNdarray)

In [25]: foo = MyNdarray(np.arange(10.))
In [26]: foo + 3
hello
Out[26]: MyNdarray([ 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.])
In [27]: foo + np.float64(3)
hello
Out[27]: MyNdarray([ 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.])
In [28]: 3 + foo
hello
Out[28]: MyNdarray([ 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.])
In [29]: np.float64(3) + foo
Out[29]: MyNdarray([ 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.])

Note that in the last case no 'hello' is printed: the np.float64 scalar handles the addition itself through the ufunc machinery, so MyNdarray.__radd__ is never called. I'm not sure if there's an easy workaround here. I thought maybe it was an array priority issue but after some tinkering it doesn't look like it. Any ideas? - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
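One partial workaround, sketched under the assumption that the scalar addition is routed through np.add (which is what the missing 'hello' suggests): the subclass's __array_wrap__ does get invoked for ufunc results, so custom post-processing can live there instead of __radd__. This hooks the ufunc output rather than making Python dispatch to __radd__, and the two-argument-plus-context signature shown is the one NumPy of this vintage uses:

import numpy as np

class MyNdarray2(np.ndarray):  # hypothetical variant of the class above
    def __new__(cls, data):
        return np.array(data, dtype=np.float64).view(cls)

    def __array_wrap__(self, out_arr, context=None):
        # ufunc results, including np.float64(3) + foo, pass through here
        print('hello from __array_wrap__')
        return out_arr.view(MyNdarray2)

foo = MyNdarray2(np.arange(10.))
np.float64(3) + foo  # prints, and the result is still a MyNdarray2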
Re: [Numpy-discussion] Data standardizing
On Wed, Apr 13, 2011 at 9:50 AM, Jonathan Rocher jroc...@enthought.com wrote: Hi, I assume you have this data in a txt file, correct? You can load up all of it in a numpy array using

import numpy as np
data = np.loadtxt("climat_file.txt", skiprows=1)

Then you can compute the mean you want by taking it on a slice of the data array. For example, if you want to compute the mean of your data in Jan for 1950-1970 (say including 1970); note that the rows are 0-based from 1900, so those years are rows 50-70:

mean1950_1970 = data[50:71, 1].mean()

Then the std deviation you want could be computed using

my_std = np.sqrt(np.mean((data[:, 1] - mean1950_1970)**2))

Hope this helps, Jonathan On Tue, Apr 12, 2011 at 1:48 PM, Climate Research climatef...@gmail.com wrote: Hi I am purely new to python and numpy.. I am using python for doing statistical calculations to Climate data.. I have a data set in the following format..

Year  Jan   Feb   Mar  Apr ...  Dec
1900  1000  1001  ...
1901  1011  1012  ...
1902  1009  1007  ...
...
2010  1008  1002  ...

I actually want to standardize each of these values with corresponding standard deviations for each monthly data column.. I have found out the standard deviations for each column.. but now i need to find the standard deviation only for a prescribed mean value ie, when i am finding the standard deviation for the January data column.. the mean should be calculated only for the january data, say from 1950-1970. With this mean i want to calculate the SD for entire column. Any help will be appreciated.. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Jonathan Rocher, PhD Scientific software developer Enthought, Inc. jroc...@enthought.com 1-512-536-1057 http://www.enthought.com ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion To standardize the data over each column you'll want to do: (data - data.mean(axis=0)) / data.std(axis=0, ddof=1) Note the broadcasting behavior of the (matrix - vector) operation--see NumPy documentation for more details. The ddof=1 is there to give you the (unbiased) sample standard deviation. <shameless plug> If you're looking for data structures to carry around your metadata (dates and month labels), look to pandas (my project: http://pandas.sourceforge.net/) or larry (http://larry.sourceforge.net/). </shameless plug> - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
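To make the broadcasting in that one-liner explicit, a tiny self-contained example with made-up numbers (rows play the role of years, columns of months):

import numpy as np

data = np.array([[1., 10.],
                 [2., 20.],
                 [3., 30.]])
# mean/std have shape (2,) and broadcast across the (3, 2) rows
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
print(z.mean(axis=0))         # ~[0. 0.]
print(z.std(axis=0, ddof=1))  # [1. 1.]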
Re: [Numpy-discussion] Remove deprecated skiprows
On Tue, Apr 5, 2011 at 5:52 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Tue, Apr 5, 2011 at 11:45 PM, Skipper Seabold jsseab...@gmail.com wrote: On Sun, Apr 3, 2011 at 8:20 PM, Charles R Harris charlesr.har...@gmail.com wrote: Should skiprows be removed?

if skiprows:
    warnings.warn(
        "The use of `skiprows` is deprecated, it will be removed in numpy 2.0.\n"
        "Please use `skip_header` instead.", DeprecationWarning)
    skip_header = skiprows

It's been deprecated since 1.4. Personally, I like skiprows better than skip_header ;) +1 for skiprows. I always have to look it up. To me one is not much better than the other, but -1 for skiprows because un-deprecating it and deprecating skip_header is inconsistent and annoying for users. Ralf ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion FYI: http://docs.python.org/library/warnings.html DeprecationWarning: Base category for warnings about deprecated features (ignored by default). Maybe DeprecationWarnings should be changed to something that isn't ignored by default so users see them? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Remove deprecated skiprows
On Tue, Apr 5, 2011 at 7:32 PM, Robert Kern robert.k...@gmail.com wrote: On Tue, Apr 5, 2011 at 18:20, Wes McKinney wesmck...@gmail.com wrote: FYI: http://docs.python.org/library/warnings.html DeprecationWarning Base category for warnings about deprecated features (ignored by default). Maybe DeprecationWarnings should be changed to something that isn't ignored by default so users see it? The Python devs made DeprecationWarnings silent by default for a reason. They unnecessarily annoy and frighten application end users who don't know what they are, where they are coming from, or how to silence them. The recommended practice is for library users (who do or should know those things) to run their test suites with the warnings turned on. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion You make a very convincing point-- makes complete sense. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
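For library authors following Robert's advice, turning the warnings on in a test suite is a one-liner with the standard library (nothing numpy-specific):

import warnings

# surface DeprecationWarnings while the tests run...
warnings.simplefilter('always', DeprecationWarning)

# ...or escalate them to errors so deprecated calls fail loudly:
# warnings.simplefilter('error', DeprecationWarning)

The same effect is available from the command line via python -Wd (or -W error::DeprecationWarning).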
Re: [Numpy-discussion] [Numpy-svn] aada93: ENH: Add 'subok' parameter to PyArray_NewLikeArray...
On Tue, Mar 15, 2011 at 3:25 PM, Sebastian Haase seb.ha...@gmail.com wrote: On Tue, Mar 15, 2011 at 7:22 PM, nore...@github.com wrote:

Branch: refs/heads/master
Home: https://github.com/numpy/numpy
Commit: aada93306acfb4e2eb816faf32652edf8825cf45
https://github.com/numpy/numpy/commit/aada93306acfb4e2eb816faf32652edf8825cf45
Author: Mark Wiebe mwwi...@gmail.com
Date: 2011-03-15 (Tue, 15 Mar 2011)
Changed paths:
  M doc/source/reference/c-api.array.rst
  M numpy/add_newdocs.py
  M numpy/core/numeric.py
  M numpy/core/src/multiarray/convert.c
  M numpy/core/src/multiarray/ctors.c
  M numpy/core/src/multiarray/multiarraymodule.c
  M numpy/core/src/umath/ufunc_object.c
  M numpy/core/tests/test_numeric.py
Log Message:
ENH: Add 'subok' parameter to PyArray_NewLikeArray, np.empty_like, np.zeros_like, and np.ones_like. This way, the sub-type can be avoided if necessary. This helps mitigate, but doesn't fix, ticket #1753, by allowing b = np.empty_like(a, subok=False).

I'm really not in a position to comment on the depths of the numpy API, but my understanding of the np.*_like(...) functions was that they would always create normal ndarrays, just taking shape and dtype from the given array. So what should the interpretation of subok be? Can you elaborate...? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I know that things like numpy.ma and pandas are directly impacted by this change-- in NumPy 1.6 many API functions cannot be used on subclasses because they discard any additional information you wish to be passed on (like the mask in a masked array, for example). ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
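To make the new parameter concrete: with a masked array, subok=False hands back a plain ndarray instead of another MaskedArray. A minimal demonstration:

import numpy as np

m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(type(np.empty_like(m)))               # MaskedArray (subclass kept)
print(type(np.empty_like(m, subok=False)))  # plain numpy.ndarray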
Re: [Numpy-discussion] broadcasting behavior for 1.6 (was: Numpy 1.6 schedule)
On Fri, Mar 11, 2011 at 9:57 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Fri, Mar 11, 2011 at 7:42 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Fri, Mar 11, 2011 at 2:01 AM, Ralf Gommers ralf.gomm...@googlemail.com wrote: I'm just going through the very long 1.6 schedule thread to see what is still on the TODO list before a 1.6.x branch can be made. So I'll send a few separate mails, one for each topic. On Mon, Mar 7, 2011 at 8:30 PM, Francesc Alted fal...@pytables.org wrote: A Sunday 06 March 2011 06:47:34 Mark Wiebe escrigué: I think it's ok to revert this behavior for backwards compatibility, but believe it's an inconsistent and unintuitive choice. In broadcasting, there are two operations, growing a dimension 1 -> n, and appending a new 1 dimension to the left. The behaviour under discussion in assignment is different from normal broadcasting in that only the second one is permitted. It is broadcasting the output to the input, rather than broadcasting the input to the output. Suppose a has shape (20,), b has shape (1,20), and c has shape (20,1). Then a+b has shape (1,20), a+c has shape (20,20), and b+c has shape (20,20). If we do b[...] = a, a will be broadcast to match b by adding a 1 dimension to the left. This is reasonable and consistent with addition. If we do a[...]=b, under 1.5 rules, a will once again be broadcast to match b by adding a 1 dimension to the left. If we do a[...]=c, we could broadcast both a and c together to the shape (20,20). This results in multiple assignments to each element of a, which is inconsistent. This is not analogous to a+c, but rather to np.add(c, c, out=a). The distinction is subtle, but the inconsistent behavior is harmless enough for assignment that keeping backwards compatibility seems reasonable. For what is worth, I also like the behaviour that Mark proposes, and have updated tables test suite to adapt to this. But I'm fine if it is decided to revert to the previous behaviour. The conclusion on this topic, as I read the discussion, is that we need to keep backwards compatible behavior (even though the proposed change is more intuitive). Has backwards compatibility been fixed already? I don't think an official conclusion was reached, at least in so far as numpy has an official anything ;) But this change does show up as an error in one of the pandas tests, so it is likely to affect other folks as well. Probably the route of least compatibility hassle is to revert to the old behavior and maybe switch to the new behavior, which I prefer, for 2.0. That said, apart from pandas and pytables, and the latter has been fixed, the new behavior doesn't seem to have much fallout. I think it actually exposes unnoticed assumptions in code that slipped by because there was no consequence. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I've fixed the pandas issue-- I'll put out a bugfix release whenever NumPy 1.6 final is out. I don't suspect it will cause very many problems (and those problems will--hopefully--be easy to fix). ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
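The shapes under discussion, spelled out. The addition results are uncontroversial; the commented-out assignment is the ambiguous case, which NumPy as shipped today rejects:

import numpy as np

a = np.zeros(20)        # shape (20,)
b = np.zeros((1, 20))
c = np.zeros((20, 1))

print((a + b).shape)    # (1, 20)
print((a + c).shape)    # (20, 20)
print((b + c).shape)    # (20, 20)

b[...] = a              # fine: a gains a leading length-1 dimension
# a[...] = c            # would assign each element of `a` 20 times;
#                       # modern NumPy raises a broadcast error here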
Re: [Numpy-discussion] {zeros, empty, ...}_like functions behavior with subclasses
On Mon, Feb 28, 2011 at 10:52 PM, Wes McKinney wesmck...@gmail.com wrote: On Mon, Feb 28, 2011 at 7:24 PM, Pierre GM pgmdevl...@gmail.com wrote: On Mar 1, 2011, at 1:05 AM, Bruce Southey wrote: On Mon, Feb 28, 2011 at 4:52 PM, Wes McKinney wesmck...@gmail.com wrote: I'm having some trouble with the zeros_like function via np.fix: def zeros_like(a): if isinstance(a, ndarray): res = ndarray.__new__(type(a), a.shape, a.dtype, order=a.flags.fnc) res.fill(0) return res try: wrap = a.__array_wrap__ except AttributeError: wrap = None a = asarray(a) res = zeros(a.shape, a.dtype) if wrap: res = wrap(res) return res As you can see this is going to discard any metadata stored in a subtype. I'm not sure whether this is a bug or a feature but wanted to bring it up. Thanks, Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I guess this is ticket 929. http://projects.scipy.org/numpy/ticket/929 I was looking at it today but was not sure what is really desired here. I considered that this just meant shape and dtype but not sure about masked or record arrays behavior. So: What is the value of having the metadata? What is the meaning of 'like' here? Well, that depends on what you wanna do, of course. To handle metadata, I use some kind of dictionary updated in the __array_finalize__. Check numpy.ma.MaskedArray and its subclasses (like scikits.timeseries.TimeSeries) for the details. Now that you could store some extra data in the dtype (if I remember and understand correctly), it might be worth considering a proper way to deal with that. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion The ticket is exactly related to the problem at hand-- having __array_finalize__ defined won't help you as it never gets called. Looks like this commit fixed the problem, so the ticket can be closed https://github.com/numpy/numpy/commit/c9d1849332ae5bf73299ea1268f6a55f78624688#numpy/core/numeric.py ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] {zeros, empty, ...}_like functions behavior with subclasses
I'm having some trouble with the zeros_like function via np.fix:

def zeros_like(a):
    if isinstance(a, ndarray):
        res = ndarray.__new__(type(a), a.shape, a.dtype, order=a.flags.fnc)
        res.fill(0)
        return res
    try:
        wrap = a.__array_wrap__
    except AttributeError:
        wrap = None
    a = asarray(a)
    res = zeros(a.shape, a.dtype)
    if wrap:
        res = wrap(res)
    return res

As you can see this is going to discard any metadata stored in a subtype. I'm not sure whether this is a bug or a feature but wanted to bring it up. Thanks, Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
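A concrete demonstration of the metadata loss (the Tagged subclass and its tag attribute are invented stand-ins): because the implementation above goes through ndarray.__new__, __array_finalize__ sees obj=None and the metadata is gone:

import numpy as np

class Tagged(np.ndarray):
    def __new__(cls, data, tag=None):
        obj = np.asarray(data).view(cls)
        obj.tag = tag
        return obj

    def __array_finalize__(self, obj):
        # obj is None when the array comes from ndarray.__new__
        self.tag = getattr(obj, 'tag', None)

a = Tagged([1., 2., 3.], tag='important')
print(np.zeros_like(a).tag)  # None with the zeros_like quoted above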
Re: [Numpy-discussion] {zeros, empty, ...}_like functions behavior with subclasses
On Mon, Feb 28, 2011 at 7:24 PM, Pierre GM pgmdevl...@gmail.com wrote: On Mar 1, 2011, at 1:05 AM, Bruce Southey wrote: On Mon, Feb 28, 2011 at 4:52 PM, Wes McKinney wesmck...@gmail.com wrote: I'm having some trouble with the zeros_like function via np.fix: def zeros_like(a): if isinstance(a, ndarray): res = ndarray.__new__(type(a), a.shape, a.dtype, order=a.flags.fnc) res.fill(0) return res try: wrap = a.__array_wrap__ except AttributeError: wrap = None a = asarray(a) res = zeros(a.shape, a.dtype) if wrap: res = wrap(res) return res As you can see this is going to discard any metadata stored in a subtype. I'm not sure whether this is a bug or a feature but wanted to bring it up. Thanks, Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I guess this is ticket 929. http://projects.scipy.org/numpy/ticket/929 I was looking at it today but was not sure what is really desired here. I considered that this just meant shape and dtype but not sure about masked or record arrays behavior. So: What is the value of having the metadata? What is the meaning of 'like' here? Well, that depends on what you wanna do, of course. To handle metadata, I use some kind of dictionary updated in the __array_finalize__. Check numpy.ma.MaskedArray and its subclasses (like scikits.timeseries.TimeSeries) for the details. Now that you could store some extra data in the dtype (if I remember and understand correctly), it might be worth considering a proper way to deal with that. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion The ticket is exactly related to the problem at hand-- having __array_finalize__ defined won't help you as it never gets called. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Use directly restrict in C99 mode (__restrict does not have exactly the same semantics). For a valgrind profile, you can check my blog (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/) Basically, if you have a python script, you can run valgrind <options in my blog> python myscript.py For PAPI, you have to install several packages (the perf module for the kernel, for instance) and a GUI to analyze the results (in Eclipse, it should be possible). Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what a PAPI profile is ...? -Sebastian On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profile on your code to see what causes the issue (false sharing, ...) Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (0.9 ms vs. 50 ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization was disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) a 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)

---Ccode --
void dists2d(double *a_ps, int nx1, int na,
             double *b_ps, int nx2, int nb,
             double *dist, int nx3, int ny3) throw (char*)
{
    if(nx1 != 2) throw (char*) "a must be of shape (n,2)";
    if(nx2 != 2) throw (char*) "b must be of shape (n,2)";
    if(nx3 != nb || ny3 != na) throw (char*) "c must be of shape (na,nb)";
    double *dist_;
    int i, j;
#pragma omp parallel private(dist_, j, i)
    {
#pragma omp for nowait
        for(i=0; i<na; i++) {
            // num_threads = omp_get_num_threads();  --> 4
            dist_ = dist + i*nb;   // dist_ is only introduced for OpenMP
            for(j=0; j<nb; j++) {
                *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2])
                               + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
            }
        }
    }
}
---/Ccode --

There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I suppose that I did not get the #pragma omp lines right - any idea? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion As an aside, do you have a GPU-equipped machine? This function looks like it would be an easy CUDA target. Whether you'd want to go that route depends on who the users of the function are (e.g. do they also have the CUDA runtime). ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
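Another route worth mentioning before reaching for a GPU: the loop in this thread is exactly the Euclidean case of scipy.spatial.distance.cdist, which is already compiled C (array sizes here match the ones quoted above):

import numpy as np
from scipy.spatial.distance import cdist

a = np.random.randn(329, 2)
b = np.random.randn(340, 2)
dist = cdist(a, b)   # shape (329, 340), Euclidean by default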
Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote: Wes, I think I should have a couple of GPUs. I would be ready for anything ... if you think that I could do some easy(!) CUDA programming here, maybe you could guide me into the right direction... Thanks, Sebastian. On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote: On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Use directly restrict in C99 mode (__restrict does not have exactly the same semantics). For a valgrind profile, you can check my blog (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/) Basically, if you have a python script, you can run valgrind <options in my blog> python myscript.py For PAPI, you have to install several packages (the perf module for the kernel, for instance) and a GUI to analyze the results (in Eclipse, it should be possible). Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what a PAPI profile is ...? -Sebastian On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profile on your code to see what causes the issue (false sharing, ...) Matthieu 2011/2/15 Sebastian Haase seb.ha...@gmail.com Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (0.9 ms vs. 50 ms for arrays of length 329 and 340). I'm using gcc on Linux. Now I'm wondering if I could go even faster !? My hope that the compiler might automagically do some SSE2 optimization was disappointed. Since I have a 4 core CPU I thought OpenMP might be an option; I never used that, and after some playing around I managed to get (only) a 50% slowdown(!) :-( My code in short is this: (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)

---Ccode --
void dists2d(double *a_ps, int nx1, int na,
             double *b_ps, int nx2, int nb,
             double *dist, int nx3, int ny3) throw (char*)
{
    if(nx1 != 2) throw (char*) "a must be of shape (n,2)";
    if(nx2 != 2) throw (char*) "b must be of shape (n,2)";
    if(nx3 != nb || ny3 != na) throw (char*) "c must be of shape (na,nb)";
    double *dist_;
    int i, j;
#pragma omp parallel private(dist_, j, i)
    {
#pragma omp for nowait
        for(i=0; i<na; i++) {
            // num_threads = omp_get_num_threads();  --> 4
            dist_ = dist + i*nb;   // dist_ is only introduced for OpenMP
            for(j=0; j<nb; j++) {
                *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2])
                               + sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
            }
        }
    }
}
---/Ccode --

There is probably a simple mistake in this code - as I said I never used OpenMP before. It should be not too difficult to use OpenMP correctly here or - maybe better - is there a simple SSE(2,3,4) version that might be even better than OpenMP... !? I suppose that I did not get the #pragma omp lines right - any idea? Or is it in general not possible to speed this kind of code up using OpenMP !? Thanks, Sebastian Haase ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion As an aside, do you have a GPU-equipped machine? This function looks like it would be an easy CUDA target. Whether you'd want to go that route depends on who the users of the function are (e.g. do they also have the CUDA runtime).
Re: [Numpy-discussion] relational join
On Wed, Feb 2, 2011 at 4:46 PM, Robert Kern robert.k...@gmail.com wrote: On Wed, Feb 2, 2011 at 21:42, Ilya Shlyakhter ilya_...@alum.mit.edu wrote: Does numpy have a relational join operation for joining recordarrays?

[~]
|1> from numpy.lib import recfunctions
[~]
|2> recfunctions.join_by?
Type: function
Base Class: <type 'function'>
String Form: <function join_by at 0x17bba30>
Namespace: Interactive
File: /Library/Frameworks/Python.framework/Versions/6.3/lib/python2.6/site-packages/numpy/lib/recfunctions.py
Definition: recfunctions.join_by(key, r1, r2, jointype='inner', r1postfix='1', r2postfix='2', defaults=None, usemask=True, asrecarray=False)
Docstring:
Join arrays `r1` and `r2` on key `key`.

The key should be either a string or a sequence of strings corresponding to the fields used to join the array. An exception is raised if the `key` field cannot be found in the two input arrays. Neither `r1` nor `r2` should have any duplicates along `key`: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.

Parameters
----------
key : {string, sequence}
    A string or a sequence of strings corresponding to the fields used for comparison.
r1, r2 : arrays
    Structured arrays.
jointype : {'inner', 'outer', 'leftouter'}, optional
    If 'inner', returns the elements common to both r1 and r2. If 'outer', returns the common elements as well as the elements of r1 not in r2 and the elements of r2 not in r1. If 'leftouter', returns the common elements and the elements of r1 not in r2.
r1postfix : string, optional
    String appended to the names of the fields of r1 that are present in r2 but absent of the key.
r2postfix : string, optional
    String appended to the names of the fields of r2 that are present in r1 but absent of the key.
defaults : {dictionary}, optional
    Dictionary mapping field names to the corresponding default values.
usemask : {True, False}, optional
    Whether to return a MaskedArray (or MaskedRecords if `asrecarray==True`) or a ndarray.
asrecarray : {False, True}, optional
    Whether to return a recarray (or MaskedRecords if `usemask==True`) or just a flexible-type ndarray.

Notes
-----
* The output is sorted along the key.
* A temporary array is formed by dropping the fields not in the key for the two arrays and concatenating the result. This array is then sorted, and the common entries selected. The output is constructed by filling the fields with the selected entries. Matching is not preserved if there are some duplicates...

-- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion It might also be worth your while to check out Keith Goodman's la (larry) library or my pandas library, which are both designed with relational data in mind. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
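A quick usage sketch of join_by with made-up structured arrays (field names invented for the example):

import numpy as np
from numpy.lib import recfunctions

r1 = np.array([(1, 10.0), (2, 20.0)], dtype=[('key', int), ('x', float)])
r2 = np.array([(1, 100.0), (3, 300.0)], dtype=[('key', int), ('y', float)])

joined = recfunctions.join_by('key', r1, r2, jointype='inner')
print(joined)  # only key == 1 is present in both inputs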
Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions
On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com wrote: On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com wrote: Keith (and others), What would you think about creating a library of mostly Cython-based domain specific functions? So stuff like rolling statistical moments, nan* functions like you have here, and all that-- NumPy-array only functions that don't necessarily belong in NumPy or SciPy (but could be included on down the road). You were already talking about this on the statsmodels mailing list for larry. I spent a lot of time writing a bunch of these for pandas over the last couple of years, and I would have relatively few qualms about moving these outside of pandas and introducing a dependency. You could do the same for larry-- then we'd all be relying on the same well-vetted and tested codebase. I've started working on moving window statistics cython functions. I plan to make it into a package called Roly (for rolling). The signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr, window, axis=-1), etc. I think of Nanny and Roly as two separate packages. A narrow focus is good for a new package. But maybe each package could be a subpackage in a super package? Would the function signatures in Nanny (exact duplicates of the corresponding functions in Numpy and Scipy) work for pandas? I plan to use Nanny in larry. I'll try to get the structure of the Nanny package in place. But if it doesn't attract any interest after that then I may fold it into larry. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Why make multiple packages? It seems like all these functions are somewhat related: practical tools for real-world data analysis (where observations are often missing). I suspect having everything under one hood would create more interest than chopping things up-- would be very useful to folks in many different disciplines (finance, economics, statistics, etc.). In R, for example, NA-handling is just a part of everyday life. Of course in R there is a special NA value which is distinct from NaN-- many folks object to the use of NaN for missing values. The alternative is masked arrays, but in my case I wasn't willing to sacrifice so much performance for purity's sake. I could certainly use the nan* functions to replace code in pandas where I've handled things in a somewhat ad hoc way. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions
On Sun, Nov 21, 2010 at 6:02 PM, josef.p...@gmail.com wrote: On Sun, Nov 21, 2010 at 5:09 PM, Keith Goodman kwgood...@gmail.com wrote: On Sun, Nov 21, 2010 at 12:30 PM, josef.p...@gmail.com wrote: On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman kwgood...@gmail.com wrote: On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney wesmck...@gmail.com wrote: On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgood...@gmail.com wrote: On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmck...@gmail.com wrote: Keith (and others), What would you think about creating a library of mostly Cython-based domain specific functions? So stuff like rolling statistical moments, nan* functions like you have here, and all that-- NumPy-array only functions that don't necessarily belong in NumPy or SciPy (but could be included on down the road). You were already talking about this on the statsmodels mailing list for larry. I spent a lot of time writing a bunch of these for pandas over the last couple of years, and I would have relatively few qualms about moving these outside of pandas and introducing a dependency. You could do the same for larry-- then we'd all be relying on the same well-vetted and tested codebase. I've started working on moving window statistics cython functions. I plan to make it into a package called Roly (for rolling). The signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr, window, axis=-1), etc. I think of Nanny and Roly as two separate packages. A narrow focus is good for a new package. But maybe each package could be a subpackage in a super package? Would the function signatures in Nanny (exact duplicates of the corresponding functions in Numpy and Scipy) work for pandas? I plan to use Nanny in larry. I'll try to get the structure of the Nanny package in place. But if it doesn't attract any interest after that then I may fold it into larry. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Why make multiple packages? It seems like all these functions are somewhat related: practical tools for real-world data analysis (where observations are often missing). I suspect having everything under one hood would create more interest than chopping things up-- would be very useful to folks in many different disciplines (finance, economics, statistics, etc.). In R, for example, NA-handling is just a part of every day life. Of course in R there is a special NA value which is distinct from NaN-- many folks object to the use of NaN for missing values. The alternative is masked arrays, but in my case I wasn't willing to sacrifice so much performance for purity's sake. I could certainly use the nan* functions to replace code in pandas where I've handled things in a somewhat ad hoc way. A package focused on NaN-aware functions sounds like a good idea. I think a good plan would be to start by making faster, drop-in replacements for the NaN functions that are already in numpy and scipy. That is already a lot of work. After that, one possibility is to add stuff like nancumsum, nanprod, etc. After that moving window stuff? and maybe group functions after that? Yes, group functions are on my list. If there is a lot of repetition, you could use templating. Even simple string substitution, if it is only replacing the dtype, works pretty well. It would at least reduce some copy-paste. Unit test coverage should be good enough to mess around with trying templating. What's a good way to go? 
Write my own script that creates the .pyx file and call it from the make file? Or are there packages for doing the templating? Depends on the scale, I tried once with simple string templates http://codespeak.net/pipermail/cython-dev/2009-August/006614.html here is a pastebin of another version by (?), http://pastebin.com/f1a49143d discussed on the cython-dev mailing list. The cython list has the discussion every once in a while but I haven't seen any conclusion yet. For heavier duty templating a proper templating package (Jinja?) might be better. I'm not an expert. Josef I added nanmean (the first scipy function to enter nanny) and nanmin. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion What would you say to a single package that contains: - NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.) - moving window functions (moving_{count, sum, mean, var, std, etc.}) - core subroutines for labeled data - group-by functions - other things to add to this list? In other words, basic building computational tools for making libraries like larry, pandas, etc. and doing time series / statistical / other manipulations on real world (messy) data sets. The focus isn't so much NaN-awareness per se but more practical data wrangling.
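As a sketch of the group-by item in that list (function name and signature invented; this is not Nanny or pandas API), a label-based aggregation can be built from argsort plus reduceat:

import numpy as np

def group_mean(data, labels):
    # hypothetical: mean of `data` within each distinct label
    order = np.argsort(labels, kind='mergesort')
    sorted_labels = labels[order]
    sorted_data = data[order]
    # positions where a new group starts
    starts = np.flatnonzero(np.r_[True, sorted_labels[1:] != sorted_labels[:-1]])
    sums = np.add.reduceat(sorted_data, starts)
    counts = np.diff(np.r_[starts, len(sorted_data)])
    return sorted_labels[starts], sums / counts

labels = np.array([1, 0, 1, 0, 2])
data = np.array([10., 1., 30., 3., 5.])
print(group_mean(data, labels))  # labels [0 1 2] -> means [2. 20. 5.]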
Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions
On Sun, Nov 21, 2010 at 6:37 PM, Keith Goodman kwgood...@gmail.com wrote: On Sun, Nov 21, 2010 at 3:16 PM, Wes McKinney wesmck...@gmail.com wrote: What would you say to a single package that contains: - NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.) I'd say yes. - moving window functions (moving_{count, sum, mean, var, std, etc.}) Yes. BTW, we both do arr=arr.astype(float), I think, before doing the moving statistics. So I speeded things up by running the moving window backwards and writing the result in place. - core subroutines for labeled data Not sure what this would be. Let's discuss. Basically want to produce an indexing vector based on rules-- something to pass to ndarray.take later on. And maybe your generic binary-op function from a while back? - group-by functions Yes. I have some ideas on function signatures. - other things to add to this list? A no-op function with a really long doc string! In other words, basic building computational tools for making libraries like larry, pandas, etc. and doing time series / statistical / other manipulations on real world (messy) data sets. The focus isn't so much NaN-awareness per se but more practical data wrangling. I would be happy to work on such a package and to move all the Cython code I've written into it. There's a little bit of datarray overlap potentially but I think that's OK. Maybe we should make a list of function signatures along with brief doc strings to get a feel for what we (and hopefully others) have in mind? I've personally never been much for writing specs, but could be useful. We probably aren't going to get it all right on the first try, so we'll just do our best and refactor the code later if necessary. We might be well-served by collecting exemplary data sets and making a list of things we would like to be able to do easily with that data. But writing stuff like:

moving_{funcname}(ndarray data, int window, int axis=0, int min_periods=window) -> ndarray
group_aggregate(ndarray data, ndarray labels, int axis=0, function agg_function) -> ndarray
group_transform(...)
... etc.

makes sense. Where should we continue the discussion? The pystatsmodels mailing list? By now the numpy list probably thinks of NaN as Not ANother email from this guy. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Maybe let's have the next thread on SciPy-user-- I think what we're talking about is general enough to be discussed there. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
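A pure-NumPy sketch of the proposed moving-window signature (the name and min_periods semantics follow the signatures above; a real implementation would be Cython, not this cumsum-based loop):

import numpy as np

def moving_sum(data, window, min_periods=None):
    if min_periods is None:
        min_periods = window
    csum = np.cumsum(np.concatenate(([0.0], data)))
    out = np.full(len(data), np.nan)  # too-short windows stay NaN
    for i in range(min_periods - 1, len(data)):
        lo = max(0, i + 1 - window)
        out[i] = csum[i + 1] - csum[lo]
    return out

print(moving_sum(np.arange(5.0), window=3))
# [nan nan  3.  6.  9.]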
Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions
On Sat, Nov 20, 2010 at 6:39 PM, Keith Goodman kwgood...@gmail.com wrote: On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgood...@gmail.com wrote: I should make a benchmark suite.

ny.benchit(verbose=False)
Nanny performance benchmark
    Nanny 0.0.1dev
    Numpy 1.4.1
    Speed is numpy time divided by nanny time
    NaN means all NaNs

   Speed   Test                Shape      dtype    NaN?
   6.6770  nansum(a, axis=-1)  (500,500)  int64
   4.6612  nansum(a, axis=-1)  (1,)       float64
   9.0351  nansum(a, axis=-1)  (500,500)  int32
   3.0746  nansum(a, axis=-1)  (500,500)  float64
  11.5740  nansum(a, axis=-1)  (1,)       int32
   6.4484  nansum(a, axis=-1)  (1,)       int64
  51.3917  nansum(a, axis=-1)  (500,500)  float64  NaN
  13.8692  nansum(a, axis=-1)  (1,)       float64  NaN
   6.5327  nanmax(a, axis=-1)  (500,500)  int64
   8.8222  nanmax(a, axis=-1)  (1,)       float64
   0.2059  nanmax(a, axis=-1)  (500,500)  int32
   6.9262  nanmax(a, axis=-1)  (500,500)  float64
   5.0688  nanmax(a, axis=-1)  (1,)       int32
   6.5605  nanmax(a, axis=-1)  (1,)       int64
  48.4850  nanmax(a, axis=-1)  (500,500)  float64  NaN
  14.6289  nanmax(a, axis=-1)  (1,)       float64  NaN

You can also use the makefile to run the benchmark: make bench ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Keith (and others), What would you think about creating a library of mostly Cython-based domain specific functions? So stuff like rolling statistical moments, nan* functions like you have here, and all that-- NumPy-array only functions that don't necessarily belong in NumPy or SciPy (but could be included on down the road). You were already talking about this on the statsmodels mailing list for larry. I spent a lot of time writing a bunch of these for pandas over the last couple of years, and I would have relatively few qualms about moving these outside of pandas and introducing a dependency. You could do the same for larry-- then we'd all be relying on the same well-vetted and tested codebase. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions
On Sat, Nov 20, 2010 at 6:54 PM, Wes McKinney wesmck...@gmail.com wrote: On Sat, Nov 20, 2010 at 6:39 PM, Keith Goodman kwgood...@gmail.com wrote: On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgood...@gmail.com wrote: I should make a benchmark suite.

ny.benchit(verbose=False)
Nanny performance benchmark
    Nanny 0.0.1dev
    Numpy 1.4.1
    Speed is numpy time divided by nanny time
    NaN means all NaNs

   Speed   Test                Shape      dtype    NaN?
   6.6770  nansum(a, axis=-1)  (500,500)  int64
   4.6612  nansum(a, axis=-1)  (1,)       float64
   9.0351  nansum(a, axis=-1)  (500,500)  int32
   3.0746  nansum(a, axis=-1)  (500,500)  float64
  11.5740  nansum(a, axis=-1)  (1,)       int32
   6.4484  nansum(a, axis=-1)  (1,)       int64
  51.3917  nansum(a, axis=-1)  (500,500)  float64  NaN
  13.8692  nansum(a, axis=-1)  (1,)       float64  NaN
   6.5327  nanmax(a, axis=-1)  (500,500)  int64
   8.8222  nanmax(a, axis=-1)  (1,)       float64
   0.2059  nanmax(a, axis=-1)  (500,500)  int32
   6.9262  nanmax(a, axis=-1)  (500,500)  float64
   5.0688  nanmax(a, axis=-1)  (1,)       int32
   6.5605  nanmax(a, axis=-1)  (1,)       int64
  48.4850  nanmax(a, axis=-1)  (500,500)  float64  NaN
  14.6289  nanmax(a, axis=-1)  (1,)       float64  NaN

You can also use the makefile to run the benchmark: make bench ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Keith (and others), What would you think about creating a library of mostly Cython-based domain specific functions? So stuff like rolling statistical moments, nan* functions like you have here, and all that-- NumPy-array only functions that don't necessarily belong in NumPy or SciPy (but could be included on down the road). You were already talking about this on the statsmodels mailing list for larry. I spent a lot of time writing a bunch of these for pandas over the last couple of years, and I would have relatively few qualms about moving these outside of pandas and introducing a dependency. You could do the same for larry-- then we'd all be relying on the same well-vetted and tested codebase. - Wes By the way I wouldn't mind pushing all of my datetime-related code (date range generation, date offsets, etc.) into this new library, too. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Anyone with Core i7 and Ubuntu 10.04?
On Mon, Nov 8, 2010 at 11:33 PM, David Warde-Farley warde...@iro.umontreal.ca wrote: On 2010-11-08, at 8:52 PM, David wrote: Please tell us what error you got - saying that something did not working is really not useful to help you. You need to say exactly what fails, and which steps you followed before that failure. I think what he means is that it's very slow, there's no discernable error but dot-multiplies don't seem to be using BLAS. David ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion Somewhat related topic: anyone know the status of EPD (Enthought distribution) releases on i7 processors as far as this goes? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Financial TS models
On Sat, Sep 18, 2010 at 10:33 AM, josef.p...@gmail.com wrote: On Sat, Sep 18, 2010 at 10:13 AM, josef.p...@gmail.com wrote: On Sat, Sep 18, 2010 at 8:09 AM, Virgil Stokes v...@it.uu.se wrote: I am considering the development of an all-Python package (with numpy and matplotlib) for the modeling and analysis of financial time series. This is a rather ambitious project and could take some time before I have something that is useful. Thus, any suggestions, pointers, etc. to related work would be appreciated. I should have just asked you: What do you have in mind? instead of writing my notes from the readings I just did into the email. I think any open source contributions in this area will be very useful with the increased popularity of python in Finance. Josef Depends on what you want to do, but I would join or build on top of an existing package. I just got distracted with an extended internet search after finding http://github.com/quantmind/dynts (They use Redis as an in-memory and persistent storage. After reading up a bit, I think this might be useful if you have a web front end http://github.com/lsbardel/jflow in mind, but maybe not as good as hdf5 for desktop work. Just guessing since I used neither, and I always worry about installation problems on Windows.) They just started public development but all packages are BSD-licensed from what I have seen. Otherwise, I would build on top of pandas, scikits.timeseries or larry, or tabular if you want to handle your own time variable. For specific financial time series, e.g. stocks, exchange rates, options, I have seen only bits and pieces, or only partially implemented code (with a BSD compatible license), outside of quantlib and its python bindings. Maybe someone knows more about what's available. For the econometrics/statistical analysis I haven't seen much outside of pandas and statsmodels in this area (besides isolated examples and recipes). I started to write on this in the statsmodels sandbox (including simulators). Modeling and analysis of financial time series is a big project, and the only way to get results within a reasonable amount of time (unless you are part of a large team) is to specialize in some pieces. This is just my impression, since I thought of doing the same thing, but didn't see any way to get very far. (I just spent some weekends getting the data from the Federal Reserve and wrapping the API for the economics database (FRED) of the Federal Reserve Bank of St. Louis; the boring storage backend is zipped csv files.) Josef Thank you, --V ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I'm also interested in what you might have in mind. For econometric models, the place to start (and perhaps contribute) would definitely be statsmodels. For other models it depends. Over the next year or so I plan to be working as often as I can on implementing time series models either inside statsmodels (with a pandas interface so you can carry around metadata) or inside pandas itself (which is not necessarily intended for financial applications, but that's what I've used it for). I'm interested in both standard econometric models (see e.g. Lütkepohl's 2006 time series book) and Bayesian time series models. - Wes ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy installation question (IronPython)
On Tue, Jul 13, 2010 at 8:26 AM, Sebastian Haase seb.ha...@gmail.com wrote: On Tue, Jul 13, 2010 at 2:20 PM, William Johnston willi...@tenbase2.com wrote: Hello, I simply installed numpy in my Python26 installation, and then copied the numpy directory to the site-packages folder of my IronPython installation. Did I miss any installation steps in doing so? The multiarray module could not be found when using IronPython. Hi William, why do you think that numpy works in IronPython? I thought most Python modules work only with standard (C)Python; NumPy depends heavily on C implementations for most of its functionality. Regards, Sebastian Haase William-- I think you will want to look here to be able to use NumPy in IronPython for the time being: http://code.google.com/p/ironclad/
Re: [Numpy-discussion] Memory leak found in ndarray (I think)?
On Mon, Jul 12, 2010 at 6:44 PM, Nathaniel Peterson nathanielpeterso...@gmail.com wrote: Wes McKinney wesmck...@gmail.com writes: Did you mean to post a different link? That's the ticket I just created :) How silly of me! I meant http://projects.scipy.org/numpy/ticket/1427 Looks to be caused by the same bug. I'd suggest this should get bumped up in priority for a fix. I am going to have a poke around the C codebase, but I don't feel competent enough to patch it myself.
[Numpy-discussion] Memory leak found in ndarray (I think)?
This one was quite a bear to track down, starting from the (of course) very high-level observation of "why is my application leaking memory?" I've reproduced it on Windows XP using NumPy 1.3.0 on Python 2.5 and 1.4.1 on Python 2.6 (EPD). Basically it seems that calling .astype(bool) on an ndarray slice with object dtype is leaving a hanging reference count; it should be pretty obvious to see:

from datetime import datetime
import numpy as np
import sys

def foo(verbose=True):
    arr = np.array([datetime.today() for _ in xrange(1000)])
    arr = arr.reshape((500, 2))
    sl = arr[:, 0]
    if verbose:
        print 'Ref ct of index 0: %d' % sys.getrefcount(sl[0])
    for _ in xrange(10):
        foo = sl.astype(bool)
    if verbose:
        print 'Ref ct of index 0: %d' % sys.getrefcount(sl[0])

if __name__ == '__main__':
    foo()
    for i in xrange(100000):  # iteration count garbled in the archive; any large value shows the leak
        if not i % 1000:
            print i
        foo(verbose=False)

On my machine this bleeds about 100 MB of memory that you don't get back-- let me know if I've misinterpreted the results. I'll happily create a ticket on the Trac page. Thanks, Wes
Re: [Numpy-discussion] Memory leak found in ndarray (I think)?
On Mon, Jul 12, 2010 at 2:22 PM, Wes McKinney wesmck...@gmail.com wrote: [snip: test case quoted in full above] Posted in Ticket #1542: http://projects.scipy.org/numpy/ticket/1542
Re: [Numpy-discussion] Memory leak found in ndarray (I think)?
On Mon, Jul 12, 2010 at 3:39 PM, Nathaniel Peterson nathanielpeterso...@gmail.com wrote: This memory leak may be related: http://projects.scipy.org/numpy/ticket/1542 It shows what appears to be a memory leak when calling astype('float') on an array of dtype 'object'. Did you mean to post a different link? That's the ticket I just created :)
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
On Thu, Jul 8, 2010 at 1:35 PM, Rob Speer rsp...@mit.edu wrote: Forgive me if this has already been addressed, but my question is what happens when we have more than one label (not as in a labeled axis but an observation label -- not a tick, because they're not unique!) per, say, row axis, and heterogeneous dtypes. This is really the problem that I would like to see addressed, and from the BoF comments I'm not sure this use case is going to be covered. I'm also not sure I expressed myself clearly enough or understood what's already available. For me, this is the single most common use case, and most of what we are talking about now is just convenient slicing while ignoring some basic and prominent concerns. Please correct me if I'm wrong. I need to play more with the DataArray implementation but haven't had time yet. I often have data that looks like this (not really, but it gives the idea in a general way, I think):

city, month, year, region, precipitation, temperature
Austin, January, 1980, South, 12.1, 65.4
Austin, February, 1980, South, 24.3, 55.4
Austin, March, 1980, South, 3, 69.1
...
Austin, December, 2009, 1, 62.1
Boston, January, 1980, Northeast, 1.5, 19.2
...
Boston, December, 2009, Northeast, 2.1, 23.5
...
Memphis, January, 1980, South, 2.1, 35.6
...
Memphis, December, 2009, South, 1.2, 33.5
...

Your labels are unique if you look at them the right way. Here's how I would represent that in a datarray:
* axis0 = 'city', ['Austin', 'Boston', ...]
* axis1 = 'month', ['January', 'February', ...]
* axis2 = 'year', [1980, 1981, ...]
* axis3 = 'region', ['Northeast', 'South', ...]
* axis4 = 'measurement', ['precipitation', 'temperature']
and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2, axis3, axis4]. Now I realize not everyone wants to represent their tabular data as a big tensor that they index every which way, and I think this is one thing that pandas is for. Oh, and the other problem with the 5-D datarray is that you'd probably want it to be sparse. This is another discussion worth having. I have thought quite a bit about the sparsity problem as well. I took a first crack at a sparse data structure for panel (3D) data called LongPanel, where basically each row has two labels, and you can fairly efficiently convert to the dense (3D) form. It's also capable of constructing dummy variables for a fixed effects regression. There, of course, per Skipper's question, you will nearly always have duplicate labels-- I bet it's something we could generalize. It's also very much related to the group-by procedures we've discussed. I want to eventually replace the labeling stuff in Divisi with datarray, but sparse matrices are largely the point of using Divisi. So how do we make a sparse datarray? One answer would be to have datarray be a wrapper that encapsulates any sufficiently matrix-like type. This is approximately what I did in the now-obsolete Divisi1. Nobody liked the fact that you had to wrap and unwrap your arrays to accomplish anything that we hadn't thought of in writing Divisi. I would not recommend this route. The other option, which is more like Divisi2, would be to provide the functionality of datarray using a mixin. Then a standard dense datarray could inherit from (np.ndarray, Datarray), while a sparse datarray could inherit from (sparse.csr_matrix, Datarray), for example.
-- Rob
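A minimal sketch of the mixin approach Rob describes, with axis names supplied by a mixin so the same machinery could sit on top of either a dense ndarray or a sparse matrix class. The class and method names here are invented; this is not datarray's or Divisi's actual API:

import numpy as np

class LabeledAxesMixin(object):
    # Holds axis names independently of the storage class it is mixed into.
    def set_axis_names(self, names):
        self._axis_names = list(names)
        return self
    def axis_index(self, name):
        return self._axis_names.index(name)

class DenseDataArray(LabeledAxesMixin, np.ndarray):
    pass

# A dense labeled array; a sparse variant could mix the same class into,
# e.g., scipy.sparse.csr_matrix instead of np.ndarray.
arr = np.arange(6).reshape(2, 3).view(DenseDataArray)
arr.set_axis_names(['city', 'measurement'])
print arr.axis_index('measurement')  # -> 1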
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
On Tue, Jul 6, 2010 at 12:56 PM, Keith Goodman kwgood...@gmail.com wrote: On Tue, Jul 6, 2010 at 9:52 AM, Joshua Holbrook josh.holbr...@gmail.com wrote: On Tue, Jul 6, 2010 at 8:42 AM, Skipper Seabold jsseab...@gmail.com wrote: On Tue, Jul 6, 2010 at 12:36 PM, Joshua Holbrook josh.holbr...@gmail.com wrote: I'm kinda-sorta still getting around to building/reading the sphinx docs for datarray. Like, I've gone through them before, but it was more cursory than I'd like. Honestly, I kinda let myself get caught up in trying to automate the process of getting them onto github pages. I have to admit that I didn't 100% understand the reasoning behind not allowing integer ticks (I blame jet lag--it's a nice scapegoat). I believe it originally had to do with what you meant if you typed, say, A[3:london]; did you mean the underlying ndarray index 3, or the outer-level tick 3? I think if you didn't allow integers, then you could simply wrap your 3 in a string: A[3:London]. So it's probably not a deal-breaker, but I would imagine that using (a) separate method(s) for label-based indexing may make allowing integer-datatyped labels possible. Thoughts? Would you mind bottom-posting/posting in-line to make the thread easier to follow? --Josh On Tue, Jul 6, 2010 at 8:23 AM, Keith Goodman kwgood...@gmail.com wrote: On Tue, Jul 6, 2010 at 9:13 AM, Skipper Seabold jsseab...@gmail.com wrote: On Tue, Jul 6, 2010 at 11:55 AM, Keith Goodman kwgood...@gmail.com wrote: On Tue, Jul 6, 2010 at 7:47 AM, Joshua Holbrook josh.holbr...@gmail.com wrote: I really really really want to work on this. I already forked datarray on github and did some research on What Other People Have Done ( http://jesusabdullah.github.com/2010/07/02/datarray.html ). With any luck I'll contribute something actually useful. :) I like the figure! To do label indexing on a larry you need to use lix, so lar.lix[...] FYI, if you didn't see it, there are also usage docs in dataarray/doc that you can build with sphinx that show a lot of the thinking and examples (they spent time looking at pandas and larry). One question that was asked of Wes, that I'd propose to you as well, Keith, is: if DataArray became part of NumPy, do you think you could use it as a base to build larry on top of? This is all very exciting. I did not know that DataArray had ticks, so I never took a close look at it. After reading the sphinx doc, one question I had was how firm is the decision to not allow integer ticks? I use int ticks a lot. I think what Josh said is right. However, we proposed having all of the new labeled-axis access pushed to a .aix (or whatever) method, so as to avoid any confusion, as the original object can be accessed just as an ndarray. I'm not sure where this leaves us vis-a-vis ints as ticks. Skipper Sorry re: posting at-top. I guess habit surpassed observation of community norms for a second there. Whups! My opinion on the matter is that, as a matter of purity, labels should all have the string datatype. That said, I'd imagine that passing an int as an argument would be fine, due to python's loosey-goosey attitude towards datatypes. :) That, or, y'know, str(myint). Ideally (for me), the only requirement for ticks would be hashable and unique along any one axis. So, for example, datetime.date() could be a tick but a list could not be a tick (not hashable).
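Keith's proposed requirement -- ticks need only be hashable and unique along any one axis -- is easy to state in code. A small sketch (the helper name is invented):

def validate_ticks(ticks):
    # Enforce the two proposed requirements: hashable and unique per axis.
    seen = set()
    for t in ticks:
        hash(t)  # raises TypeError for unhashable ticks such as lists
        if t in seen:
            raise ValueError('duplicate tick: %r' % (t,))
        seen.add(t)

import datetime
validate_ticks([datetime.date(2010, 7, 6), 'London', 3])  # fine
# validate_ticks([[1, 2]])  # would raise TypeError: unhashable type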
Gmail really needs to get its act together and enable bottom-posting by default. Definitely an annoyance. There are many issues at play here, so I wanted to give some of my thoughts re: building pandas, larry, etc. on top of DataArray (or whatever it is that makes its way into NumPy); we can put this on the wiki, too:
1. Giving semantic information to axes (not ticks, though). I think this is very useful, but it wouldn't be immediately useful in pandas except perhaps for moving axis names elsewhere (they are currently a part of the data structures and always have the same name). I wouldn't be immediately comfortable, say, making a pandas DataFrame a subclass of DataArray and making them implicitly interoperable. Going back and forth e.g. from DataArray to DataFrame *should* be an easy operation-- you could imagine using DataArray to serialize both pandas and larry objects, for example!
2. Container for axis metadata (Axis object in datarray, Index in pandas, ...). I would be more than happy to offload the ordered-set data structure onto NumPy. In pandas, Index is that container-- it's an ndarray subclass with a handful of methods and a reverse index (e.g. if you have ['d', 'b', 'a', 'c'] you have a dict somewhere with {'d' : 0, 'b' : 1,
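The message breaks off here in the archive, but the idea it describes is straightforward: an ndarray subclass carrying a reverse map from label to integer position. A hedged sketch of that idea only -- not pandas' actual Index implementation:

import numpy as np

class Index(np.ndarray):
    def __new__(cls, labels):
        obj = np.asarray(labels, dtype=object).view(cls)
        # reverse index: label -> integer position
        obj.indexMap = dict((label, i) for i, label in enumerate(labels))
        return obj
    def get_loc(self, label):
        return self.indexMap[label]

idx = Index(['d', 'b', 'a', 'c'])
print idx.get_loc('a')  # -> 2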
Re: [Numpy-discussion] indexing question
On Mon, Jun 21, 2010 at 7:10 PM, Robert Kern robert.k...@gmail.com wrote: On Mon, Jun 21, 2010 at 17:42, Neal Becker ndbeck...@gmail.com wrote: Robert Kern wrote: On Mon, Jun 21, 2010 at 14:01, Neal Becker ndbeck...@gmail.com wrote: Can I find an efficient way to do this? I have a 2d array, A, 80 rows by 880 columns. I have a vector, B, of length 80, with scalar indexes. I assume you mean 880. I want a vector output C where C[i] = A[b[i],i] (i=0,879) C = A[b, np.arange(880)] Thanks! Just what I needed. I wouldn't have guessed this. Do we have a wiki to save useful examples like this? This kind of indexing is documented here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#integer Various examples are on the wiki here: http://www.scipy.org/Cookbook/Indexing -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco And to be completely pedantic one could use ndarray.take to do this about 3-4x faster (but less intuitively): A.take(b * 880 + np.arange(880))
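A quick check that the two approaches agree for the shapes in this thread; note the take variant relies on A being C-contiguous, so that flat index r * 880 + c addresses A[r, c]:

import numpy as np

A = np.random.randn(80, 880)
b = np.random.randint(0, 80, 880)

c1 = A[b, np.arange(880)]              # fancy indexing: C[i] = A[b[i], i]
c2 = A.take(b * 880 + np.arange(880))  # flat take on the C-contiguous layout
assert np.array_equal(c1, c2)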
Re: [Numpy-discussion] Tools / data structures for statistical analysis and related applications
On Fri, Jun 11, 2010 at 9:46 AM, Bruce Southey bsout...@gmail.com wrote: On 06/09/2010 03:40 PM, Wes McKinney wrote: [snip: Wes's original post, reproduced in full further below] If you need pure data storage then all you require is a timeseries-aware, masked, structured ndarray. That will handle times/dates, missing values, and named variables. This is probably the basis of all statistical packages, databases, and spreadsheets. But the real problem is the blas/lapack usage, which prevents anything but a standard ndarray. For storing data sets I can agree that a structured / masked ndarray is sufficient. But I think a lot of people are primarily concerned about data manipulations in memory (which can currently be quite obtuse). If you are referring to scikits.timeseries-- it expects data to be fixed frequency, which is too rigid an assumption for many applications (like mine). The issue that I have with all these packages like tabular, la, and pandas that extend ndarrays is the 'upstream'/'downstream' problem of open source development. The real problem with these extensions of numpy is that while you can have whatever storage you like, you either need to write your own functions or preprocess the storage into an acceptable form. So you have to rely on those extensions being updated with numpy/scipy, since a 'fix' upstream can cause havoc downstream. In theory this could be a problem, but of all packages to depend on in the Python ecosystem, NumPy seems pretty safe. How many API breakages have there been in ndarray in the last few years? Inherently this is a risk of participating in open source. After more than 2 years of running a NumPy/SciPy-based stack in production applications I feel pretty comfortable. And besides, we write unit tests for a reason, right? I subscribe to what others have said elsewhere in the open source community, in that it is very important to get your desired features upstream to the original project source - preferably numpy, but scipy also counts. From my experience developing pandas it's not clear to me what I've done that _should_ make its way upstream into NumPy and / or SciPy. You could imagine some form of high-level statistical data structure making its way into scipy.stats, but I'm not sure. If NumPy could incorporate something like R's NA value without substantially degrading performance, then that would be a boon to the issue of handling missing data (which MaskedArray does do for us-- but at non-trivial performance loss). Data alignment routines, groupby (which is already on the table), and NaN / missing-data-aware moving window functions (mean, median, std, etc.) would be nice general additions as well. Any other ideas? Bruce
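For concreteness, a minimal sketch of the kind of missing-data-aware moving window function mentioned above, built on cumulative sums; the function name is invented and the edge handling is deliberately simple:

import numpy as np

def nan_moving_mean(x, window):
    # Mean over a trailing window, skipping NaNs; windows with no valid
    # observations (and the first window-1 positions) come back as NaN.
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)
    sums = np.cumsum(np.where(valid, x, 0.0))
    counts = np.cumsum(valid.astype(float))
    sums[window:] -= sums[:-window].copy()      # .copy() avoids overlapping
    counts[window:] -= counts[:-window].copy()  # in-place views
    out = np.empty(len(x))
    out.fill(np.nan)
    ok = counts > 0
    out[ok] = sums[ok] / counts[ok]
    out[:window - 1] = np.nan  # incomplete leading windows
    return out

print nan_moving_mean([1., 2., np.nan, 4.], 2)  # -> [nan 1.5 2. 4.]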
[Numpy-discussion] Tools / data structures for statistical analysis and related applications
Dear all, We've been having discussions on the pystatsmodels mailing list recently regarding data structures and other tools for statistics / other related data analysis applications. I believe we're trying to answer a number of different, but related questions: 1. What are the sets of functionality (and use cases) which would be desirable for the scientific (or statistical) Python programmer? Things like groupby (http://projects.scipy.org/numpy/browser/trunk/doc/neps/groupby_additions.rst) fall into this category. 2. Do we really need to build custom data structures (larry, pandas, tabular, etc.) or are structured ndarrays enough? (My conclusion is that we do need to, but others might disagree). If so, how much performance are we willing to trade for functionality? 3. What needs to happen for Python / NumPy / SciPy to really break into the statistical computing field? In other words, could a Python-based stack one day be a competitive alternative to R? These are just some ideas for collecting community input. Of course as we're all working in different problem domains, the needs of users will vary quite a bit across the board. We've started to collect some thoughts, links, etc. on the scipy.org wiki: http://scipy.org/StatisticalDataStructures A lot of what's there already is commentary and comparison on the functionality provided by pandas and la / larry (since Keith and I wrote most of the stuff there). But I think we're trying to identify more generally the things that are lacking in NumPy/SciPy and related libraries for particular applications. At minimum it should be good fodder for the SciPy conferences this year and afterward (I am submitting a paper on this subject based on my experiences). - Wes
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote: On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote: On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus zachary.pin...@yale.edu wrote: I guess it's as fast as I'm going to get. I don't really see any other way. BTW, the lat/lons are integers. You could (in c or cython) try a brain-dead hashtable with no collision detection:

for lat, long, data in dataset:
    bin = (lat ^ long) % num_bins
    hashtable[bin] = update_incremental_mean(hashtable[bin], data)

you'll of course want to do some experiments to see if your data are sufficiently sparse and/or you can afford a large enough hashtable array that you won't get spurious hash collisions. Adding error-checking to ensure that there are no collisions would be pretty trivial (just keep a table of the lat/long for each hash value, which you'll need anyway, and check that different lat/long pairs don't get assigned the same bin). Zach -Mathew On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus zachary.pin...@yale.edu wrote: Hi, can anyone think of a clever (non-looping) solution to the following? I have a list of latitudes, a list of longitudes, and a list of data values. All lists are the same length. I want to compute an average of data values for each lat/lon pair, e.g. if (lat[1001], lon[1001]) == (lat[2001], lon[2001]) then data[1001] = (data[1001] + data[2001])/2. Looping is going to take way too long. As a start, are the equal lat/lon pairs exactly equal (i.e. either not floating-point, or floats that will always compare equal, that is, the floating-point bit-patterns will be guaranteed to be identical) or approximately equal to float tolerance? If you're in the approx-equal case, then look at the KD-tree in scipy for doing near-neighbors queries. If you're in the exact-equal case, you could consider hashing the lat/lon pairs or something. At least then the looping is O(N) and not O(N^2):

import collections
grouped = collections.defaultdict(list)
for lt, ln, da in zip(lat, lon, data):
    grouped[(lt, ln)].append(da)
averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

Is that fast enough? If the lat/lon can be converted to a 1d label as Wes suggested, then in a similar timing exercise ndimage was the fastest: http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html And as you said your lats and lons are integers, you could simply do ll = lat*1000 + lon to get unique 'hashes' or '1d labels' for your lat/lon pairs, as a lat or lon will never exceed 360 (degrees). After that, either use the ndimage approach, or you could use histogramming with weighting by data values and divide by a histogram without weighting, or just loop. Vincent (this was for python 2.4; also, later I found np.bincount, which requires that the labels are consecutive integers, but is as fast as ndimage) I don't know how it would compare to the new suggestions. Josef Zach I was curious about how fast ndimage was for this operation, so here's the complete function:
import collections
import numpy as np
import scipy.ndimage as ndi

N = 100000  # sample size garbled to 1 in the archive; any large value illustrates the point
lat = np.random.randint(0, 360, N)
lon = np.random.randint(0, 360, N)
data = np.random.randn(N)

def group_mean(lat, lon, data):
    indexer = np.lexsort((lon, lat))
    lat = lat.take(indexer)
    lon = lon.take(indexer)
    sorted_data = data.take(indexer)
    keys = 1000 * lat + lon
    unique_keys = np.unique(keys)
    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
    decoder = keys.searchsorted(unique_keys)
    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

Appears to be about 13x faster (and could be made faster still) than the naive version on my machine:

def group_mean_naive(lat, lon, data):
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
        grouped[(lt, ln)].append(da)
    averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())
    return averaged

I had to get the latest scipy trunk to not get an error from ndimage.mean.
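For comparison, a hedged sketch of the np.bincount route Josef mentions: bincount wants consecutive non-negative integer labels, which np.unique(..., return_inverse=True) provides:

import numpy as np

def group_mean_bincount(lat, lon, data):
    keys = 1000 * lat + lon
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    sums = np.bincount(inverse, weights=data)   # per-group sums
    counts = np.bincount(inverse)               # per-group sizes
    means = sums / counts
    # decode the packed keys back into (lat, lon) pairs
    return dict(zip(zip(unique_keys // 1000, unique_keys % 1000), means))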
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates mat.yea...@gmail.com wrote: thanks. I am also getting an error in ndi.mean -Mathew [snip: remainder of the thread quoted in full above]
Were you getting the error RuntimeError: data type not supported?
Re: [Numpy-discussion] 2D binning
On Wed, Jun 2, 2010 at 2:26 PM, josef.p...@gmail.com wrote: On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates mat.yea...@gmail.com wrote: I'm on Windows, using a precompiled binary. I never built numpy/scipy on Windows. ndimage's measurements module has recently been rewritten. ndimage is very fast, but (the old version) has insufficient type checking and may crash on wrong inputs. I managed to work with it in the past on Windows. Maybe you could try to check the dtypes of the arguments for ndi.mean (preferably in an interpreter session where you don't mind if it crashes). I don't remember the type restrictions, but there are/were several tickets for it. Josef [snip: remainder of the thread quoted in full above]
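A hedged sketch of Josef's suggestion -- print the dtypes actually being passed and cast to dtypes the old ndimage code is known to be comfortable with; the wrapper name is invented:

import numpy as np
import scipy.ndimage as ndi

def checked_ndi_mean(data, labels, index):
    data = np.asarray(data)
    labels = np.asarray(labels)
    # See what ndi.mean is really being fed; the old C code did little
    # type checking and could crash on unexpected dtypes.
    print 'data dtype: %s, labels dtype: %s' % (data.dtype, labels.dtype)
    data = data.astype(np.float64)   # cast to common, well-supported dtypes
    labels = labels.astype(np.int_)
    return ndi.mean(data, labels=labels, index=index)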