Re: [Numpy-discussion] Dates and times and Datetime64 (again)
Sankarshan Mudkavi <smudkavi at uwaterloo.ca> writes:

> Hey all,
>
> It's been a while since the last datetime and timezones discussion thread
> was visited (linked below):
> http://thread.gmane.org/gmane.comp.python.numeric.general/53805
>
> It looks like the best approach to follow is the UTC-only approach in the
> linked thread, with an optional flag to indicate the timezone (to avoid
> confusing applications where they don't expect any timezone info). Since
> this is slightly more useful than having just a naive datetime64 package,
> and would be open to extension if required, it's probably the best way to
> start improving the datetime64 library.
>
> <snip>
>
> I would like to start writing a NEP for this, followed by implementation.
> However, I'm not sure what the format etc. is - could someone direct me to
> a page where this information is provided?
>
> Please let me know if there are any ideas, comments etc.
>
> Cheers,
> Sankarshan

See: http://article.gmane.org/gmane.comp.python.numeric.general/55191

You could use a current NEP as a template:
https://github.com/numpy/numpy/tree/master/doc/neps

I'm a huge +100 on the simplest UTC fix. As is, using numpy datetimes is
likely to silently give incorrect results - something I've already seen
several times in end-user data analysis code.

Concrete example:

In [16]: dates = pd.date_range('01-Apr-2014', '04-Apr-2014', freq='H')[:-1]
    ...: values = np.array([1, 2, 3]).repeat(24)
    ...: records = zip(map(str, dates), values)
    ...: pd.TimeSeries(values, dates).groupby(lambda d: d.date()).mean()
Out[16]:
2014-04-01    1
2014-04-02    2
2014-04-03    3
dtype: int32

In [17]: df = pd.DataFrame(np.array(records, dtype=[('dates', 'M8[h]'), ('values', float)]))
    ...: df.set_index('dates', inplace=True)
    ...: df.groupby(lambda d: d.date()).mean()
Out[17]:
              values
2014-03-31  1.000000
2014-04-01  1.041667
2014-04-02  2.041667
2014-04-03  3.000000

[4 rows x 1 columns]

Try it in your timezone and see what you get!
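The discipline being argued for here can be shown without pandas or numpy at all: Python's own datetime refuses to guess a timezone, and the "assume UTC" convention just means attaching UTC explicitly at the parsing boundary. A minimal stdlib-only sketch (the strings and format are illustrative, not from the thread):

```python
from datetime import datetime, timezone

# Python's datetime refuses to guess: tzinfo stays None unless you supply one.
naive = datetime.strptime("2014-04-01 12:00", "%Y-%m-%d %H:%M")
assert naive.tzinfo is None

# The "assume UTC on input" convention: attach UTC explicitly at the parsing
# boundary, so downstream results don't depend on the machine's local zone.
aware = naive.replace(tzinfo=timezone.utc)
print(aware.isoformat())  # 2014-04-01T12:00:00+00:00
print(aware.timestamp())  # seconds since epoch - machine-independent
```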
-Dave

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Dates and times and Datetime64 (again)
Dave, your example is not a problem with numpy per se, rather that the
default generation is in the local timezone (the same as what Python's
datetime does). If you localize to UTC you get the results that you expect.

In [49]: dates = pd.date_range('01-Apr-2014', '04-Apr-2014', freq='H')[:-1]

In [50]: pd.TimeSeries(values, dates.tz_localize('UTC')).groupby(lambda d: d.date()).mean()
Out[50]:
2014-04-01    1
2014-04-02    2
2014-04-03    3
dtype: int64

In [51]: records = zip(map(str, dates.tz_localize('UTC')), values)

In [52]: df = pd.DataFrame(np.array(records, dtype=[('dates', 'M8[h]'), ('values', float)]))

In [53]: df.set_index('dates').groupby(lambda x: x.date()).mean()
Out[53]:
            values
2014-04-01       1
2014-04-02       2
2014-04-03       3

[3 rows x 1 columns]

On Wed, Mar 19, 2014 at 5:21 AM, Dave Hirschfeld <novi...@gmail.com> wrote:

> <snip: quotes Dave's message above in full>
Re: [Numpy-discussion] [RFC] should we argue for a matrix power operator, @@?
On 16/03/2014 01:31, josef.p...@gmail.com wrote:
> On Sat, Mar 15, 2014 at 8:47 PM, Warren Weckesser
> <warren.weckes...@gmail.com> wrote:
>> On Sat, Mar 15, 2014 at 8:38 PM, josef.p...@gmail.com wrote:
>>> I think I wouldn't use anything like @@ often enough to remember its
>>> meaning. I'd rather see English names for anything that is not **very**
>>> common. I find A @@ -1 pretty ugly compared to inv(A).
>>> A @@ -0.5 might be nice (do we have matrix_sqrt?)
>>
>> scipy.linalg.sqrtm:
>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.sqrtm.html
>
> Maybe a good example: I could never figure that one out.
>
>     M = sqrtm(A)
>     A = M @ M
>
> but what we use in stats is
>
>     A = R.T @ R
>
> (eigenvectors dot diag(sqrt of eigenvalues)) - which sqrt is A @@ 0.5?
>
> Josef

Agreed - in general, the matrix square root isn't a well-defined quantity.
For some uses, the Cholesky decomposition is what you want; for some
others it's the matrix with the same eigenvectors but the square root of
the eigenvalues, etc. etc. As an important aside, it would be good if the
docs addressed this.

Yours,
Andrew
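To make the ambiguity concrete: for one and the same 2x2 SPD matrix, the Cholesky-style factor and the sqrtm-style symmetric root are different matrices, even though both "square" back to A. A stdlib-only sketch (all helper names are mine, not an existing API; `sym_sqrt2` uses the fact that any f(A) of a 2x2 symmetric A with distinct eigenvalues is p*A + q*I):

```python
import math

def cholesky2(a):
    """Lower-triangular L with a = L @ L.T (so R = L.T gives a = R.T @ R)."""
    (a11, a12), (_, a22) = a
    l11 = math.sqrt(a11)
    l21 = a12 / l11
    return [[l11, 0.0], [l21, math.sqrt(a22 - l21 * l21)]]

def sym_sqrt2(a):
    """Symmetric S with a = S @ S (the sqrtm flavour), via eigenvalues."""
    (a11, a12), (_, a22) = a
    tr, det = a11 + a22, a11 * a22 - a12 * a12
    gap = math.sqrt(tr * tr - 4 * det)
    lam1, lam2 = (tr + gap) / 2, (tr - gap) / 2
    p = (math.sqrt(lam1) - math.sqrt(lam2)) / (lam1 - lam2)
    q = math.sqrt(lam1) - p * lam1
    return [[p * a11 + q, p * a12], [p * a12, p * a22 + q]]

def matmul2(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky2(A)   # Cholesky flavour: A = L @ L.T
S = sym_sqrt2(A)   # sqrtm flavour:    A = S @ S
# Both reproduce A, yet the two "square roots" are different matrices:
assert L[1][0] != S[0][1]
```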
Re: [Numpy-discussion] Dates and times and Datetime64 (again)
Jeff Reback <jeffreback at gmail.com> writes:

> Dave, your example is not a problem with numpy per se, rather that the
> default generation is in local timezone (same as what python datetime
> does). If you localize to UTC you get the results that you expect.

The problem is that the default datetime generation in *numpy* is in local
time. Note that this *is not* the case in Python - it doesn't try to guess
the timezone info based on where in the world you run the code; if it's
not provided, it sets it to None.

In [7]: pd.datetime?
Type:        type
String Form: <type 'datetime.datetime'>
Docstring:
datetime(year, month, day[, hour[, minute[, second[, microsecond[, tzinfo]]]]])

The year, month and day arguments are required. tzinfo may be None, or an
instance of a tzinfo subclass. The remaining arguments may be ints or longs.

In [8]: pd.datetime(2000, 1, 1).tzinfo is None
Out[8]: True

This may be the best solution, but as others have pointed out it is more
difficult to implement and may have other issues. I don't want to wait for
the best solution - "assume UTC on input/output if not specified" will
solve the problem, and this desperately needs to be fixed because it's
completely broken as is, IMHO.

> If you localize to UTC you get the results that you expect.

That's the whole point - *numpy* needs to localize to UTC, not to whatever
timezone you happen to be in when running the code.

In a real-world data analysis problem you don't start with the data in a
DataFrame or a numpy array - it comes from the web, a csv, Excel, a
database, and you want to convert it to a DataFrame or numpy array. So
what you have, from whatever source, is a list of tuples of strings, and
you want to convert them into a typed array. Obviously you can't localize
a string - you have to convert it to a date first, and if you do that with
numpy the date you have is wrong.
In [108]: dst = np.array(['2014-03-30 00:00', '2014-03-30 01:00', '2014-03-30 02:00'], dtype='M8[h]')
     ...: dst
Out[108]: array(['2014-03-30T00+0000', '2014-03-30T00+0000', '2014-03-30T02+0100'], dtype='datetime64[h]')

In [109]: dst.tolist()
Out[109]:
[datetime.datetime(2014, 3, 30, 0, 0),
 datetime.datetime(2014, 3, 30, 0, 0),
 datetime.datetime(2014, 3, 30, 1, 0)]

AFAICS there's no way to get the original dates back once they've passed
through numpy's parser!?

-Dave
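The loss in Out[108] is genuine, not cosmetic: the offsets shown match a machine in the UK, where 2014-03-30 is the spring-forward date, so '00:00' (GMT, +0000) and the nonexistent '01:00' (read as BST, +0100) both land on 00:00 UTC. A stdlib-only toy model of that collision (the hard-coded offset rule below is a stand-in for the real Europe/London rules, valid for this one date only):

```python
from datetime import datetime, timedelta

SPRING_FORWARD = datetime(2014, 3, 30, 1, 0)  # UK clocks jump 01:00 -> 02:00

def to_utc(wall: str) -> datetime:
    """Toy 'local wall time -> UTC' for the UK on 30 Mar 2014 only."""
    t = datetime.strptime(wall, "%Y-%m-%d %H:%M")
    if t < SPRING_FORWARD:
        return t                    # GMT: UTC+0
    return t - timedelta(hours=1)   # BST: UTC+1

# Two *different* wall-clock strings collapse onto the same UTC instant,
# so the original strings cannot be recovered - exactly Dave's complaint.
a = to_utc("2014-03-30 00:00")
b = to_utc("2014-03-30 01:00")
assert a == b == datetime(2014, 3, 30, 0, 0)
```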
Re: [Numpy-discussion] [help needed] associativity and precedence of '@'
On Tue, Mar 18, 2014 at 9:14 AM, Robert Kern <robert.k...@gmail.com> wrote:
> On Tue, Mar 18, 2014 at 12:54 AM, Nathaniel Smith <n...@pobox.com> wrote:
>> On Sat, Mar 15, 2014 at 6:28 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>> Mathematica: instead of having an associativity, a @ b @ c gets
>>> converted into mdot([a, b, c])
>>
>> So, I've been thinking about this (thanks to @rfateman for pointing it
>> out), and wondering if Mathematica's approach is worth following up
>> more. (It would need to make it past python-dev, of course, but worst
>> case is just that they say no and we're back where we are now, so we
>> might as well think it through.)
>
> I predict with near-certainty that this will be rejected,

I guess that's what everyone thought about @ too? ;-)

> but that doesn't prevent it from derailing the discussion. This proposal
> is unlike anything else in Python. Chained comparisons are *not* similar
> to this proposal. The chaining only happens at the syntax level, not the
> semantics. `a < b < c` gets compiled down to `a.__lt__(b) and
> b.__lt__(c)`, not `do_comparison([a, b, c], [lt, lt])`.

Yes, the syntax is the same as chained comparisons, and the dispatch is a
generalization of regular operators. It is unusual; OTOH, @ is unusual in
that no other operator in Python has the property that evaluating in the
wrong order can cost you seconds of time and gigabytes of memory.

> Perhaps. We have approval for a binary @ operator. Take the win.

We have approval, and we have a request: that we figure out how @ should
work in detail to be most useful to us. Maybe that's this proposal; maybe
not. Ultimately rejected-or-not-rejected comes down to how strong the
arguments for something are. And while we can make some guesses about
that, it's impossible to know how strong an argument will be until one
sits down and works it out. So I still would like to hear what people
think, even if it just ends in the conclusion that it's a terrible idea
;-).
As for arguments against the grouping semantics, I did think of another
case where @ is not associative, though it's pretty weird:

In [9]: a = np.arange(16, dtype=np.int8).reshape((4, 4))

In [10]: np.dot(a, np.dot(a, a.astype(float)))
Out[10]:
array([[  1680.,   1940.,   2200.,   2460.],
       [  4880.,   5620.,   6360.,   7100.],
       [  8080.,   9300.,  10520.,  11740.],
       [ 11280.,  12980.,  14680.,  16380.]])

In [12]: np.dot(np.dot(a, a), a.astype(float))
Out[12]:
array([[ 1680.,  1940.,  2200.,  2460.],
       [-1264., -1548., -1832., -2116.],
       [ 1936.,  2132.,  2328.,  2524.],
       [-1008., -1100., -1192., -1284.]])

(What's happening is that we have int8 @ int8 @ float, so (int8 @ int8) @
float has overflows in the first computation, but int8 @ (int8 @ float)
does all the computations in float, with no overflows.)

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
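The overflow mechanism here can be reduced to scalars without numpy at all: simulate int8's wrap-around and the difference between (int8 * int8) * float and int8 * (int8 * float) appears immediately. (`wrap8` below is a hand-rolled stand-in for signed 8-bit arithmetic, not a numpy call.)

```python
def wrap8(x: int) -> int:
    """Wrap an integer into the int8 range [-128, 127], like int8 overflow."""
    return (x + 128) % 256 - 128

a, b = 100, 100          # both fit in int8
f = 1.0                  # the float operand

left = wrap8(a * b) * f  # (int8 * int8) * float: the product overflows first
right = a * (b * f)      # int8 * (int8 * float): promoted to float, no overflow

print(left, right)       # 16.0 10000.0 - same operands, different grouping
```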
Re: [Numpy-discussion] [help needed] associativity and precedence of '@'
On Wed, Mar 19, 2014 at 2:24 PM, Nathaniel Smith <n...@pobox.com> wrote:

> <snip>

What happens if you have 5 @ in a row? My head hurts if I have to think
about what would actually be going on. And don't forget, the sparse matrix
is stuck in the middle.

But I would be happy to have an optimizing multi_dot or chain_dot function
when it feels safe enough.

> As for arguments against the grouping semantics, I did think of another
> case where @ is not associative, though it's pretty weird:
>
> <snip: the int8 overflow example above>
>
> (What's happening is that we have int8 @ int8 @ float, so (int8 @ int8)
> @ float has overflows in the first computation, but int8 @ (int8 @ float)
> does all the computations in float, with no overflows.)

That's similar to my example before that mixes in some scalar *. I thought
of it as an argument for same-left.

Josef

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
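The "optimizing multi_dot" Josef asks for is the classic matrix-chain-ordering problem, and numpy later grew exactly such a helper (np.linalg.multi_dot). A minimal stdlib-only DP sketch of the cost side of it (function name is mine):

```python
from functools import lru_cache

def chain_cost(dims):
    """Minimum scalar-multiplication cost for multiplying matrices whose
    shapes are dims[0]xdims[1], dims[1]xdims[2], ... (matrix-chain DP)."""
    @lru_cache(maxsize=None)
    def best(i, j):  # cheapest way to multiply matrices i..j inclusive
        if i == j:
            return 0
        return min(best(i, k) + best(k + 1, j)
                   + dims[i] * dims[k + 1] * dims[j + 1]
                   for k in range(i, j))
    return best(0, len(dims) - 2)

# Mat(10x1000) @ Mat(1000x1000) @ vec(1000x1): grouping matters enormously.
print(chain_cost((10, 1000, 1000, 1)))  # 1010000 (vs 10010000 left-to-right)
```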
Re: [Numpy-discussion] [help needed] associativity and precedence of '@'
On Sat, Mar 15, 2014 at 3:41 AM, Nathaniel Smith <n...@pobox.com> wrote:
> I think we need to know something about how often the Mat @ Mat @ vec
> type cases arise in practice. How often do non-scalar * and np.dot show
> up in the same expression? How often does it look like a * np.dot(b, c),
> and how often does it look like np.dot(a * b, c)? How often do we see
> expressions like np.dot(np.dot(a, b), c), and how often do we see
> expressions like np.dot(a, np.dot(b, c))? This would really help guide
> the debate. I don't have this data, and I'm not sure the best way to get
> it. A super-fancy approach would be to write a little script that uses
> the 'ast' module to count things automatically. A less fancy approach
> would be to just pick some code you've written, or a well-known package,
> grep through for calls to 'dot', and make notes on what you see. (An
> advantage of the less-fancy approach is that as a human you might be
> able to tell the difference between scalar and non-scalar *, or check
> whether it actually matters what order the 'dot' calls are done in.)

Okay, I wrote a little script [1] to scan Python source files looking for
things like 'dot(a, dot(b, c))' or 'dot(dot(a, b), c)', or the
ndarray.dot method equivalents. So what we get out is:

- a count of how many 'dot' calls there are
- a count of how often we see left-associative nestings: dot(dot(a, b), c)
- a count of how often we see right-associative nestings: dot(a, dot(b, c))

Running it on a bunch of projects, I get:

| project      | dots | left | right | right/left |
|--------------+------+------+-------+------------|
| scipy        |  796 |   53 |    27 |       0.51 |
| nipy         |  275 |    3 |    19 |       6.33 |
| scikit-learn |  472 |   11 |    10 |       0.91 |
| statsmodels  |  803 |   46 |    38 |       0.83 |
| astropy      |   17 |    0 |     0 |        nan |
| scikit-image |   15 |    1 |     0 |       0.00 |
|--------------+------+------+-------+------------|
| total        | 2378 |  114 |    94 |       0.82 |

(Any other projects worth trying? This is something that could vary a lot
between different projects, so it seems more important to get lots of
projects here than to get a few giant projects. Or if anyone wants to run
the script on their own private code, please do! Running it on my personal
pile of random junk finds 3 left-associative and 1 right.)

Two flaws with this approach:

1) Probably some proportion of those nested dot calls are places where it
doesn't actually matter which evaluation order one uses - dot() forces you
to pick one, so you have to. If people prefer to, say, use the left form
in cases where it doesn't matter, then this could bias the left-vs-right
results - hard to say. (Somewhere in this thread it was suggested that the
use of the .dot method could create such a bias, because a.dot(b).dot(c)
is more natural than a.dot(b.dot(c)), but only something like 6% of the
dot calls here use the method form, so this probably doesn't matter.)
OTOH, this also means that the total frequency of @ expressions where
associativity even matters at all is probably *over*-estimated by the
above.

2) This approach misses cases where the cumbersomeness of dot has caused
people to introduce temporary variables, like 'foo = np.dot(a, b); bar =
np.dot(foo, c)'. So this causes us to *under*-estimate how often
associativity matters. I did read through the 'dot' uses in scikit-learn
and nipy, though, and only caught a handful of such cases, so I doubt it
changes anything much.

-n

[1] https://gist.github.com/njsmith/9157645#file-grep-dot-dot-py

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
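For anyone who wants to reproduce the table on their own code without digging up the gist, the core of the idea fits in a few lines of stdlib `ast`. This is a stripped-down reimplementation (function names are mine, and it only handles the `dot(...)` call form, not the `.dot` method form Nathaniel's script also counts):

```python
import ast

def count_dot_nestings(source: str):
    """Count dot(dot(a, b), c) (left) vs dot(a, dot(b, c)) (right) nestings."""
    def is_dot(node):
        return (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "dot")

    left = right = 0
    for node in ast.walk(ast.parse(source)):
        if is_dot(node) and len(node.args) == 2:
            if is_dot(node.args[0]):
                left += 1
            if is_dot(node.args[1]):
                right += 1
    return left, right

print(count_dot_nestings("x = dot(dot(a, b), c) + dot(a, dot(b, c))"))  # (1, 1)
```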
Re: [Numpy-discussion] Dates and times and Datetime64 (again)
On Mar 19, 2014, at 10:01 AM, Dave Hirschfeld <novi...@gmail.com> wrote:

> <snip>
>
> AFAICS there's no way to get the original dates back once they've passed
> through numpy's parser!?
>
> -Dave

Hi all,

I've written a rather rudimentary NEP (lacking in technical details, which
I will hopefully add after some further discussion and receiving
clarification/help on this thread).

Please let me know how to proceed and what you think should be added to
the current proposal (attached to this mail). Here is a rendered version
of the same:
https://github.com/Sankarshan-Mudkavi/numpy/blob/Enhance-datetime64/doc/neps/datetime-improvement-proposal.rst

Cheers,
Sankarshan

-- 
Sankarshan Mudkavi
Undergraduate in Physics, University of Waterloo
www.smudkavi.com
Re: [Numpy-discussion] [help needed] associativity and precedence of '@'
On Mar 15, 2014, at 4:41 AM, Nathaniel Smith wrote:
> OPTION 1 FOR @: ... same-left
> OPTION 2 FOR @: ... weak-right
> OPTION 3 FOR @: ... tight-right
> (In addition to more unusual forms, like 'grouping'.)

There's another option, which is to refuse the temptation to guess, and
not allow X @ Y @ Z or mixing with any other operators. After all, several
have pointed out that it should be in parentheses anyway, in order to
avoid likely confusion. There's even a bit of precedent for something like
this in Python:

>>> f(1, 2 for i in range(10))
  File "<stdin>", line 1
SyntaxError: Generator expression must be parenthesized if not sole argument

I haven't seen this non-associative option come up in the discussion. To
be frank though, I don't think this is a good idea, but Nathaniel wrote
"In principle the other 2 possible options are ...", so I wanted to
mention this for completeness.

My preference is for same-left. I rarely work with numpy, and it's more
likely that I'll see '@' used in a non-numpy context. That is, people in
general will see @ as a sort of free-for-all operator, to use and abuse as
they wish. [1] (For example, Pyparsing has a lot of operator overloads to
help make a grammar definition, and they make good sense in that context,
but '<<' for recursive definitions is perhaps past the edge.)

Someone looking at an @, without any intuition about the precedence or
associativity of matrix operations in a mathematical package, will have to
figure things out from the documentation or (more likely) experimentation.
If and when that happens, then same-left is the easiest to explain and
remember, because it's just "@ acts like * and /".

Cheers,
Andrew
da...@dalkescientific.com

[1] I came up with two possible ways people might (ab)use it:

1) Since @ is server-like, a service resolver:

    service = XMLRPCServer @ "http://localhost:1234/endpoint"

There's no real need for this, since there are other equally good ways to
structure this sort of call. But someone creative might come up with a
good example of using '@' to mean some sort of routing. Interestingly,
that creative person might prefer right-associative, to support the
'natural'

    (janet @ moria @ uunet @ uucpserver).send(message)

rather than the inverted:

    (uucpserver @ uunet @ moria @ janet).send(message)

This would likely fall under the definition of cute and ignorable.

2) @ in XPath indicates an attribute. An XML tree API might support
something like:

    tree = load_xml_tree(...)
    for node in tree.select("//item[@price > 2*@discount]"):
        print node @ "price", node @ "discount"

That might even be a reasonable shorthand compared to, say, etree's
node.attrib["price"]. XML doesn't allow nodes as attributes, so that
provides no guidance as to what node @ "price" @ 1950 might mean.
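Postscript with the benefit of hindsight: PEP 465 was accepted with exactly the "same-left" rule Andrew favours - in Python 3.5+, @ groups left and sits at the same precedence as * and /. That is easy to verify by instrumenting __matmul__ (the toy class below is mine, purely for demonstration):

```python
class T:
    """Tiny operand that records the order in which @ combines things."""
    def __init__(self, name):
        self.name = name
    def __matmul__(self, other):
        return T(f"({self.name} @ {other.name})")

result = T("a") @ T("b") @ T("c")
print(result.name)  # ((a @ b) @ c) - left associative, same as * and /
```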