[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2021-07-02 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

I was contacted by someone interested in this so I've posted the last version 
above as a GitHub gist under the MIT license:
https://gist.github.com/oscarbenjamin/4c1b977181f34414a425f68589e895d1

--




[issue43602] Include Decimal's in numbers.Real

2021-04-16 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

I've never found numbers.Real/Complex to be useful. The purpose of the ABCs 
should be to let you write code that works for instances of any subclass, but 
in practice writing good floating point code requires knowing something about 
the type, e.g. its base, precision, maximum exponent and so on. Also, many 
implementations like Decimal have contexts, rounding control etc. that need to 
be used, and the ABC gives no way to know about that or to do anything with it.

The main thing that is useful about the Rational/Integer ABCs is that they 
define the numerator and denominator attributes, which make different 
implementations interoperable by providing exact conversion. If Real were 
presumed to represent some kind of floating point type then an analogous 
property/method would be something that can deconstruct the object in an exact 
way, like:

mantissa, base, exponent = deconstruct(real)

You would also need a way to handle nan, inf etc. Note that as_integer_ratio() 
is not suitable because it could generate enormous integers unnecessarily, e.g. 
Decimal('1E+1000000').as_integer_ratio() would build a million-digit integer 
just to represent a one-digit mantissa and an exponent.
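
A minimal sketch of what such an exact deconstruction could look like for 
float and Decimal (deconstruct and its return convention are the hypothetical 
ones from above, not an existing API; nan/inf and Decimal specials would need 
separate handling as noted):

from decimal import Decimal
import math

def deconstruct(real):
    # Hypothetical: return (mantissa, base, exponent) such that
    # real == mantissa * base**exponent exactly.
    if isinstance(real, float):
        m, e = math.frexp(real)            # real == m * 2**e, 0.5 <= |m| < 1
        return int(m * 2**53), 2, e - 53   # scale m up to an exact integer
    elif isinstance(real, Decimal):
        sign, digits, exp = real.as_tuple()
        mantissa = int(''.join(map(str, digits)))
        return (-mantissa if sign else mantissa), 10, exp
    raise TypeError('cannot deconstruct %r' % (real,))

Unlike as_integer_ratio(), deconstruct(Decimal('1E+1000000')) would just 
return (1, 10, 1000000).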

Instead the Real ABC only defines conversion to float. That's useful in the 
sense that you can write code for float and pass in some other floating point 
type and have everything reduce to float. You don't need an ABC for that though 
because __float__ does everything. In practice most alternate "real" number 
implementations exist precisely to be better than float in some way by either 
having greater range/precision or a different base but anything written for the 
Real ABC is essentially reduced to float as a lowest common (inexact) 
denominator.

--




[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-19 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

Yeah, I guess it's a YAGNI.

Thanks Raymond and Tim for looking at this!

--




[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-18 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

> Please don't get personal.

Sorry, that didn't come across with the intended tone :)

I agree that this could be out of scope for the random module but I wanted to 
make sure the reasons were considered.

Reading between the lines I get the impression that you'd both be happier with 
it if the algorithm were exact (rather than using floating point). That would 
at least give the possibility for it to be used internally by e.g. 
sample/choice if there was a benefit for some cases.

--




[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

> At its heart, this a CPython optimization to take advantage of list() being 
> slower than a handful of islice() calls.

This comment suggests that you have missed the general motivation for 
reservoir sampling. Of course the stdlib cannot satisfy all use cases, so this 
can be out of scope as a feature. It is not the case, though, that this is a 
CPython optimisation.

The idea of reservoir sampling is that you want to sample from an iterator, you 
only get one chance to iterate over it, and you don't know a priori how many 
items it will yield. The classic example of that situation is reading from a 
text file but in general it maps neatly onto Python's concept of iterators. The 
motivation for generators/iterators in Python is that there are many situations 
where it is better to avoid building a concrete in-memory data structure and it 
can be possible to avoid doing so with appropriately modified algorithms (such 
as this one). 

The core use case for this feature is not sampling from an in-memory data 
structure but rather sampling from an expensive generator or an iterator over a 
file/database. The premise is that it is undesirable or perhaps impossible to 
build a list out of the items of the iterable. In those contexts the 
comparative properties of sample/choices are to some extent irrelevant because 
those APIs cannot be used, or should be avoided because of their overhead in 
terms of memory or other resources.
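
As a concrete illustration (my own sketch, using the sample_iter function 
from later in this thread; 'huge.log' is a made-up filename), choosing 5 
random lines from a large file takes a single pass and O(k) memory:

with open('huge.log') as f:
    lines = sample_iter(f, 5)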

--




[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Oscar Benjamin

Oscar Benjamin  added the comment:

All good points :)

Here's an implementation with those changes and that shuffles but gives the 
option to preserve order. It also handles the case W=1.0 which can happen at 
the first step with probability 1 - (1 - 2**-53)**k.

Attempting to preserve order makes the storage requirements expected 
O(k*log(k)) rather than deterministic O(k) but note that the log(k) part just 
refers to the values list growing larger with references to None: only k of the 
items from iterable are stored at any time. This can be simplified by removing 
the option to preserve order which would also make it faster in the 
small-iterable case.

There are a few timings below for choosing from a dict vs converting to a list 
and using sample (I don't have a 3.9 build immediately available to use 
choices). Note that these benchmarks are not the primary motivation for 
sample_iter: that is the case where the underlying iterable is much more 
expensive in memory and/or time and where the length is not known ahead of 
time.



from math import exp, log, log1p, floor
from random import random, randrange, shuffle as _shuffle
from itertools import islice


def sample_iter(iterable, k=1, shuffle=True):
    """Choose a sample of k items from iterable

    shuffle=True (default) gives the items in random order
    shuffle=False preserves the original ordering of the items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    irange = range(len(values))
    indices = dict(zip(irange, irange))

    kinv = 1 / k
    W = 1.0
    while True:
        W *= random() ** kinv
        # random() < 1.0 but random() ** kinv might not be
        # W == 1.0 implies "infinite" skips
        if W == 1.0:
            break
        # skip is geometrically distributed with parameter W
        skip = floor( log(random())/log1p(-W) )
        try:
            newval = next(islice(iterator, skip, skip+1))
        except StopIteration:
            break
        # Append new, replace old with dummy, and keep track of order
        remove_index = randrange(k)
        values[indices[remove_index]] = None
        indices[remove_index] = len(values)
        values.append(newval)

    values = [values[indices[i]] for i in irange]

    if shuffle:
        _shuffle(values)

    return values


Timings for a large dict (1,000,000 items):

In [8]: n = 6

In [9]: d = dict(zip(range(10**n), range(10**n)))

In [10]: %timeit sample_iter(d, 10)
16.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit sample(list(d), 10)
26.3 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Timings for a small dict (5 items):

In [14]: d2 = dict(zip(range(5), range(5)))

In [15]: %timeit sample_iter(d2, 2)
14.8 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit sample(list(d2), 2)
6.27 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)


The crossover point for this benchmark is around 10,000 items with k=2. 
Profiling at 10,000 items with k=2 shows that in either case the time is 
dominated by list/next so the time difference is just about how efficiently we 
can iterate vs build the list. For small dicts it is probably possible to get a 
significant factor speed up by removing the no shuffle option and simplifying 
the routine.


> Although why it keeps taking k'th roots remains a mystery to me ;-)

Thinking of sample_iter_old, before doing a swap the uvals in our reservoir 
look like:

  U0 = {u[1], u[2], ... u[k-1], W0}
  W0 = max(U0)

Here u[1] ... u[k-1] are uniform in (0, W0). We find a new u[k] < W0 which we 
swap in while removing W0, and afterwards we have

  U1 = {u[1], u[2], ... u[k-1], u[k]}
  W1 = max(U1)

Given that U1 is k iid uniform variates in (0, W0) we have that

  W1 = W0 * max(random() for _ in range(k)) = W0 * W'

Here W' has cdf x**k and so by the inverse sampling method we can generate it 
as random()**(1/k). That gives the update rule for sample_iter:

  W *= random() ** (1/k)
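
A quick empirical check of that last step (my own sketch, not from the 
issue): the max of k uniform variates and random() ** (1/k) should have the 
same distribution, e.g. the same mean k/(k+1):

from random import random
from statistics import mean

k, n = 5, 100_000
direct = [max(random() for _ in range(k)) for _ in range(n)]
inverse = [random() ** (1 / k) for _ in range(n)]
print(mean(direct), mean(inverse))   # both close to k/(k+1) = 0.8333...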

--


[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-16 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

To be clear I suggest that this could be a separate function from the existing 
sample rather than a replacement or a routine used internally.

The intended use-cases for the separate function are:

1. Select from something where you really do not want to build a list and just 
want/need to use a single pass. For example in selecting a random line from a 
text file it is necessary to read the entire file in any case just to know how 
many lines there are. The method here would mean that you could make the 
selection in a single pass in O(k) memory. The same can apply to many 
situations involving generators etc.

2. Convenient, possibly faster selection from a most-likely small dict/set 
(this was the original context from python-ideas).

The algorithm naturally gives something in between the original order and a 
randomised order. There are two possibilities for changing that:

a. Call random.shuffle or equivalent either to get the original k items in a 
random order or at the end before returning.

b. Preserve the original ordering from the iterable: append all new items and 
use a sentinel for removed items (remove sentinels at the end).

Since randomised order does not come naturally and requires explicit shuffling 
my preference would be to preserve the original order (b.) because there is no 
other way to recover the information lost by shuffling (assuming that only a 
single pass is possible). The user can call shuffle if they want.

To explain what "geometrically distributed" means I need to refer to the 
precursor algorithm from which this is derived. A simple Python version could 
be:


from random import random

def sample_iter_old(iterable, k=1):
    uvals_vals = []
    # Associate a uniform (0, 1) with each element:
    for uval, val in zip(iter(random, None), iterable):
        uvals_vals.append((uval, val))
        uvals_vals.sort()
        uvals_vals = uvals_vals[:k]   # keep the k items with smallest uval
    return [val for uval, val in uvals_vals]


In sample_iter_old each element val of the iterable is associated with a 
uniform (0, 1) variate uval. At each step we keep the k elements having the 
smallest uval variates. This is relatively inefficient because we need to 
generate a uniform variate for each element val of the iterable. Most of the 
time during the algorithm the new val is simply discarded so sample_iter tries 
instead to calculate how many items to discard.

The quantity W in sample_iter is the max of the uvals from sample_iter_old:

    W := max(uval for uval, val in uvals_vals)

A new item from the iterable will be swapped in if its uval is less than W. The 
number of items skipped before finding a uval < W is geometrically distributed 
with parameter W.
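
To spell that out (my own sketch, not from the issue): counting how many 
items fail the uval < W test before one passes is a geometric draw, which 
sample_iter generates directly by inversion rather than by looping:

from math import floor, log, log1p
from random import random

def skips_naive(W):
    # count failures until random() < W
    n = 0
    while random() >= W:
        n += 1
    return n

def skips_direct(W):
    # same distribution from a single uniform variate (sample_iter's formula)
    return floor(log(random()) / log1p(-W))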

--




[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-15 Thread Oscar Benjamin

New submission from Oscar Benjamin :

The random.choice/random.sample functions will only accept a sequence to select 
from. Can there be a function in the random module for selecting from an 
arbitrary iterable?

It is possible to make an efficient function that can make random selections 
from an arbitrary iterable e.g.:


from math import exp, log, floor
from random import random, randrange
from itertools import islice

def sample_iter(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values


https://en.wikipedia.org/wiki/Reservoir_sampling#An_optimal_algorithm


This could be used for random sampling from sets/dicts or also to choose 
something like a random line from a text file. The algorithm needs to fully 
consume the iterable but does so efficiently using islice. In the case of a 
dict this is faster than converting to a list and using random.choice:


In [2]: n = 6

In [3]: d = dict(zip(range(10**n), range(10**n)))

In [4]: %timeit sample_iter(d)
15.5 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit list(d)
26.1 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit sample_iter(d, 2)
15.8 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit sample_iter(d, 20)
17.6 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit sample_iter(d, 100)
19.9 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


This was already discussed on python-ideas:
https://mail.python.org/archives/list/python-id...@python.org/thread/4OZTRD7FLXXZ6R6RU4BME6DYR3AXHOBD/

--
components: Library (Lib)
messages: 373733
nosy: oscarbenjamin
priority: normal
severity: normal
status: open
title: Add a function to get a random sample from an iterable (reservoir sampling)




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2019-01-20 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

Sorry, sent too soon...

> Matlab doesn't even support a weighted mean as far as I can tell. There
> is wmean on the matlab file exchange:
https://stackoverflow.com/a/36464881/9450991

This is a separate function `wmean(data, weights)`. It has to be a
separate function, though, because it's third party code, so the author
couldn't change the main mean function.

R ships with a weighted.mean function but I think for standard
deviation you need third party libs.

It was only a quick survey, but the main impression I get is that providing
an API for this is not that common. The only good-looking API is the
statsmodels one.

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2019-01-20 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

> I would find it very helpful if somebody has time to do a survey of
> other statistics libraries or languages (e.g. numpy, R, Octave, Matlab,
> SAS etc) and see how they handle data with weights.

Numpy has only sporadic support for this. The standard mean function
does not have any way to provide weights but there is an alternative
called average that computes the mean and has an optional weights
argument. I've never heard of average before searching for "numpy
weighted mean" just now. Numpy's API often has bits of old cruft from
where various numerical packages were joined together so I'm not sure
they would recommend their current approach. I don't think there are
any other numpy functions for providing weighted statistics.
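
For concreteness, this is the numpy API being described (my own example, 
assuming numpy is installed; for this data the weighted and unweighted means 
happen to coincide):

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([1, 4, 1, 2])

np.mean(data)                      # 2.5 (mean has no weights parameter)
np.average(data, weights=weights)  # 2.5 == (1*1 + 2*4 + 3*1 + 4*2) / 8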

Statsmodels does provide an API for this as explained here:
https://stackoverflow.com/a/36464881/9450991
Their API is that you create an object with data and weights and can
then call methods/attributes for statistics.

Matlab doesn't even support a weighted mean as far as I can tell. There
is wmean on the matlab file exchange:

>
> - what APIs do they provide?
> - do they require weights to be positive integers, or do they
>   support arbitrary float weights?
> - including negative weights?
>   (what physical meaning does a negative weight have?)
>
> At the moment, a simple helper function seems to do the trick for
> non-negative integer weights:
>
> def flatten(items):
>     for item in items:
>         yield from item
>
> py> data = [1, 2, 3, 4]
> py> weights = [1, 4, 1, 2]
> py> statistics.mean(flatten([x]*w for x, w in zip(data, weights)))
> 2.5
>
> In principle, the implementation could be as simple as a single
> recursive call:
>
> def mean(data, weights=None):
>     if weights is not None:
>         return mean(flatten([x]*w for x, w in zip(data, weights)))
>     # base case without weights is unchanged
>
> or perhaps it could be just a recipe in the docs.

--




[issue25412] __floordiv__ in module fraction fails with TypeError instead of returning NotImplemented

2015-10-16 Thread Oscar Benjamin

Oscar Benjamin added the comment:

You should test the change with number types that don't use the number tower, 
e.g. Decimal, sympy, gmpy2, mpmath's mpf, numpy arrays etc. Few non-stdlib 
types use the number ABCs so testing against numbers.Complex may cause a change 
in behaviour.

--
nosy: +oscarbenjamin




[issue25355] Windows 3.5 installer does not add python to "App Paths" key

2015-10-09 Thread Oscar Benjamin

New submission from Oscar Benjamin:

From the mailing list:
https://mail.python.org/pipermail/python-list/2015-October/697744.html

'''
The new installer for 3.5 doesn't create an "App Paths" key for
"python.exe" like the old installer used to do (see the old
Tools/msi/msi.py). Without that, unless python.exe is in the search
PATH, "Win+R -> python" and running "start python" in the command
prompt won't work. You can of course add the key manually as the
default value for
"[HKLM|HKCU]\Software\Microsoft\Windows\CurrentVersion\App
Paths\python.exe". IMO, this makes it the 'system' Python.
'''

Is this an intentional change or an oversight?

--
components: Installation
messages: 252618
nosy: oscarbenjamin
priority: normal
severity: normal
status: open
title: Windows 3.5 installer does not add python to "App Paths" key
versions: Python 3.5




[issue20575] Type handling policy for the statistics module

2014-02-09 Thread Oscar Benjamin

New submission from Oscar Benjamin:

As of issue20481, the statistics module for Python 3.4 will disallow any mixing 
of numeric types with the exception of int, which can mix with any other type 
(but only one at a time). My understanding is that this change was not 
necessarily considered to be a permanent policy but rather a quick fix for 
Python 3.4 in order to explicitly prevent certain confusing situations arising 
from mixing Decimal with other stdlib numeric types.

issue20499 has a lot of discussion about different ways to improve accuracy and 
speed for the mean, variance etc. functions in the statistics module. It's 
tricky though to come up with a concrete implementation without having a clear 
specification for how the module should handle different numeric types.

There are several related issues to do with type handling. Should the 
statistics module
1) Use the same coercion rules as the numeric tower (pep-3141)?
2) Allow Decimal to mix with any types from the numeric tower?
3) Allow non-stdlib types that don't use the numeric tower?
4) Allow any mixing of types at all?
5) Strive to achieve the maximum possible accuracy for every type that it 
accepts?

I don't personally see much of a use-case for mixing e.g. Decimal and Fraction. 
I don't think it's unreasonable to require users to choose a numeric type and 
stick to it. The common cases will almost certainly be either all int or all 
float so those should be the main targets of any speed optimisation.

If a user is using Fraction/Decimal then they must have gone out of their way 
to do so and they may as well do so consistently for all of their data. When 
choosing to use Fraction you do so because you want perfect accuracy. Mixing 
those Fractions with floating point types such as float and Decimal doesn't 
make any sense. There is admittedly a sense in which Decimals are also exact, 
since the Decimal constructor is always exact, but I don't think there's any 
case where the Decimal constructor can be used and the Fraction constructor 
cannot, so this mixing of types is unnecessary.

As with Fraction a user who chooses to use Decimal is going out of their way to 
do so because of the kind of accuracy guarantees that the type provides. It 
doesn't make any sense to mix these with floats that are inherently tainted 
with the wrong kind of rounding error. So mixing Decimal and float doesn't make 
any sense either.

Note that ordinary arithmetic prohibits the mixing of Decimal with 
Fraction/float so that on this point the statistics module is essentially 
maintaining a consistent position with respect to the policy of the Decimal 
type.

On the other hand ordinary arithmetic allows all of int, float, Fraction and 
complex and indeed any other type subscribing to the ABCs in the numeric tower 
to be mixed. As of issue20481 the statistics module does not allow any type 
mixing except for int:
http://hg.python.org/cpython/rev/5db74cd953ab
Note also that it uses type identity rather than subclass relationships or ABCs 
so that it is not even possible to mix e.g. float with a float subclass.

The most common case of mixing will almost certainly be int and float which 
will work. However I doubt that the current policy would be considered to be in 
keeping with Python's general policy on numeric types and anticipate that there 
will be a desire to change it in the future. The obvious candidate for a policy 
is the numeric tower and ABCs of PEP-3141. In that case the statistics module 
has a partial precedent on which to base its policy. The only tricky part is 
that Decimal is not part of the numeric tower. So there needs to be a special 
rule for Decimal such as it only mixes with int/Integral.

Basing the policy on the numeric tower is attractive but it is worth noting 
that the std lib types int, float, Fraction and Decimal are the only types that 
actually implement and register with these ABCs. So it's not much different 
from saying that those particular types (and subclasses of) are accepted but I 
think that that is better than the current policy. 

Third party numeric types don't implement the interfaces described in PEP-3141. 
However one thing that is implemented by every third-party numeric type that I 
know of is __float__. So if there was to be a desire to support those in the 
statistics module then the simplest extension of the policy on types is to say 
that any non-numeric-tower types will simply be coerced with float. This still 
leaves the issue about how type mixing works there but, again, perhaps the 
safest option before the need arises is just to say that no type mixing is 
allowed if any input object is not from the numeric tower.
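
A minimal sketch of that fallback policy (my own illustration; _coerce_input 
is a hypothetical helper, not part of the module):

from decimal import Decimal
from numbers import Real

def _coerce_input(x):
    # Accept numeric-tower types (plus Decimal) directly; fall back on
    # float() for anything else, since __float__ is the most widely
    # implemented interface on third-party numeric types.
    if isinstance(x, (Real, Decimal)):
        return x
    return float(x)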

What do you think?

--
components: Library (Lib)
messages: 210762
nosy: ncoghlan, oscarbenjamin, skrah, stevenjd, wolma
priority: normal
severity: normal
status: open
title: Type handling policy for the statistics module
type: enhancement
versions: Python 3.5

[issue20481] Clarify type coercion rules in statistics module

2014-02-08 Thread Oscar Benjamin

Oscar Benjamin added the comment:

> Close #20481: Disallow mixed type input in statistics

If I understand correctly the reason for hastily pushing this patch
through is that it's the safe option: disallow mixing types as a quick
fix for soon to be released 3.4. If we want to allow mixing types then
that can be a new feature for 3.5.

Is that correct?

If so should the discussion about what to do in 3.5 take place in this
issue (i.e. reopen for 3.5) or should it be a new issue? issue20499
would benefit from clarity about the statistics module policy for
mixing types.

--




[issue20499] Rounding errors with statistics.variance

2014-02-07 Thread Oscar Benjamin

Oscar Benjamin added the comment:

A fast Decimal.as_integer_ratio() would be useful in any case.

If you're going to use decimals though then you can trap inexact and
keep increasing the precision until it becomes exact. The problem is
with rationals that cannot be expressed in a finite number of decimal
digits - these need to be handled separately. I've attached
decimalsum.py that shows how to compute an exact sum of any mix of
int, float and Decimal, but not Fraction.

When I looked at this before, having special cases for everything from
int to float to Decimal to Fraction makes the code really complicated.
The common cases are int and float. For these cases sum() and fsum()
are much faster. However you need to also have code that checks
everything in the iterable.

One option is to do something like:

import math
import itertools
from decimal import Decimal
from decimalsum import decimalsum

def _sum(numbers):
    subtotals = []
    for T, nums in itertools.groupby(numbers, type):
        if T is int:
            subtotals.append(sum(nums))
        elif T is float:
            subtotals.append(math.fsum(nums))
        elif T is Decimal:
            subtotals.append(decimalsum(nums))
        else:
            raise NotImplementedError
    return decimalsum(subtotals)

The main problem here is that fsum rounds every time it returns, meaning that
this sum is order-dependent if there is a mix of floats and other types (see
issue19086 where I asked for a way to change that).

Also having separate code blocks to manage all the different types
internally in e.g. the less trivial variance calculations is tedious.

--
Added file: http://bugs.python.org/file33960/decimalsum.py

from decimal import getcontext, Inexact, Decimal

def decimalsum(iterable, start=Decimal('0')):
    '''Exact sum of Decimal/int/float mix; Result is *unrounded*'''
    if not isinstance(start, Decimal):
        start = Decimal(start)
    # We need our own context and we can't just set it once because
    # the loop could be over a generator/iterator/coroutine
    ctx = getcontext().copy()
    ctx.traps[Inexact] = True
    one = Decimal(1)

    total = start
    for x in iterable:
        if not isinstance(x, Decimal):
            x = Decimal(x)
        # Increase the precision until we get an exact result.
        while True:
            try:
                total = total.fma(one, x, ctx)
                break
            except Inexact:
                ctx.prec *= 2

    # Result is exact and unrounded.
    return total


D = Decimal
assert decimalsum([D(1.02), 3e100, D(0.98), -3e100]) == 2



[issue20499] Rounding errors with statistics.variance

2014-02-06 Thread Oscar Benjamin

Changes by Oscar Benjamin oscar.j.benja...@gmail.com:


--
nosy: +wolma




[issue20481] Clarify type coercion rules in statistics module

2014-02-04 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I was working on the basis that we were talking about Python 3.5.

But now I see that it's a 3.4 release blocker. Is it really that urgent?

I think the current behaviour is very good at handling a wide range of types.
It would be nice to consistently report errors for incompatible types but it
can also just be documented as a thing that users shouldn't do.

If there were a situation where it silently returned a highly inaccurate
value I would consider that urgent but I don't think there is.

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-03 Thread Oscar Benjamin

Oscar Benjamin added the comment:

> in my previous message. To support weights (float or Rational) this would
> have to be more sophisticated.

I guess you'd do:

    for x, w in data.items():
        T = _coerce_types(T, type(x))
        xn, xd = exact_ratio(x)
        wn, wd = exact_ratio(w)
        d = xd * wd
        partials[d] = partials_get(d, 0) + xn * wn

Variance is only slightly trickier. Median would be more complicated.
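
For the mean at least, a minimal sketch of the mapping version (my own 
illustration; mean_map is a hypothetical name) could use exact Fraction 
arithmetic throughout:

from fractions import Fraction

def mean_map(data):
    # data is a {value: weight} mapping
    total = weight = Fraction(0)
    for x, w in data.items():
        total += Fraction(x) * Fraction(w)
        weight += Fraction(w)
    return float(total / weight)

mean_map({1: 1, 2: 4, 3: 1, 4: 2})   # 2.5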

I just think that I prefer to know when I look at code that something is being
treated as a mapping or as an iterable. So when I look at

d = f(x, y, z)
v = variance_map(d)

It's immediately obvious what d is and how the function variance_map is using
it.

As well as the benefit of readability there's also the fact that accepting
different kinds of input puts strain on any attempt to modify your code in the
future. Auditing the code requires understanding at all times that the name
data is bound to a quantum superposition of different types of object.

Either every function would have to have the same iterable or mapping
interface or there would have to be some other convention for making it clear
which ones do. Perhaps the functions that don't make sense for a mapping could
explicitly reject them rather than treating them as an iterable.

I just think it's simpler to have a different function name for each type of
input. Then it's clear what functions are available for working with mappings.

If you were going for something completely different then you could have an
object-oriented interface where there are classes for the different types of
data and methods that do the right thing in each case.

Then you would do

    v = WeightedData(d).variance()

The ordinary variance() function could just become a shortcut for

    def variance(data):
        return SequenceData(data).variance()

--




[issue20481] Clarify type coercion rules in statistics module

2014-02-03 Thread Oscar Benjamin

Oscar Benjamin added the comment:

It's not as simple as registering with an ABC. You also need to provide the
interface that the ABC represents:

>>> import sympy
>>> r = sympy.Rational(1, 2)
>>> r
1/2
>>> r.numerator
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Half' object has no attribute 'numerator'

AFAIK there are no plans by any third party libraries to increase their
inter-operability with the numeric tower.

My point is that in choosing what types to accept and how to coerce them you
should focus on actual practical benefits rather than theoretical ones. If it
can be done in a way that means it works for more numeric types then that's
great. But when I say "works" I mean that it should ideally achieve the best
possible accuracy for each type.

If that's not possible then it might be simplest to just document how it works
for combinations of the std lib types (and perhaps subclasses thereof) and
then say that it will fall back on coercing to float for anything else. This
approach is simpler to document and for end-users to understand. It also has
the benefit that it will work for all non std lib types (that I'm aware of)
without pretending to achieve more accuracy than it can.

>>> import sympy, fractions, gmpy
>>> fractions.Fraction(sympy.Rational(1, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/fractions.py", line 148, in __new__
    raise TypeError("argument should be a string "
TypeError: argument should be a string or a Rational instance
>>> float(sympy.Rational(1, 2))
0.5
>>> fractions.Fraction(gmpy.mpq(1, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/fractions.py", line 148, in __new__
    raise TypeError("argument should be a string "
TypeError: argument should be a string or a Rational instance
>>> float(gmpy.mpq(1, 2))
0.5

Coercion to float via __float__ is well supported in the Python ecosystem.
Consistent support for getting exact integer ratios is (unfortunately) not.
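
A sketch of coping with that in practice (my own illustration; this mirrors,
but is not, any stdlib helper):

def exact_ratio(x):
    # Try the exact interfaces first; fall back on float coercion, which
    # is lossy for exotic types but universally available.
    try:
        return x.as_integer_ratio()
    except AttributeError:
        pass
    try:
        return (x.numerator, x.denominator)
    except AttributeError:
        return float(x).as_integer_ratio()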

--




[issue20389] clarify meaning of xbar and mu in pvariance/variance of statistics module

2014-02-03 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I agree that the current wording in the doc-strings is ambiguous. It should be 
more careful to distinguish between

mu : true/population mean
xbar : estimated/sample mean

I disagree that the keyword arguments should be made the same. There is an 
important conceptual difference between these two things that the user needs to 
be aware of and mu, xbar - as symbols rather than ascii characters - are widely 
used for this. See e.g. this Wikipedia entry (although it uses ybar instead of 
xbar):
http://en.wikipedia.org/wiki/Variance#Population_variance
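
To illustrate the distinction with the statistics module's API (my own 
example; both parameters are optional):

import statistics

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
xbar = statistics.mean(data)
statistics.variance(data, xbar)     # xbar: the estimated/sample mean
statistics.pvariance(data, mu=1.5)  # mu: a known true/population mean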

--
nosy: +oscarbenjamin




[issue20499] Rounding errors with statistics.variance

2014-02-03 Thread Oscar Benjamin

New submission from Oscar Benjamin:

The mean/variance functions in the statistics module don't quite round 
correctly.

The reasons for this are that although exact rational arithmetic is used 
internally in the _sum function it is not used throughout the module. In 
particular the _sum function should be changed to return an exact result and 
exact arithmetic should be used right up to the point before returning to the 
user at which point a rounding/coercion should be used to give the user their 
answer in the appropriate type correctly rounded once.

Using exact arithmetic everywhere makes it possible to replace all of the 
variance* functions with single-pass algorithms based on the computational 
formula for variance which should be more efficient as well.

For example the following implements pvariance so that it returns a perfectly 
rounded result for float input and output in a single pass:

from fractions import Fraction

def pvariance(data):
    sx = 0
    sx2 = 0
    for n, x in enumerate(map(Fraction, data), 1):
        sx += x
        sx2 += x ** 2
    Ex = sx / n
    Ex2 = sx2 / n
    var = Ex2 - Ex ** 2
    return float(var)

Comparing the above with the statistics module:

>>> pvariance([0, 0, 1])
0.2222222222222222
>>> statistics.pvariance([0, 0, 1])
0.22222222222222224

The true answer is:

>>> from fractions import Fraction as F
>>> float(statistics.pvariance([F(0), F(0), F(1)]))
0.2222222222222222

The logic in the _sum function for computing exact integer ratios and coercing 
back to the output type could be moved into utility functions so that it does 
not need to be duplicated.

Some examples of rounding issues:

>>> from statistics import variance, mean
>>> from decimal import Decimal as D, getcontext
>>> from fractions import Fraction as F

Variance with ints or floats returns a float but the float is not quite the 
nearest possible float:

>>> variance([0, 0, 2])
1.3333333333333335
>>> float(variance([F(0), F(0), F(2)]))  # true result rounded once
1.3333333333333333

Another example with Decimal:

>>> getcontext().prec = 5
>>> getcontext()
Context(prec=5, rounding=ROUND_HALF_EVEN, Emin=-999999, Emax=999999,
capitals=1, clamp=0, flags=[Rounded, Inexact], traps=[DivisionByZero, Overflow,
InvalidOperation])

>>> variance([D(0), D(0), D(2)] * 2)  # Rounded down instead of up
Decimal('1.0666')
>>> r = (variance([F(0), F(0), F(2)] * 2))
>>> D(r.numerator) / r.denominator  # Correctly rounded
Decimal('1.0667')

The mean function may also not be correctly rounded:

>>> getcontext().prec = 2
>>> r = mean((F('1.2'), F('1.3'), F('1.55')))
>>> r
Fraction(27, 20)
>>> D(r.numerator) / r.denominator # Correctly rounded
Decimal('1.4')
>>> mean([D('1.2'), D('1.3'), D('1.55')])
Decimal('1.3')

--
components: Library (Lib)
messages: 210121
nosy: oscarbenjamin, stevenjd
priority: normal
severity: normal
status: open
title: Rounding errors with statistics.variance
versions: Python 3.5




[issue20481] Clarify type coercion rules in statistics module

2014-02-03 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I agree that supporting non-stdlib types is in some ways a separate issue from
how to manage coercion with mixed stdlib types. Can you provide a complete
patch (e.g. hg diff > coerce_types.patch)?
http://docs.python.org/devguide/

There should probably also be tests added for situations where the current
implementation behaves undesirably. Something like these ones:
http://hg.python.org/cpython/file/a97ce3ecc96a/Lib/test/test_statistics.py#l1445

Note that when I said non-stdlib types can be handled by coercing to float I
didn't mean that the output should be coerced to float but rather the input
should be coerced to float because __float__ is the most consistent interface
available on third party numeric types.

Once the input numbers are converted to float statistics._sum can handle them
perfectly well. In this case I think the output should also be a float so that
it's clear that precision may have been lost. If the precision of float is not
what the user wants then the documentation can point them toward
Fraction/Decimal.

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-02 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 2 February 2014 11:55, Steven D'Aprano rep...@bugs.python.org wrote:

> (1) separate functions, as Nick suggests:
> mean vs weighted_mean, stdev vs weighted_stdev

This would be my preferred approach. It makes it very clear which
functions are available for working with map style data. It will be
clear from both the module documentation and a casual introspection of
the module that those APIs are present for those who might want them.
Also apart from mode() the implementation of each function on
map-format data will be completely different from the iterable version
so you'd want to have it as a separate function at least internally
anyway.

--




[issue20481] Clarify type coercion rules in statistics module

2014-02-02 Thread Oscar Benjamin

Oscar Benjamin added the comment:

Wolfgang have you tested this with any third party numeric types from
sympy, gmpy2, mpmath etc.?

Last I checked no third party types implement the numbers ABCs e.g.:

>>> import sympy, numbers
>>> r = sympy.Rational(1, 2)
>>> r
1/2
>>> isinstance(r, numbers.Rational)
False

AFAICT testing against the numbers ABCs is just a slow way of testing
against the stdlib types:

$ python -m timeit -s 'from numbers import Integral' 'isinstance(1, Integral)'
100000 loops, best of 3: 2.59 usec per loop
$ python -m timeit -s 'from numbers import Integral' 'isinstance(1, int)'
1000000 loops, best of 3: 0.31 usec per loop

You can at least make it faster using a tuple:

$ python -m timeit -s 'from numbers import Integral' 'isinstance(1,
(int, Integral))'
1000000 loops, best of 3: 0.423 usec per loop

I'm not saying that this is necessarily a worthwhile optimisation but
rather that the numbers ABCs are in practice not really very useful
(AFAICT).

I don't know how well the statistics module currently handles third
party numeric types but if the type coercion is to be changed then
this something to look at. The current implementation tries to
duck-type to some extent and yours uses ABCs but does either approach
actually have any practical gain for interoperability with non-stdlib
numeric types? If not then it would be simpler just to explicitly
hard-code exactly how it works for the powerset of stdlib types.

OTOH if it could be made to do sensible things with non-stdlib types
then that would be great. Falling back on float isn't a bad choice but
if it can be made to do exact things for exact types (e.g.
sympy.Rational) then that would be great. Similarly mpmath.mpf
provides multi-precision floats. It would be great to be able to take
advantage of that higher precision rather than downgrade everything to
float.

This is in general a hard problem though so I don't think it's
unreasonable to make restrictions about what types it can work with -
achieving optimal accuracy for all types without restrictions is
basically impossible.

--




[issue12641] Remove -mno-cygwin from distutils

2013-10-01 Thread Oscar Benjamin

Oscar Benjamin added the comment:

Thanks Antoine!

--




[issue19086] Make fsum usable incrementally.

2013-09-30 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I should be clearer about my intentions. I'm hoping to create an
efficient and accurate sum() function for the new stdlib statistics
module:
http://bugs.python.org/issue18606
http://www.python.org/dev/peps/pep-0450/

The sum() function currently proposed can be seen in this patch:
http://bugs.python.org/file31680/statistics_combined.patch

It uses Fractions to compute an exact sum and separately tries to keep
track of type coercion to convert the end result. It is accurate and
returns a result of the appropriate type for any mix of ints,
Fractions, Decimals and floats with the caveat that mixing Fractions
and Decimals is disallowed. I believe the most common use-cases will
be for ints/floats and for these cases it is 100x slower than
sum/fsum.

The discussion here:
http://bugs.python.org/issue18606#msg195630
resulted in the sum function that you can see in the above patch.
Following that I've had off-list discussions with the author of the
module about improving the speed of the sum function (which is a
bottleneck for half of the functions exposed by the module). He is
very much interested in speed optimisations but doesn't want to
compromise on accuracy.

My own interest is that I would like an efficient and accurate (for
all types) sum function *somewhere* in the stdlib. The addition of the
statistics module with its new sum() function is a good opportunity to
achieve this. If this were purely for myself I wouldn't bother this
much with speed optimisation since I've probably already spent more of
my own time thinking about this function than I ever would running it!
(If I did want to specifically speed this up in my own work I would,
as you say, use cython).

If fsum() were modified in the way that I describe then it would be
usable as a primitive for the statistics.sum() function and also for
parallel etc. computation. I agree that the carry argument is
unnecessary and that the main gain is just the exactness of the return
value: it seems a shame that fsum could so easily give me an exact
result but doesn't.

As for exposing the internals of fsum, is it not the case that a
*non-overlapping* set of floats having a given exact sum is a uniquely
defined set? For msum I believe it is only necessary to strip out
zeros in order to normalise the output so that it is uniquely defined
by the true value of the sum in question (the list is already in
ascending order once the zeros are removed).

Uniqueness: If a many-digit float is given by something like
(+-)abcdefghijkl * 2 ** x and we want to break it into non-overlapping
2-digit floats of the form ab*2**x. Since the first digit of each
2-digit float must be non-zero (consider denormals as a separate case)
the largest magnitude float is uniquely determined. Subtract that from
the many-digit total and then the next largest float is uniquely
determined and so on. There can be at most 1 denormal number in the
result and this is also uniquely determined once the larger numbers
are extracted. So the only way that the partials list can be
non-unique is by the inclusion of zeros (unless I've missed something
:).

--




[issue12641] Remove -mno-cygwin from distutils

2013-09-30 Thread Oscar Benjamin

Oscar Benjamin added the comment:

Thanks for looking at this Antoine.

I've attached an updated patch for Python 2.7 called
check_mno_cywin_py27_2.patch. This explicitly closes the popen object
in the same way as the get_versions() function immediately above.

I've just signed an electronic contributor's agreement.

--
Added file: http://bugs.python.org/file31919/check_mno_cywin_py27_2.patch

diff -r a7db9f505e88 Lib/distutils/cygwinccompiler.py
--- a/Lib/distutils/cygwinccompiler.py  Sun Jun 23 16:12:32 2013 -0400
+++ b/Lib/distutils/cygwinccompiler.py  Mon Sep 30 12:01:34 2013 +0100
@@ -319,13 +319,18 @@
         else:
             entry_point = ''
 
-        self.set_executables(compiler='gcc -mno-cygwin -O -Wall',
-                             compiler_so='gcc -mno-cygwin -mdll -O -Wall',
-                             compiler_cxx='g++ -mno-cygwin -O -Wall',
-                             linker_exe='gcc -mno-cygwin',
-                             linker_so='%s -mno-cygwin %s %s'
-                                    % (self.linker_dll, shared_option,
-                                       entry_point))
+        if self.gcc_version < '4' or is_cygwingcc():
+            no_cygwin = ' -mno-cygwin'
+        else:
+            no_cygwin = ''
+
+        self.set_executables(compiler='gcc%s -O -Wall' % no_cygwin,
+                             compiler_so='gcc%s -mdll -O -Wall' % no_cygwin,
+                             compiler_cxx='g++%s -O -Wall' % no_cygwin,
+                             linker_exe='gcc%s' % no_cygwin,
+                             linker_so='%s%s %s %s'
+                                    % (self.linker_dll, no_cygwin,
+                                       shared_option, entry_point))
         # Maybe we should also append -mthreads, but then the finished
         # dlls need another dll (mingwm10.dll see Mingw32 docs)
         # (-mthreads: Support thread-safe exception handling on `Mingw32')
@@ -447,3 +452,12 @@
         else:
             dllwrap_version = None
         return (gcc_version, ld_version, dllwrap_version)
+
+def is_cygwingcc():
+    '''Try to determine if the gcc that would be used is from cygwin.'''
+    out = os.popen('gcc -dumpmachine', 'r')
+    out_string = out.read()
+    out.close()
+    # out_string is the target triplet cpu-vendor-os
+    # Cygwin's gcc sets the os to 'cygwin'
+    return out_string.strip().endswith('cygwin')



[issue12641] Remove -mno-cygwin from distutils

2013-09-30 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 30 September 2013 12:08, Oscar Benjamin rep...@bugs.python.org wrote:
> I've attached an updated patch for Python 2.7 called
> check_mno_cywin_py27_2.patch.

To be clear: I retested this patch (using the setup described above)
and the results are unchanged.

--




[issue19086] Make fsum usable incrementally.

2013-09-30 Thread Oscar Benjamin

Oscar Benjamin added the comment:

Fair enough.

Thanks again for taking the time to look at this.

--




[issue19086] Make fsum usable incrementally.

2013-09-28 Thread Oscar Benjamin

Oscar Benjamin added the comment:

Thanks for responding Raymond.

Raymond Hettinger wrote:
> A start argument won't help you, because you will discard information
> on input. A sequence like [1E100, 0.1, -1E100, 0.1] wouldn't work when
> split into subtotal=fsum([1E100, 0.1]) and fsum([-1E100, 0.1],
> start=subtotal).

I'm not sure if you've fully understood the proposal.

The algorithm underlying fsum is (I think) called distillation. It
compresses a list of floats into a shorter list of non-overlapping floats
having the same exact sum. Where fsum deviates from this idea is that it
doesn't return a list of floats, rather it returns only the largest float.
My proposal is that there be a way to tell fsum to return the list of
floats whose sum is exact and not rounded. Specifically the subtotal that
is returned and fed back into fsum would be a list of floats and no
information would be discarded. So fsum(numbers, []) would return a list of
floats and that list can be passed back in again.

> > My motivation for this is that I want to be able to write
> > an efficient sum function that is accurate for a mix of ints
> > and floats
>
> FWIW, fsum() already works well with integers as long as they don't
> exceed 53 bits of precision.

I was simplifying the use-case somewhat. I would also like this sum
function to work with fractions and decimals neither of which coerces
exactly to float (and potentially I want some non-stdlib types also).

> For such exotic use cases, the decimal module would be a better
> alternative. Now that the decimal module has a C implementation, there is
> no reason not to use it for high precision applications.

It is possible to coerce to Decimal and exactly sum a list of ints, floats
and decimals (by trapping inexact and increasing the context precision). I
have tried this under CPython 3.3 and it was 20-50x slower than fsum
depending on how it manages the arithmetic context and whether it can be
used safely with a generator that also manipulates the context - the safe
version that doesn't build a list is 50x slower. It is also not possible to
use Decimal exactly with Fractions.

I believe that there are other use-cases for having fsum be usable
incrementally. This would make it usable for accurate summation in
incremental, parallel and distributed computation also. Unfortunately fsum
itself already isn't used as much as it should be and I agree that probably
all the use cases for this extension would be relatively obscure.

--




[issue19086] Make fsum usable incrementally.

2013-09-25 Thread Oscar Benjamin

New submission from Oscar Benjamin:

I would like to be able use fsum incrementally however it is not currently 
possible.

With sum() you can do:

subtotal = sum(nums)
subtotal = sum(othernums, subtotal)

This wouldn't work for fsum() because the returned float is not the same as the 
state maintained internally which is a list of floats. I propose instead that a 
fsum() could take an optional second argument which is a list of floats and in 
this case return the updated list of floats.

I have modified Raymond Hettinger's msum() recipe to do what I want:

def fsum(iterable, carry=None):
    "Full precision summation using multiple floats for intermediate values"
    partials = list(carry) if carry else []
    for x in iterable:
        i = 0
        for y in partials:
            if abs(x) < abs(y):
                x, y = y, x
            hi = x + y
            lo = y - (hi - x)
            if lo:
                partials[i] = lo
                i += 1
            x = hi
        partials[i:] = [x]
    if carry is None:
        return sum(partials, 0.0)
    else:
        return partials


Here's an interactive session showing how you might use it:

>>> from fsum import fsum
>>> fsum([1e20, 1, -1e20])
1.0
>>> fsum([1e20, 1, -1e20], [])
[1.0, 0.0]
>>> fsum([1e20, 1, -1e20], [])
[1.0, 0.0]
>>> fsum([1e20, 1, -1e20], [])
[1.0, 0.0]
>>> fsum([1e20, 1], [])
[1.0, 1e+20]
>>> carry = fsum([1e20, 1], [])
>>> fsum([-1e20], carry)
[1.0, 0.0]
>>> nums = [7, 1e100, -7, -1e100, -9e-20, 8e-20] * 10
>>> subtotal = []
>>> for n in nums:
...     subtotal = fsum([n], subtotal)
...
>>> subtotal
[-1.0000000000000007e-19]
>>> fsum(subtotal)
-1.0000000000000007e-19


My motivation for this is that I want to be able to write an efficient sum 
function that is accurate for a mix of ints and floats while being as fast as 
possible in the case that I have a list of only floats (or only ints). What I 
have so far looks a bit like:

from itertools import groupby

def sum(numbers):
    exact_total = 0
    float_total = 0.0
    for T, nums in groupby(numbers, type):
        if T is int:
            exact_total = sum(nums, exact_total)
        elif T is float:
            # This doesn't really work:
            float_total += fsum(nums)
            # ...


However fsum is only exact if it adds all the numbers in a single pass. The 
above would have order-dependent results given a mixed list of ints and floats 
e.g.:

[1e20, -1e20, 1, -1, 1.0, -1.0]
  vs
[1e20, 1.0, -1e20, 1, -1, -1.0]

Although fsum is internally very accurate it always discards information on 
output. Even if I build a list of all floats and fsum those at the end it can 
still be inaccurate if the exact_total cannot be exactly coerced to float and 
passed to fsum.

If I could get fsum to return the exact float expansion then I could use that 
in a number of different ways. Given the partials list I can combine all of the 
different subtotals with Fraction arithmetic and coerce to float only at the 
very end. If I can also get fsum to accept a float expansion on input then I 
can use it incrementally and there is no need to build a list of all floats.
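
A sketch of that final combination step (my own illustration, assuming fsum 
returned its partials list as proposed):

from fractions import Fraction

def combine_exact(exact_total, float_partials):
    # Every float is exactly representable as a Fraction, so the combined
    # total is exact and is rounded only once, at the very end.
    total = Fraction(exact_total)
    for p in float_partials:
        total += Fraction(p)
    return float(total)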

I am prepared to write a patch for this if the idea is deemed acceptable.

--
components: Library (Lib)
messages: 198381
nosy: mark.dickinson, oscarbenjamin, rhettinger
priority: normal
severity: normal
status: open
title: Make fsum usable incrementally.
type: enhancement
versions: Python 3.4, Python 3.5




[issue18821] Add .lastitem attribute to takewhile instances

2013-09-08 Thread Oscar Benjamin

Oscar Benjamin added the comment:

Thank you Claudiu very much for writing a patch; I was expecting to
have to do that myself!

Serhiy, you're right that groupby is a better fit for this. It does mean a
bit of reworking for the (more complicated) sum function I'm working
on, but I've just checked with timeit and it performs very well using
the type function as the key. I think it might make the function a
few times faster than takewhile in my common cases, for reasons that
are particular to this problem.
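
For reference, a minimal sketch of the kind of dispatch I mean (plain
stdlib groupby; the data is made up):

from itertools import groupby

data = [1, 2, 3, 0.25, 0.5, 2]
for T, nums in groupby(data, type):
    print(T.__name__, list(nums))
# int [1, 2, 3]
# float [0.25, 0.5]
# int [2]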

Raymond, thanks for taking the time to consider this. I agree that it
should now be closed.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18821
___



[issue18606] Add statistics module to standard library

2013-08-27 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On Aug 28, 2013 1:43 AM, janzert rep...@bugs.python.org wrote:

> Seems that the discussion is now down to implementation issues and the
> PEP is at the point of needing to ask python-dev for a PEP dictator?

I would say so. AFAICT Steven has addressed all of the issues that have
been raised. I've read through the module in full and I'm happy with the
API/specification exactly as it now is (including the sum function since
the last patch).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18606
___



[issue18821] Add .lastitem attribute to takewhile instances

2013-08-23 Thread Oscar Benjamin

New submission from Oscar Benjamin:

I've often wanted to be able to query a takewhile object to discover the item 
that failed the predicate but the item is currently discarded.

A usage example:

from numbers import Integral
from math import fsum

def sum(items):
    it = iter(items)
    # takewhile here is the proposed version with a .lastitem attribute
    ints = takewhile(Integral.__instancecheck__, it)
    subtotal = sum(ints)
    if not hasattr(ints, 'lastitem'):
        return subtotal
    floats = takewhile(float.__instancecheck__, it)
    subtotalf = fsum(floats)
    if not hasattr(floats, 'lastitem'):
        return subtotal + subtotalf
    # Deal with more types
    ...


Loosely what I'm thinking is this but perhaps with different attribute names:


class takewhile:
    def __init__(self, pred, iterable):
        self.pred = pred
        self.iterable = iterable
        self.failed = False
    def __iter__(self):
        for item in self.iterable:
            if self.pred(item):
                yield item
            else:
                self.failed = True
                self.lastitem = item
                return
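
With that sketch the usage example above would behave something like this
(hypothetical output, since nothing like this exists in itertools yet):

items = takewhile(lambda x: isinstance(x, int), iter([1, 2, 3, 2.5, 4]))
print(list(items))     # [1, 2, 3]
print(items.failed)    # True
print(items.lastitem)  # 2.5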

--
components: Library (Lib)
messages: 195962
nosy: oscarbenjamin
priority: normal
severity: normal
status: open
title: Add .lastitem attribute to takewhile instances
type: enhancement
versions: Python 3.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18821
___



[issue18606] Add statistics module to standard library

2013-08-22 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 22 August 2013 03:43, Steven D'Aprano rep...@bugs.python.org wrote:

> If Oscar is willing, I'd like to discuss some of his ideas off-list, but that
> may take some time.

I am willing and it will take time.

I've started reading the paper that Raymond Hettinger references for
the algorithm used in his accurate float sum recipe. I'm not sure why
yet but the algorithm is apparently provably exact only for binary
radix floats so isn't appropriate for decimals. It does seem to give
*very* accurate results for decimals though so I suspect the issue is
just about cases that are on the cusp of the rounding mode. In any
case the paper cites a previous work that gives an algorithm that
apparently works for floating point types with arbitrary radix and
exact rounding; it would be good for that to live somewhere in Python
but I haven't had a chance to look at the paper yet.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18606
___



[issue12641] Remove -mno-cygwin from distutils

2013-08-21 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I just noticed today that the fix implemented by these patches
(only providing -mno-cygwin if gcc_ver < 4) is also used by numpy's
distutils. You can see the relevant code here:

https://github.com/numpy/numpy/blob/master/numpy/distutils/mingw32ccompiler.py#L117

The relevant commit was three years ago:

https://github.com/numpy/numpy/commit/9dd7c7b8ad826beefbbc0c10ff457c62f1be223d

They haven't bothered with checking for cygwin gcc (my patches only do
this to try and show a helpful error message).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue18606] Add statistics module to standard library

2013-08-19 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I've just checked over the new patch and it all looks good to me apart
from one quibble.

It is documented that statistics.sum() will respect rounding errors
due to decimal context (returning the same result that sum() would). I
would prefer it if statistics.sum would use compensated summation with
Decimals since in my view they are a floating point number
representation and are subject to arithmetic rounding error in the
same way as floats. I expect that the implementation of sum() will
change but it would be good to at least avoid documenting this IMO
undesirable behaviour.

So with the current implementation I can do:

>>> from decimal import Decimal as D, localcontext, Context, ROUND_DOWN
>>> data = [D('0.1375'), D('0.2108'), D('0.3061'), D('0.0419')]
>>> print(statistics.variance(data))
0.01252909583333333333333333333
>>> with localcontext() as ctx:
...     ctx.prec = 2
...     ctx.rounding = ROUND_DOWN
...     print(statistics.variance(data))
...
0.010

The final result is not accurate to 2 d.p. rounded down. This is
because the decimal context has affected all intermediate computations
not just the final result. Why would anyone prefer this behaviour over
an implementation that could compensate for rounding errors and return
a more accurate result?

If statistics.sum and statistics.add_partial are modified in such a
way that they use the same compensated algorithm for Decimals as they
would for floats then you can have the following:

>>> statistics.sum([D('-1e50'), D('1'), D('1e50')])
Decimal('1')

whereas it currently does:

>>> statistics.sum([D('-1e50'), D('1'), D('1e50')])
Decimal('0E+23')
>>> statistics.sum([D('-1e50'), D('1'), D('1e50')]) == 0
True

It still doesn't fix the variance calculation but I'm not sure exactly
how to do better than the current implementation for that. Either way
though I don't think the current behaviour should be a documented
guarantee. The meaning of honouring the context implies using a
specific sum algorithm, since an alternative algorithm would give a
different result and I don't think you should constrain yourself in
that way.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18606
___



[issue18606] Add statistics module to standard library

2013-08-19 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 19 August 2013 17:35, Steven D'Aprano rep...@bugs.python.org wrote:

> Steven D'Aprano added the comment:
>
> On 19/08/13 23:15, Oscar Benjamin wrote:
>
>> The final result is not accurate to 2 d.p. rounded down. This is
>> because the decimal context has affected all intermediate computations
>> not just the final result.
>
> Yes. But that's the whole point of setting the context to always round down.
> If summation didn't always round down, it would be a bug.

If individual binary summation (d1 + d2) didn't round down then that
would be a bug.
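
For example (a small sketch of the constraint I mean; only the single
addition is required to round down):

from decimal import Decimal as D, localcontext, ROUND_DOWN

with localcontext() as ctx:
    ctx.prec = 3
    ctx.rounding = ROUND_DOWN
    print(D('1.00') + D('0.005'))   # 1.00 -- this one add must round down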

> If you set the precision to a higher value, you can avoid the need for
> compensated summation. I'm not prepared to pick and choose which contexts
> I'll honour. If I honour those with a high precision, I'll honour those with
> a low precision too. I'm not going to check the context, and if it is too
> low (according to whom?) set it higher.

I often write functions like this:

def compute_stuff(x):
    with localcontext() as ctx:
        ctx.prec += 2
        y = ...  # Compute in higher precision
    return +y  # __pos__ reverts to the default precision

The final result is rounded according to the default context but the
intermediate computation is performed in such a way that the final
result is (hopefully) correct within its context. I'm not proposing
that you do that, just that you don't commit to respecting inaccurate
results.
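
A concrete version of that pattern (mean_of is just an illustrative name):

from decimal import Decimal, localcontext

def mean_of(xs):
    with localcontext() as ctx:
        ctx.prec += 2                    # guard digits for the internals
        y = sum(xs, Decimal(0)) / len(xs)
    return +y                            # one rounding, at the caller's precision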

>> Why would anyone prefer this behaviour over
>> an implementation that could compensate for rounding errors and return
>> a more accurate result?
>
> Because that's what the Decimal standard requires (as I understand it), and
> besides you might be trying to match calculations on some machine with a
> lower precision, or different rounding modes. Say, a pocket calculator, or a
> Cray, or something. Or demonstrating why rounding matters.

No that's not what the Decimal standard requires. Okay I haven't fully
read it but I am familiar with these standards and I've read a good
bit of IEEE-754. The standard places constrainst on low-level
arithmetic operations that you as an implementer of high-level
algorithms can use to ensure that your code is accurate.

Following your reasoning above I should say that math.fsum and your
statistics.sum are both in violation of IEEE-754 since
fsum([a, b, c, d, e])
is not equivalent to
((((a+b)+c)+d)+e)
under the current rounding scheme. They are not in violation of the
standard: both functions use the guarantees of the standard to
guarantee their own accuracy. Both go to some lengths to avoid
producing output with the rounding errors that sum() would produce.
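
The plain float case shows what I mean (math.fsum is in the stdlib):

import math

nums = [1e16, 1.0, -1e16]
print(sum(nums))        # 0.0 -- 1e16 + 1.0 rounds straight back to 1e16
print(math.fsum(nums))  # 1.0 -- the exact sum, rounded once at the end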

> I think the current behaviour is the right thing to do, but I appreciate the
> points you raise. I'd love to hear from someone who understands the Decimal
> module better than I do and can confirm that the current behaviour is in the
> spirit of the Decimal module.

I use the Decimal module for multi-precision real arithmetic. That may
not be the typical use-case but to me Decimal is a floating point type
just like float. Precisely the same reasoning that leads to fsum
applies to Decimal just as it does to float.

(BTW I've posted on Raymond Hettinger's recipe a modification that
might make it work for Decimal but no reply yet.)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18606
___



[issue18606] Add statistics module to standard library

2013-08-12 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 12 August 2013 20:20, Steven D'Aprano rep...@bugs.python.org wrote:
> On 12/08/13 19:21, Mark Dickinson wrote:
>> About the implementation of sum:
> add_partial is no longer documented as a public function, so I'm open to
> switching algorithms in the future.

Along similar lines it might be good to remove the doc-test for using
decimal.ROUND_DOWN. I can't see any good reason for anyone to want
that behaviour when e.g. computing the mean() whereas I can see
reasons for wanting to reduce rounding error for decimal in
statistics.sum. It might be a good idea not to tie yourself to the
guarantee implied by that test.

I tried an alternative implementation of sum() that can also reduce
rounding error with decimals but it failed that test (by making the
result more accurate). Here's the sum() I wrote:

import decimal
import numbers

def sum(data, start=0):

    if not isinstance(start, numbers.Number):
        raise TypeError('sum only accepts numbers')

    inexact_types = (float, complex, decimal.Decimal)
    def isexact(num):
        return not isinstance(num, inexact_types)

    if isexact(start):
        exact_total, inexact_total = start, 0
    else:
        exact_total, inexact_total = 0, start

    carrybits = 0

    for x in data:
        if isexact(x):
            exact_total = exact_total + x
        else:
            new_inexact_total = inexact_total + (x + carrybits)
            carrybits = -(((new_inexact_total - inexact_total) - x) - carrybits)
            inexact_total = new_inexact_total

    return (exact_total + inexact_total) + carrybits

It is more accurate for e.g. the following:
nums = [decimal.Decimal(10 ** n) for n in range(50)]
nums += [-n for n in reversed(nums)]
assert sum(nums) == 0

However there will also be other situations where it is less accurate such as
print(sum([-1e30, +1e60, 1, 3, -1e60, 1e30]))
so it may not be suitable as-is.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18606
___



[issue18606] Add statistics module to standard library

2013-08-09 Thread Oscar Benjamin

Oscar Benjamin added the comment:

One small point:

I think that the argument `m` to variance, pvariance, stdev and pstdev
should be renamed to `mu` for pvariance/pstdev and `xbar` for
variance/stdev. The doc-strings should carefully distinguish that `mu`
is the true/population mean and `xbar` is the estimated/sample mean
and refer to this difference between the function variants.
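
That is, something like (signatures only, purely illustrative):

def variance(data, xbar=None): ...   # xbar: sample mean estimated from data
def pvariance(data, mu=None): ...    # mu: the true/population mean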

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18606
___



[issue18606] Add statistics module to standard library

2013-08-06 Thread Oscar Benjamin

Changes by Oscar Benjamin oscar.j.benja...@gmail.com:


--
nosy: +oscarbenjamin

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18606
___



[issue18305] [patch] Fast sum() for non-numbers

2013-07-11 Thread Oscar Benjamin

Oscar Benjamin added the comment:

This optimisation is a semantic change. It breaks backward compatibility in 
cases where a = a + b and a += b do not result in the name a having the same 
value. In particular this breaks backward compatibility for numpy users.

Numpy arrays treat += differently from + in the sense that a += b coerces b to 
the same dtype as a and then adds in place whereas a + b uses Python style type 
promotion. This behaviour is by design and it is useful. It is also entirely 
appropriate (unlike e.g. summing lists) that someone would use sum() to add 
numpy arrays.

An example where + and += give different results:

>>> from numpy import array
>>> a1 = array([1, 2, 3], dtype=int)
>>> a1
array([1, 2, 3])
>>> a2 = array([.5, .5, .5], dtype=float)
>>> a2
array([ 0.5,  0.5,  0.5])
>>> a1 + a2
array([ 1.5,  2.5,  3.5])
>>> a1 += a2
>>> a1
array([1, 2, 3])

--
nosy: +oscarbenjamin

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18305
___



[issue12641] Remove -mno-cygwin from distutils

2013-07-11 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I'm attaching three new patches following on from Eric and Christian's
suggestions:

check_mno_cywin_py27_1.patch (for Python 2.7)
check_mno_cywin_py3_1.patch (for Python 3.2 and 3.3)
check_mno_cywin_py34_1.patch (for Python 3.4)

The py27 patch now uses os.popen to avoid importing subprocess as
suggested by Eric. The other two patches are changed to use
check_output as suggested by Christian (subprocess is already imported
in 3.x).

I've retested the patches using the same setup as before and the
results are unchanged for all gcc and Python versions tested.

--
Added file: http://bugs.python.org/file30888/check_mno_cywin_py34_1.patch
Added file: http://bugs.python.org/file30889/check_mno_cywin_py3_1.patch
Added file: http://bugs.python.org/file30890/check_mno_cywin_py27_1.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___
diff -r 7aab60b70f90 Lib/distutils/cygwinccompiler.py
--- a/Lib/distutils/cygwinccompiler.py  Sun Jun 23 15:47:03 2013 -0700
+++ b/Lib/distutils/cygwinccompiler.py  Thu Jul 11 16:59:27 2013 +0100
@@ -48,13 +48,14 @@
 import os
 import sys
 import copy
-from subprocess import Popen, PIPE
+from subprocess import Popen, PIPE, check_output
 import re
 
 from distutils.ccompiler import gen_preprocess_options, gen_lib_options
 from distutils.unixccompiler import UnixCCompiler
 from distutils.file_util import write_file
-from distutils.errors import DistutilsExecError, CompileError, UnknownFileError
+from distutils.errors import (DistutilsExecError, CCompilerError,
+CompileError, UnknownFileError)
 from distutils import log
 from distutils.version import LooseVersion
 from distutils.spawn import find_executable
@@ -294,11 +295,15 @@
 else:
 entry_point = ''
 
-self.set_executables(compiler='gcc -mno-cygwin -O -Wall',
- compiler_so='gcc -mno-cygwin -mdll -O -Wall',
- compiler_cxx='g++ -mno-cygwin -O -Wall',
- linker_exe='gcc -mno-cygwin',
- linker_so='%s -mno-cygwin %s %s'
+if is_cygwingcc():
+raise CCompilerError(
+'Cygwin gcc cannot be used with --compiler=mingw32')
+
+self.set_executables(compiler='gcc -O -Wall',
+ compiler_so='gcc -mdll -O -Wall',
+ compiler_cxx='g++ -O -Wall',
+ linker_exe='gcc',
+ linker_so='%s %s %s'
 % (self.linker_dll, shared_option,
entry_point))
 # Maybe we should also append -mthreads, but then the finished
@@ -393,3 +398,8 @@
 
 commands = ['gcc -dumpversion', 'ld -v', 'dllwrap --version']
 return tuple([_find_exe_version(cmd) for cmd in commands])
+
+def is_cygwingcc():
+'''Try to determine if the gcc that would be used is from cygwin.'''
+out_string = check_output(['gcc', '-dumpmachine'])
+return out_string.strip().endswith(b'cygwin')
diff -r 0762f2419494 Lib/distutils/cygwinccompiler.py
--- a/Lib/distutils/cygwinccompiler.py  Sun Jun 23 23:51:44 2013 +0200
+++ b/Lib/distutils/cygwinccompiler.py  Thu Jul 11 17:05:05 2013 +0100
@@ -48,7 +48,7 @@
 import os
 import sys
 import copy
-from subprocess import Popen, PIPE
+from subprocess import Popen, PIPE, check_output
 import re
 
 from distutils.ccompiler import gen_preprocess_options, gen_lib_options
@@ -294,13 +294,18 @@
 else:
 entry_point = ''
 
-self.set_executables(compiler='gcc -mno-cygwin -O -Wall',
- compiler_so='gcc -mno-cygwin -mdll -O -Wall',
- compiler_cxx='g++ -mno-cygwin -O -Wall',
- linker_exe='gcc -mno-cygwin',
- linker_so='%s -mno-cygwin %s %s'
-% (self.linker_dll, shared_option,
-   entry_point))
+if self.gcc_version < '4' or is_cygwingcc():
+no_cygwin = ' -mno-cygwin'
+else:
+no_cygwin = ''
+
+self.set_executables(compiler='gcc%s -O -Wall' % no_cygwin,
+ compiler_so='gcc%s -mdll -O -Wall' % no_cygwin,
+ compiler_cxx='g++%s -O -Wall' % no_cygwin,
+ linker_exe='gcc%s' % no_cygwin,
+ linker_so='%s%s %s %s'
+% (self.linker_dll, no_cygwin,
+   shared_option, entry_point))
 # Maybe we should also append -mthreads, but then the finished
 # dlls need another dll (mingwm10.dll see Mingw32 docs)
 # (-mthreads: Support thread-safe exception handling on `Mingw32')
@@ -393,3 +398,8

[issue12641] Remove -mno-cygwin from distutils

2013-07-09 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 9 July 2013 16:25, Christian Heimes rep...@bugs.python.org wrote:

> The is_cygwingcc() function can be simplified a lot with
> subprocess.check_output().

My initial thought was to do that but then I based it on
_find_exe_version which for whatever reason uses Popen directly [1].
I'm happy to make that change and retest the patches although I can't
do it right now.

Can someone first accept or reject the general idea of the patches
though? I'm happy to answer any questions about them but it takes time
to get the diffs right and test against all compilers and Python
versions and I don't really want to do it if the patches will just be
rejected.

Also I may soon lose access to the machine that I used to write and
test these patches. If it is desired for me to change and retest them
it may not be possible after two weeks or so.

[1] 
http://hg.python.org/cpython/file/3f3cbfd52f94/Lib/distutils/cygwinccompiler.py#l368

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue12641] Remove -mno-cygwin from distutils

2013-07-09 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 9 July 2013 17:36, Éric Araujo rep...@bugs.python.org wrote:

> Don’t forget that distutils is used during CPython’s build process to compile
> extension modules: subprocess may not be importable then.

Subprocess is imported at at the top of the module in 3.x [1]. The
whole distutils.cygwinccompiler module is an ImportError if subprocess
is not importable.

Or did you mean for 2.7 only (where get_versions() uses os.popen)?

[1] 
http://hg.python.org/cpython/file/3f3cbfd52f94/Lib/distutils/cygwinccompiler.py#l51

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue12641] Remove -mno-cygwin from distutils

2013-06-25 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I'm attaching one more patch check_mno_cywin_py34.patch. This is my
preferred patch for Python 3.4 (default). It fixes building with MinGW
and removes all support for using Cygwin gcc with --compiler=mingw32.
The user would see the following error message:

'''
Q:\current\testing\hello>testbuild q:\tools\cygwin\bin -3.3
running build_ext
error: Cygwin gcc cannot be used with --compiler=mingw32
'''

I think that this is reasonable as '-mno-cygwin' is a previously
experimental and now long deprecated, discouraged and discontinued
feature of Cygwin's gcc. Removing support for it in future Pythons
would make problems involving MinGW build (like this one) much easier
to solve in future: there would be no need to consider anything other
than the behaviour of MinGW's gcc.

--
Added file: http://bugs.python.org/file30698/check_mno_cywin_py34.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___
diff -r 7aab60b70f90 Lib/distutils/cygwinccompiler.py
--- a/Lib/distutils/cygwinccompiler.py  Sun Jun 23 15:47:03 2013 -0700
+++ b/Lib/distutils/cygwinccompiler.py  Tue Jun 25 11:38:05 2013 +0100
@@ -54,7 +54,8 @@
 from distutils.ccompiler import gen_preprocess_options, gen_lib_options
 from distutils.unixccompiler import UnixCCompiler
 from distutils.file_util import write_file
-from distutils.errors import DistutilsExecError, CompileError, UnknownFileError
+from distutils.errors import (DistutilsExecError, CCompilerError,
+CompileError, UnknownFileError)
 from distutils import log
 from distutils.version import LooseVersion
 from distutils.spawn import find_executable
@@ -294,11 +295,15 @@
 else:
 entry_point = ''
 
-self.set_executables(compiler='gcc -mno-cygwin -O -Wall',
- compiler_so='gcc -mno-cygwin -mdll -O -Wall',
- compiler_cxx='g++ -mno-cygwin -O -Wall',
- linker_exe='gcc -mno-cygwin',
- linker_so='%s -mno-cygwin %s %s'
+if is_cygwingcc():
+raise CCompilerError(
+'Cygwin gcc cannot be used with --compiler=mingw32')
+
+self.set_executables(compiler='gcc -O -Wall',
+ compiler_so='gcc -mdll -O -Wall',
+ compiler_cxx='g++ -O -Wall',
+ linker_exe='gcc',
+ linker_so='%s %s %s'
 % (self.linker_dll, shared_option,
entry_point))
 # Maybe we should also append -mthreads, but then the finished
@@ -393,3 +398,15 @@
 
 commands = ['gcc -dumpversion', 'ld -v', 'dllwrap --version']
 return tuple([_find_exe_version(cmd) for cmd in commands])
+
+def is_cygwingcc():
+'''Try to determine if the gcc that would be used is from cygwin.'''
+from subprocess import Popen, PIPE
+out = Popen(['gcc', '-dumpmachine'], shell=True, stdout=PIPE).stdout
+try:
+out_string = out.read()
+finally:
+out.close()
+# out_string is the target triplet cpu-vendor-os
+# Cygwin's gcc sets the os to 'cygwin'
+return out_string.decode('ascii').strip().endswith('cygwin')



[issue12641] Remove -mno-cygwin from distutils

2013-06-24 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 24 June 2013 09:07, Marc-Andre Lemburg rep...@bugs.python.org wrote:

> Could someone perhaps produce a single final patch file which can
> be applied to Python 2.7 and 3.2+ ?

I've attached two patches check_mno_cywin_py27.patch for Python 2.7
and check_mno_cywin_py3.patch for Python 3.2 and 3.3. The changes
are identical but the 2.7 patch didn't apply cleanly against 3.x. I'll
upload the files used to test the patches in test_mno_cygwin.tar.gz.

The patches are as I described previously and check the output of 'gcc
-dumpmachine' to see if the gcc on PATH is from cygwin. With the patch
'-mno-cygwin' will be passed if gcc version < 4 or the gcc is from
cygwin. Otherwise it will not be passed.

I've tested with versions:
Python 2.7.5, 3.2.5 and 3.3.2
MinGW gcc 4.7.2
Cygwin gcc 3.4.4 and 4.5.3

The results of the patch are the same for all versions of Python tested:
Cygwin gcc 3.x - still works
Cygwin gcc 4.x - still doesn't work (same error message)
MinGW gcc 4.7  - fixed after the patch

This patch does not attempt to add support for the newer (gcc 4.x)
Cygwin cross-compilers. I have experimented with what it would take to
have those work and it is something like:

if is_cygwingcc() and version >= 4:
    platform = platform_map[get_platform()]
    # use platform + '-pc-cygwin-gcc' as gcc
    # use platform + '-pc-cygwin-g++' as g++
    # etc.

Then there would also need to modifications to the linker settings to
fix the problem that Martin mentioned (a long way above) that it would
link against the wrong MSVC runtime. I started writing the patch to do
these things as well as fix MinGW support and it became more and more
of a mess. I don't think that distutils should be trying to guess
whether or not people intended to use the Cygwin cross-compilers. If
these are to be supported then they should have a new
--compiler=cygwin-cross and a separate subclass of CygwinCCompiler to
avoid more issues like this one arising in the future.

Oscar

--
Added file: http://bugs.python.org/file30681/check_mno_cywin_py27.patch
Added file: http://bugs.python.org/file30682/check_mno_cywin_py3.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___
diff -r a7db9f505e88 Lib/distutils/cygwinccompiler.py
--- a/Lib/distutils/cygwinccompiler.py  Sun Jun 23 16:12:32 2013 -0400
+++ b/Lib/distutils/cygwinccompiler.py  Mon Jun 24 12:03:15 2013 +0100
@@ -319,13 +319,18 @@
 else:
 entry_point = ''
 
-self.set_executables(compiler='gcc -mno-cygwin -O -Wall',
- compiler_so='gcc -mno-cygwin -mdll -O -Wall',
- compiler_cxx='g++ -mno-cygwin -O -Wall',
- linker_exe='gcc -mno-cygwin',
- linker_so='%s -mno-cygwin %s %s'
-% (self.linker_dll, shared_option,
-   entry_point))
+if self.gcc_version < '4' or is_cygwingcc():
+no_cygwin = ' -mno-cygwin'
+else:
+no_cygwin = ''
+
+self.set_executables(compiler='gcc%s -O -Wall' % no_cygwin,
+ compiler_so='gcc%s -mdll -O -Wall' % no_cygwin,
+ compiler_cxx='g++%s -O -Wall' % no_cygwin,
+ linker_exe='gcc%s' % no_cygwin,
+ linker_so='%s%s %s %s'
+% (self.linker_dll, no_cygwin,
+   shared_option, entry_point))
 # Maybe we should also append -mthreads, but then the finished
 # dlls need another dll (mingwm10.dll see Mingw32 docs)
 # (-mthreads: Support thread-safe exception handling on `Mingw32')
@@ -447,3 +452,16 @@
 else:
 dllwrap_version = None
 return (gcc_version, ld_version, dllwrap_version)
+
+
+def is_cygwingcc():
+'''Try to determine if the gcc that would be used is from cygwin.'''
+from subprocess import Popen, PIPE
+out = Popen(['gcc', '-dumpmachine'], shell=True, stdout=PIPE).stdout
+try:
+out_string = out.read()
+finally:
+out.close()
+# out_string is the target triplet cpu-vendor-os
+# Cygwin's gcc sets the os to 'cygwin'
+return out_string.strip().endswith('cygwin')
diff -r b9b521efeba3 Lib/distutils/cygwinccompiler.py
--- a/Lib/distutils/cygwinccompiler.py  Sat May 18 17:56:42 2013 +0200
+++ b/Lib/distutils/cygwinccompiler.py  Mon Jun 24 12:20:07 2013 +0100
@@ -291,13 +291,18 @@
 else:
 entry_point = ''
 
-self.set_executables(compiler='gcc -mno-cygwin -O -Wall',
- compiler_so='gcc -mno-cygwin -mdll -O -Wall',
- compiler_cxx='g++ -mno-cygwin -O -Wall',
- linker_exe='gcc -mno-cygwin

[issue12641] Remove -mno-cygwin from distutils

2013-06-24 Thread Oscar Benjamin

Changes by Oscar Benjamin oscar.j.benja...@gmail.com:


Added file: http://bugs.python.org/file30683/test_mno_cygwin.tar.gz

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue12641] Remove -mno-cygwin from distutils

2013-06-24 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 24 June 2013 12:53, Oscar Benjamin rep...@bugs.python.org wrote:
> The changes
> are identical but the 2.7 patch didn't apply cleanly against 3.x. I'll
> upload the files used to test the patches in test_mno_cygwin.tar.gz.

Correction: the patches are not quite identical as the py3 patch
decodes the output of the subprocess as ascii.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue18129] Fatal Python error: Cannot recover from stack overflow.

2013-06-03 Thread Oscar Benjamin

New submission from Oscar Benjamin:

This is from a thread on python-list that started here:
http://mail.python.org/pipermail/python-list/2013-May/647895.html

There are situations in which the Python 3.2 and 3.3 interpreters crash with 
"Fatal Python error: Cannot recover from stack overflow."
when I believe the correct response is a RuntimeError (as happens in 2.7). I've 
attached a file crash.py that demonstrates the problem.

The following gives the same behaviour in 2.7, 3.2 and 3.3:

$ cat tmp.py
def loop():
    loop()

loop()

$ py -3.2 tmp.py
Traceback (most recent call last):
  File "tmp.py", line 4, in <module>
    loop()
  File "tmp.py", line 2, in loop
    loop()
  File "tmp.py", line 2, in loop
    loop()
  File "tmp.py", line 2, in loop
    loop()
  File "tmp.py", line 2, in loop
...

However the following leads to a RuntimeError in 2.7 but different
fatal stack overflow errors in 3.2 and 3.3 (tested on Windows XP using 32-bit 
python.org installers):

$ cat tmp.py
def loop():
    try:
        (lambda: None)()
    except RuntimeError:
        pass
    loop()

loop()

$ py -2.7 tmp.py
Traceback (most recent call last):
  File "tmp.py", line 8, in <module>
    loop()
  File "tmp.py", line 6, in loop
    loop()
  File "tmp.py", line 6, in loop
    loop()
  File "tmp.py", line 6, in loop
    loop()
  File "tmp.py", line 6, in loop
...
RuntimeError: maximum recursion depth exceeded

$ py -3.2 tmp.py
Fatal Python error: Cannot recover from stack overflow.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

$ py -3.3 tmp.py
Fatal Python error: Cannot recover from stack overflow.

Current thread 0x05c4:
  File "tmp.py", line 3 in loop
  File "tmp.py", line 6 in loop
  File "tmp.py", line 6 in loop
  File "tmp.py", line 6 in loop
  File "tmp.py", line 6 in loop
  File "tmp.py", line 6 in loop
  File "tmp.py", line 6 in loop

Also tested on stock Python 3.2.3 on Ubuntu (2.7 gives RuntimeError):

$ python3 tmp.py 
Fatal Python error: Cannot recover from stack overflow.
Aborted (core dumped)


I would expect this to give "RuntimeError: maximum recursion depth
exceeded" in all cases.


Oscar

--
components: Interpreter Core
files: crash.py
messages: 190568
nosy: oscarbenjamin
priority: normal
severity: normal
status: open
title: Fatal Python error: Cannot recover from stack overflow.
type: crash
versions: Python 3.2, Python 3.3
Added file: http://bugs.python.org/file30458/crash.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18129
___



[issue12641] Remove -mno-cygwin from distutils

2013-05-25 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 25 May 2013 04:43, Renato Silva rep...@bugs.python.org wrote:

> Renato Silva added the comment:
>
> Hi Oscar! Sorry, I just meant to correct this information: in gcc 4.x it
> produces an error preventing build. Even if it doesn't do anything useful,
> still GCC 4.4 does accept that option normally. If MinGW didn't touch
> anything relevant, then what Cygwin folks said about 4.x [1] clearly did not
> come to reality.

In context it should be clear that the statement "in gcc 4.x it
produces an error preventing build" refers to Cygwin's gcc and not
MinGW's. Which gcc are you referring to?

>> No the developer does not confirm that the -mno-cygwin option is
>> required for MinGW.
>
> Not for MinGW, but for building Pidgin. I have just checked it, and
> -mno-cygwin actually is no longer necessary since 2.10.7 [1], but it was at
> the time of that message. Even though it didn't do anything meaningful, a GCC
> like 4.6 would cause build to fail.

Yes gcc 4.6 would fail because it won't accept the -mno-cygwin option.
That does not mean that any other MinGW gcc ever *required* the
-mno-cygwin option for anything. The MinGW devs have repeatedly and
explicitly stated that the -mno-cygwin option never did anything
useful when used with MinGW:

http://permalink.gmane.org/gmane.comp.gnu.mingw.user/42097
http://permalink.gmane.org/gmane.comp.gnu.mingw.user/42101
http://permalink.gmane.org/gmane.comp.gnu.mingw.user/42104

>> Also from what I've seen I would say that the error message that
>> the OP shows there comes from Cygwin's gcc not MinGW.
>
> No, you can use either Cygwin or MinGW MSYS as environment, but the compiler
> must be MinGW [2].

Yes but that particular error message is coming from Cygwin's gcc not
MinGW. As stated by the Pidgin dev in that message the OP does not
know which compiler they are using:
http://pidgin.im/pipermail/support/2011-December/011159.html

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue12641] Remove -mno-cygwin from distutils

2013-05-24 Thread Oscar Benjamin

Oscar Benjamin added the comment:

> Renato Silva added the comment:
>
> I must note that GCC 4.x *does* support -mno-cygwin, at least until 4.4,
> and at least the MinGW version.

MinGW has never supported the -mno-cygwin option. It has simply tolerated
it. The option never did anything useful and at some point it became an
error to even supply it. I'm not sure exactly when but some time after 4.4
sounds reasonable to me.

The option was only ever meaningful in cygwin's gcc 3.x and was always an
error in 4.x.

> I have used it myself for building Pidgin under Windows, which requires
> that option. See [1] where a Pidgin developer confirms that.


No, the developer does not confirm that the -mno-cygwin option is required
for MinGW. Also from what I've seen I would say that the error message that
the OP shows there comes from Cygwin's gcc not MinGW.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue12641] Remove -mno-cygwin from distutils

2013-05-23 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I have written a function that can be used to determine if the gcc
that distutils will use is from Cygwin or MinGW:

from subprocess import Popen, PIPE

def is_cygwingcc():
    '''Try to determine if the gcc that would be used is from cygwin.'''
    out = Popen(['gcc', '-dumpmachine'], shell=True, stdout=PIPE).stdout
    try:
        out_string = out.read()
    finally:
        out.close()
    # out_string is the target triplet cpu-vendor-os
    # Cygwin's gcc sets the os to 'cygwin'
    return out_string.strip().endswith('cygwin')

The idea is that 'gcc -dumpmachine' emits a string that always ends in
'cygwin' for the Cygwin gcc (please let me know if I'm wrong about
that). Earnie Boyd at mingw-users described this method for
distinguishing MinGW and Cygwin gcc as not being a bad idea:
http://permalink.gmane.org/gmane.comp.gnu.mingw.user/42137

With this the Mingw32CCompiler.__init__ method can be modified to do:

if self.gcc_version < '4' or is_cygwingcc():
    no_cygwin = ' -mno-cygwin'
else:
    no_cygwin = ''

self.set_executables(compiler='gcc%s -O -Wall' % no_cygwin,
                     compiler_so='gcc%s -mdll -O -Wall' % no_cygwin,
                     compiler_cxx='g++%s -O -Wall' % no_cygwin,
                     linker_exe='gcc%s' % no_cygwin,
                     linker_so='%s%s %s %s'
                               % (self.linker_dll, no_cygwin,
                                  shared_option, entry_point))

This will fix the problem for MinGW, should not break existing
no-cygwin/gcc 3.x setups and preserves the error message currently
seen for no-cygwin with gcc 4.x. In other words it should satisfy
users in all three groups A, B and C referred to above. In particular
the is_cygwingcc() function hopefully addresses Martin's concern for
users in group C.

Is this approach acceptable?

Thanks,
Oscar

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue12641] Remove -mno-cygwin from distutils

2013-05-22 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 22 May 2013 12:43, Martin v. Löwis rep...@bugs.python.org wrote:
> On 21.05.13 23:14, Oscar Benjamin wrote:
>> More generally I think that compiling non-cygwin extensions with
>> cygwin gcc should be altogether deprecated (for Python 3.4 at least).
>> It should be discouraged in the docs and unsupported in the future.
>
> I agree with that,

Excellent.

> although I find it sad that the Cygwin project
> apparently abandoned support for building Mingw binaries.

I don't understand their reasoning but given the scorn poured on to
-mno-cygwin from at least some people I trust that they had some good
reason :)

Also they have replaced it with something that they consider more
appropriate (the cross-compilers).

>> It can only work with -mno-cygwin
>
> This is factually incorrect. It also works with the i686-pc-mingw32-gcc
> executable, which (IIUC) is still available for Cygwin.

I should have been slightly clearer. It can only currently work in
distutils with -mno-cygwin. The executable you refer to is part of
cygwin gcc's cross-compiler toolchain. This is their recommended
replacement for -mno-cygwin (if not mingw) but is AFAICT unsupported
by distutils.

I think there's a case for saying that distutils should support these
but it should only be done with a new UnixCCompiler subclass and a new
--compiler entry point. It should also perhaps provide a way to
specify the --host since I think that facility is part of the purpose
of the new toolchain.

In any case cygwin cross-compiler support should not be conflated in
the codebase with distutils' mingw support and if it is to be added
that should be discussed in a separate issue. I personally don't think
I would use it and would not push for the support to be added.

Going back to the group C users: I think that it should be possible to
create an is_cygwingcc() function that would parse the output of 'gcc
--version'. Then Mingw32CCompiler.__init__ could do:

if is_cygwingcc() and self.gcc_version >= '4':
    raise RuntimeError('No-cygwin mode only works with gcc-3. '
                       'Use gcc-3 or mingw')

The is_cygwingcc() function can be conservative since false positives
are more of a problem than false negatives. I think this should address
your concern.

However on further reflection I'm a little reluctant to force an error
if I can't *prove* that the setup is broken. I'm no stranger to
monkey-patching distutils and it's possible that someone has already
monkey-patched it to make some bizarre setup just about work. I would
be a little peeved if my setup broke in a bugfix release simply
because someone else who didn't understand it decided that it wasn't
viable. (The same monkey-patching concerns apply to the other changes
but I think that fixing the non-monkey-patched setup for mingw trumps
in that case.) So perhaps the best place to deal with the
gcc-4/no-cygwin issue is in the distutils docs.

My updated proposal is (I'll write patches if this is acceptable):

Python 3.4:
Remove '-mno-cygwin'. This breaks the no-cygwin mode and fixes the
mingw mode. The distutils docs are updated with something like:
'''
Note: Previous Python versions supported another 'no-cygwin' mode that
could use cygwin gcc to build extensions without a dependency on
cygwin.dll. This is no longer supported.

New in Python 3.4: No-cygwin mode is no longer supported.
'''

Python 2.7, 3.2 and 3.3:
Only use '-mno-cygwin' if self.gcc_version  '4'. This should not
break any currently functioning setups (barring serious
monkey-patching). The distutils docs are updated with something like:
'''
Note: The no-cygwin mode only works with cygwin's gcc-3. For gcc-4 it
may produce .pyd files with dependencies on cygwin.dll that are not
fully redistributable. The use of no-cygwin mode is deprecated by
cygwin and support for it is removed in Python 3.4.
'''

If you would rather have the is_cygwingcc() check I'm happy to put
that in also if it gets this issue moving but I'm personally cautious
about it.

Thanks,
Oscar

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue12641] Remove -mno-cygwin from distutils

2013-05-22 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 22 May 2013 13:40, Oscar Benjamin rep...@bugs.python.org wrote:

> However on further reflection I'm a little reluctant to force an error
> if I can't *prove* that the setup is broken.

After a little more reflection I realise that we could just do:

if self.gcc_version < '4' or is_cygwingcc():
    # use -mno-cygwin

This way the cygwin/gcc-4 error is still emitted only if gcc emits it.
If the is_cygwingcc() function is conservative then there could be
cases where it mistakenly does not use -mno-cygwin but that would have
to be a broken cygwin/gcc-4 setup anyway.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue12641] Remove -mno-cygwin from distutils

2013-05-21 Thread Oscar Benjamin

Oscar Benjamin added the comment:

I'd really like to get a resolution on this issue so I've tried to gather some 
more information about this problem by asking some questions in the mingw-users 
mailing list. The resulting thread can be found here:
http://comments.gmane.org/gmane.comp.gnu.mingw.user/42092

This issue concerns users of distutils --compiler=mingw32 mode. Normally this 
is used to build with mingw gcc in which case it is currently broken because of 
-mno-cygwin. It has been suggested above that some may be using it to build 
with cygwin's gcc. The -mno-cygwin option is there for those using cygwin's gcc.

To summarise the points (see the thread on mingw-users for more info):
1) -mno-cygwin has never had a meaningful effect in mingw.
2) -mno-cygwin produces an error in recent (~2011 onwards) mingw releases and 
will do for all future releases. This prevents distutils from building 
extensions with mingw.
3) -mno-cygwin only ever had a meaningful effect for cygwin's gcc 3.x where it 
could be used to build binaries that did not depend on cygwin.dll.
4) -mno-cygwin was always considered an experimental feature and its use is 
discouraged.
5) -mno-cygwin was removed from cygwin's gcc in the transition from 3.x to 4.x 
without any deprecation period (as it was an experimental feature). In gcc 4.x 
it produces an error preventing build.
6) The recommended way to replace the -mno-cygwin option is either to use mingw 
or to use cygwin's cross-compilers.

So there are two types of breakage affected by -mno-cygwin:

A: Anyone trying to use recent and future mingw versions to build extensions 
with distutils in the way that is described in the distutils docs. For this 
group distutils has been broken for 2 years and will continue to be until the 
-mno-cygwin option is removed.

B: Anyone who is using distutils with --compiler=mingw32 but using cygwin's gcc 
3.x instead of mingw's gcc to build an extension for a non-cygwin Python. For 
this group removing the -mno-cygwin option would result in unusable extension 
modules. (the resulting .pyd requires cygwin.dll but is to be used in a 
non-cygwin Python).

Firstly, note that users in group B must surely be a group that diminishes with 
time since they are using the legacy gcc 3.x cygwin compiler. Similarly, since 
neither of mingw or cygwin will ever bring back -mno-cygwin users in group A 
will only increase with time. (I am such a user and have been manually removing 
all reference to -mno-cygwin from distutils for 2 years now.) This means that 
the balance of breakage will only move towards group A over time.

Secondly, any users in group B will suffer the same problem as users in group A 
if they try to use gcc 4.x. However this has not been reported on the tracker 
(I read through all matches in a search for '-mno-cygwin'). I think this serves 
as an indication of how many people are actually using this setup.

Thirdly, the -mno-cygwin option is a now-abandoned, experimental feature of a 
legacy compiler. Its creators did not feel the need to give it a deprecation 
period and its use is discouraged by both mingw and cygwin.

Bringing these points together: not removing -mno-cygwin from distutils trades 
the possible breakage for possibly non-existent users of the obscure, legacy, 
and generally considered broken setup B against the definite, known breakage 
for users of the appropriate documented setup A. I think this should be enough 
to say that the fix for the next Python version should simply remove all 
reference to '-mno-cygwin' as I have been doing for 2 years now without 
problem. The only users who can be adversely affected by this are those in 
group B who decide to upgrade to non-cygwin Python 3.4 (while still using an 
ancient cygwin gcc to build extensions). The suggested fix can be either to use 
mingw or to setup the cross-compilers in their cygwin installation.

For Python 2.7, 3.2 and 3.3 I think that this should be considered a bug that 
can be fixed in a bugfix release. However in that case it may be considered 
inappropriate to risk the small possibility of users in group B experiencing 
breakage. Since such users must be using cygwin's gcc 3.x, I propose that 
distutils check the gcc version and only add '-mno-cygwin' if the major version 
is 3. This will not adversely affect users in group B and will fix the problem 
for users in group A. Users in group B who attempt to use gcc 4.x will find 
that they get a different error message (at import time instead of build time) 
but that their setup will still be just as broken as it was before this change.

Thanks,
Oscar

--
nosy: +oscarbenjamin

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___



[issue12641] Remove -mno-cygwin from distutils

2013-05-21 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 21 May 2013 17:21, Martin v. Löwis rep...@bugs.python.org wrote:

>> C: Users who have only cygwin gcc 4.x installed
>
> For those, the current setup will produce an error message, essentially
> telling them that they need to fix something (specifically: edit distutils,
> install mingw). With the proposed change, --compiler=mingw32 will produce a
> binary, but the binary will incorrectly depend on cygwin. They may not notice
> on their local system (since cygwin.dll is available), but only on customer
> systems.

Well there cannot be anyone in group C who currently has a functioning
setup. But I agree that it's better to have a good error message. It
may be possible to check in some way that the gcc used is from cygwin
and add an error message specifically for this case. I'll have a look
at this when I'm next on Windows.

More generally I think that compiling non-cygwin extensions with
cygwin gcc should be altogether deprecated (for Python 3.4 at least).
It should be discouraged in the docs and unsupported in the future. It
can only work with -mno-cygwin which in turn only works with gcc 3.x,
has never been documented as being a stable gcc feature, is now abandoned
and is referred to disparagingly on both the mingw and cygwin mailing
lists:

To quote Dave Korn from cygwin http://cygwin.com/ml/cygwin/2009-03/msg00802.html
'''
  gcc-3 -mno-cygwin still works just as well (or badly!) as it ever has done,
and will be retained for ever.

  gcc-4 series releases will not support it at all.  As the whole thing is
(still) experimental and explicitly warned to be unstable I don't see the need
to go for a deprecation period.
'''

Or Earnie Boyd from mingw-users
http://permalink.gmane.org/gmane.comp.gnu.mingw.user/42111
'''
On Mon, May 20, 2013 at 9:13 AM, Paul Moore wrote:
> So building an extension using --compiler=mingw in Python could pick up a
> cygwin gcc if that was on PATH, and this will work as long as -mno-cygwin is
> passed on the command line. But it won't work (it will build a DLL with a
> dependency on the cygwin DLL) if -mno-cygwin is omitted. I'd argue that
> people should just install and use mingw rather than cygwin, but that may
> not be what everyone does in practice.

No!!! The -mno-cygwin abomination is dead.  If you want to build a
native Python using Cygwin you would do it the cross compiler way and
state the --host you're configuring for.  Python's distutil needs to
remove the -mno-cygwin option.
'''

However no-cygwin mode is currently a documented feature:
http://docs.python.org/3.4/install/index.html#gnu-c-cygwin-mingw

So it can't simply be deprecated in already released Pythons but I do
want to fix the mingw bug there if possible. The suggestion to make
-mno-cygwin conditional on gcc major version may lead to some users
who attempt to use a setup that did not previously work not seeing the
appropriate error message. However it does, I believe, come with the
following two guarantees:

   1) Mingw setups that are used, wanted and currently broken will be fixed.
   2) No currently functional setups will be broken.

That may be the best that is possible given the tight constraints on
changes to distutils.

> That said: which of Roumen's patches (if any) would you recommend for
> inclusion?

None. I may have misread them but my impression is that they are not
particularly intended to be used as individual patches. I can't see
one that just makes the relevant changes and collectively they make up
a more pervasive change than I was proposing.

The patch that I was proposing for 3.4 would simply remove -mno-cygwin
on these 5 lines:
http://hg.python.org/cpython/file/7fce9186accb/Lib/distutils/cygwinccompiler.py#l322

For 2.7, 3.2 and 3.3 I would do the same but conditional on self.gcc_version.

I think Roumen has identified many different issues but I would try
and keep it focussed on the one -mno-cygwin issue.

Oscar

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12641
___