[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-11-11 Thread Zac Hatfield-Dodds


Zac Hatfield-Dodds  added the comment:

I've recently had [1] reported, which makes https://bugs.python.org/issue45738 
the *third* parser bug [2] that Hypothesmith caught after release, and the 
second in a stable version - so I'll be maintaining a workaround for some time.

I remain happy to help run these tests upstream where they would impact fewer 
people.


[1] https://github.com/Zac-HD/hypothesmith/issues/16 and 
https://github.com/psf/black/pull/2592#issuecomment-966745240
[2] following https://bugs.python.org/issue42218 and 
https://bugs.python.org/issue40661




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-11-08 Thread Aaron Meurer


Change by Aaron Meurer :


--
nosy: +asmeurer




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-27 Thread Brandt Bucher


Change by Brandt Bucher :


--
nosy: +brandtbucher




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-25 Thread Terry J. Reedy


Terry J. Reedy  added the comment:

In order to understand what Paul is concretely proposing, I read and partly 
reviewed his PR.  I thought about maintainability.

Part A adds test.support.hypothesis_helper, similar in purpose to other 
specialized xyz_helper modules.  Hypothesis_helper imports either hypothesis 
itself or a replacement 'as hypothesis'.  The replacement is currently called 
'_hypothesis_stub', but its name is irrelevant as it is normally not seen by 
the user.

Part B adds a second zoneinfo test module, one that uses hypothesis.  Besides 
further testing zoneinfo, it serves to test hypothesis_helper and partly 
justify its addition.  It starts with 'from hypothesis_helper import 
hypothesis'.  The new test module has tests such as:

@hypothesis.given(
    dt=hypothesis.strategies.one_of(
        hypothesis.strategies.datetimes(), hypothesis.strategies.times()
    )
)
@hypothesis.example(dt=datetime.datetime.min)
@hypothesis.example(dt=datetime.datetime.max)
@hypothesis.example(dt=datetime.datetime(1970, 1, 1))
@hypothesis.example(dt=datetime.datetime(2039, 1, 1))
@hypothesis.example(dt=datetime.time(0))
@hypothesis.example(dt=datetime.time(12, 0))
@hypothesis.example(dt=datetime.time(23, 59, 59, 99))
def test_utc(self, dt):
    zi = self.klass("UTC")
    dt_zi = dt.replace(tzinfo=zi)

    self.assertEqual(dt_zi.utcoffset(), ZERO)
    self.assertEqual(dt_zi.dst(), ZERO)
    self.assertEqual(dt_zi.tzname(), "UTC")

@given always (Paul?) runs the examples as subtests.  When the replacement is 
imported, no randomized examples are added.

If some year an example were to fail, could a hypothesis-ignorant but 
zoneinfo-knowing person deal with the failure?  I believe so, knowing that the 
above is equivalent to a test of the form "for dt in dt_list:\n  with 
self.subTest(...):\n ...", as sketched below.
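
For reference, the hand-written equivalent would look roughly like this (my 
sketch, reusing the ZERO constant and self.klass attribute from the surrounding 
test class; dt_list is a made-up name):

```
def test_utc_preset(self):
    dt_list = [
        datetime.datetime.min, datetime.datetime.max,
        datetime.datetime(1970, 1, 1), datetime.datetime(2039, 1, 1),
        datetime.time(0), datetime.time(12, 0), datetime.time(23, 59, 59, 99),
    ]
    for dt in dt_list:
        with self.subTest(dt=dt):
            zi = self.klass("UTC")
            dt_zi = dt.replace(tzinfo=zi)
            self.assertEqual(dt_zi.utcoffset(), ZERO)
            self.assertEqual(dt_zi.dst(), ZERO)
            self.assertEqual(dt_zi.tzname(), "UTC")
```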

Why not require such a rewriting?  Some reasons against: rewriting by hand can 
lead to errors; test_utc would have to be split into test_utc_ran(dom) and 
test_utc_pre(set); code would have to be duplicated unless factored into a 
third function; and the for and with statements together add two indents, 
which sometimes squeezes asserts onto more lines.  I believe that when 
hypothesis is present, there are advantages to including preset examples 
alongside given-generated examples.

Paul would like the PR to be a 'camel's nose' in the CPython tent.  This cuts 
both ways.  I think this PR should be judged on its own merits.  Opening 
possibilities can be a merit, as long as it is not seen as pre-approval.  Our 
CI testing and buildbots are already configured to blame and annoy the wrong 
people for random failures.  I don't want more of that unless failure notices 
are sent to someone responsible for the failing code.

--
versions: +Python 3.9




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-25 Thread Paul Ganssle

Paul Ganssle  added the comment:

> I use hypothesis during development, but don't have a need for it in the 
> standard library.  By the time code lands there, we normally have a specific 
> idea of what edge cases need to be in the tests.

The suggestion I've made here is that we use @example decorators to take the 
hypothesis tests you would have written already and turn them into what are 
essentially parameterized tests. For anyone who doesn't want to explicitly run 
the hypothesis test suite, the tests you are apparently already writing would 
simply turn into normal tests for just the edge cases.

One added benefit of keeping the tests around in the form of property tests is 
that you can run these same tests through hypothesis to find regressions in 
bugfixes that are implemented after landing (e.g. "Oh we can add a fast path 
here", which introduces a new edge case). The segfault bug from bpo-34454, for 
example, would have been found if I had been able to carry over the 
hypothesis-based tests I was using during the initial implementation of 
fromisoformat into later stages of the development. (Either I didn't run it 
long enough to hit that particular edge case, or the edge case didn't come up 
until after I had moved the development locus into the CPython repo; I'm not 
sure which.)

Another benefit of keeping them around is that they become fuzz targets, 
meaning people like oss-fuzz or anyone who wants to throw some fuzzing 
resources at CPython have an existing body of tests that are expected to pass 
on *any* input, to find especially obscure bugs.

> For the most part, hypothesis has not turned up anything useful for the 
> standard library.  Most of the reports that we've gotten reflected a 
> misunderstanding by the person running hypothesis rather than an actual bug. 
> [...]

I don't really think it's a good argument to say that it hasn't turned up 
useful bugs. Most of the bugs in a bit of code will be found during development 
or during the early stages of adoption, and we have very wide adoption. I've 
found a number of bugs in zoneinfo using hypothesis tests, and I'd love to 
continue using them in CPython rather than throwing them away or maintaining 
them in a separate repo.

I also think it is very useful for us to write tests about the properties of 
our systems for re-use in PyPy (which does use hypothesis, by the way) and 
other implementations of Python. This kind of "define the contract and maintain 
tests to enforce that" is very helpful for alternate implementations.

> For numeric code, hypothesis is difficult to use and requires many 
> restrictions on the bounds of variables and error tolerances.  [...]

I do not think that we need to make hypothesis tests mandatory. They can be 
used when someone finds them useful.

> The main area where hypothesis seems easy to use and gives some comfort is in 
> simple roundtrips:  assert zlib.decompress(zlib.compress(s)) == s.  However, 
> that is only a small fraction of our test cases.

Even if this were the only time that hypothesis were useful (I don't think it 
is), some of these round-trips can be among the trickiest and most important 
code to test, even if it's a small fraction of the tests. We have a bunch of 
functions that are basically "load this file format" and "dump this file 
format", usually implemented in C, which are a magnet for CVEs and often the 
target of fuzz testing for that reason. Having a small library of maintained 
tests for round-tripping file formats seems like it would be very useful for 
people who want to donate compute time to fuzz-test CPython (or other 
implementations!).
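
As a concrete illustration, a round-trip property test of this kind might look 
like the following sketch (the specific @example values are just edge cases I 
picked for illustration, not anything from the PR):

```
import unittest
import zlib

from hypothesis import example, given, strategies as st

class ZlibRoundTripTest(unittest.TestCase):
    @given(s=st.binary())
    @example(s=b"")                    # explicit edge cases kept as @example()s,
    @example(s=b"\x00" * 100_000)      # so a stub-mode run still exercises them
    def test_roundtrip(self, s):
        self.assertEqual(zlib.decompress(zlib.compress(s)), s)
```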

> Speed is another issue.  During development, it doesn't matter much if 
> Hypothesis takes a lot of time exercising one function.  But in the standard 
> library, tests already run slowly enough to impact development.  If hypothesis 
> were to run every time we run the test suite, it would make the situation worse.

As mentioned in the initial ticket, the current plan I'm suggesting is to have 
fallback stubs which turn your property tests into parameterized tests when 
hypothesis is not installed. If we're good about adding `@example` decorators 
(and certainly doing so is easier than writing new ad hoc tests for every edge 
case we can think of when we already have property tests written!), then I 
don't see any particular reason to run the full test suite against a full 
hypothesis run on every CI run.

My suggestion is:

1. By default, run hypothesis in "stubs" mode, where the property tests are 
simply parameterized tests.
2. Have one or two CI jobs that runs *only* the hypothesis tests, generating 
new examples — since this is just for edge case detection, it doesn't 
necessarily need to run on every combination of architecture and platform and 
configuration in our testing matrix, just the ones where it could plausibly 
make a difference.
3. Ask users who are adding new hypothesis tests or who find new bugs to add 

[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-24 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Here's my two cents worth on the subject.  

* I use hypothesis during development, but don't have a need for it in the 
standard library.  By the time code lands there, we normally have a specific 
idea of what edge cases need to be in the tests.

* For the most part, hypothesis has not turned up anything useful for the 
standard library.  Most of the reports that we've gotten reflected a 
misunderstanding by the person running hypothesis rather than an actual bug.  
For example, in colorsys, an issue was reported about the color conversions not 
round-tripping, but that is expected behavior for out-of-gamut conversions.  
Also, the matrices used are dictated by standards and do not give exact 
inverses even for in-gamut conversions.

* For numeric code, hypothesis is difficult to use and requires many 
restrictions on the bounds of variables and error tolerances.  For example, 
when I have students write a function to evaluate the quadratic formula and 
test it with hypothesis, they struggle mightily (overflowing the discriminant, 
having 2*a close to zero, deciding whether the roots are sufficiently close, 
etc.).

* The main area where hypothesis seems easy to use and gives some comfort is in 
simple roundtrips:  assert zlib.decompress(zlib.compress(s)) == s.  However, 
that is only a small fraction of our test cases.

* Speed is another issue.  During development, it doesn't matter much if 
Hypothesis takes a lot of time exercising one function.  But in the standard 
library, tests already run slowly enough to impact development.  If hypothesis 
were to run every time we run the test suite, it would make the situation worse.

* There is also a learning curve issue.  We're adding more and more things that 
newcomers have to learn before they can contribute (how to run blurb, how to 
use the argument clinic, etc.).  Using Hypothesis is a learned skill and takes 
time.

TL;DR  I really like Hypothesis but think it doesn't belong in standard library 
tests.

--
nosy: +rhettinger




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-24 Thread Guido van Rossum


Guido van Rossum  added the comment:

I withdraw my objection against Hypothesis per se. Others should debate
whether the 3rd party dependency is okay.

--Guido (mobile)




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-21 Thread Guido van Rossum


Guido van Rossum  added the comment:

Having read the parts of the paper that explain shrinking, things are making 
more sense now. I have also successfully contributed type annotations to 
minithesis! :-)




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-19 Thread Guido van Rossum


Guido van Rossum  added the comment:

It is important to me, because I don't like having a dependency on
third-party code that is a black box. Remember that in the end it's not me
you have to convince but the SC.




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-19 Thread Paul Ganssle


Paul Ganssle  added the comment:

I do not want to dissuade you from figuring out how minithesis / hypothesis 
works (far from it), but I'm wondering if the question of how shrinking works 
is germane to the issue at hand, which is whether or not hypothesis / 
property-based-testing is suitable for testing the standard library.

I almost don't think it matters *if* shrinking works, much less *how*. 
Shrinking is something of a UI nicety in the sense that when hypothesis finds 
an example that violates a property, it will try to take whatever example it 
has chosen and find something like a "minimal working example" to show to the 
end user. So if we have something like:

```
@given(x=strategies.integers())
def f(x):
    assert x >= 0
```

Hypothesis will presumably tell us that `x == -1` violates the property 
instead of `-24948929` or some other random value. But even if it did report 
`-24948929`, it wouldn't be too hard to see "oh, right, integers can be 
negative". In some cases with the time zone stuff, there isn't a good metric 
for complexity at the moment, so time zones are sorted by their IANA key 
(which is to say, basically arbitrarily), and I generally find the results 
useful anyway. (Admittedly, the shrinking is still helpful: most problems 
affecting all zones will return Africa/Abuja, whereas things particular to a 
specific zone's odd time zone history will return whichever zone with that 
quirk comes alphabetically first.)

Anyway, do not let me disrupt your process, I just thought it might be worth 
making the point that some of these specific details might be nice to know, but 
don't seem like they should be blockers for hypothesis' adoption in the 
standard library.




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-19 Thread Guido van Rossum


Guido van Rossum  added the comment:

Few things incite me quite as much as being told that I'm not supposed to 
understand something, so now I'm actually making an effort to follow the ECOOP 
paper. Honestly, it's not too bad (a lot more readable than type theory papers 
:-). I'll get back to you once I'm more comfortable in mapping between that and 
minithesis.py.




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-19 Thread Zac Hatfield-Dodds


Zac Hatfield-Dodds  added the comment:

> Okay, well, I'm trying to understand minithesis.py, but I am despairing. The 
> shrinking code (really everything in class TestingState) is too much to grok. 
> ... I suppose I have to read the paper at 
> https://drmaciver.github.io/papers/reduction-via-generation-preview.pdf ... I 
> am still unclear on some basics too, but presumably it's all related.

Shrinking is probably not important to understand, since it's *not* 
particularly related to anything but implementation details.  The abstract of 
that paper explains why the design is important - integrated shrinking means we 
can report minimal (easier-to-understand) failing examples with zero user 
effort, and guarantee that all shrunk examples could have been generated.

TestingState is, roughly, the `@given()` decorator internals: responsible for 
deciding how many times to run the wrapped test function, and how.  Note the 
three phases (generate, target, shrink) - the distribution of examples can 
depend on the behaviour of the test function on earlier inputs.  This is _more_ 
complex in Hypothesis and thus challenging to document.

The important concept is that input examples are generated by a Possibility (ie 
Strategy), which internally makes a sequence of choices; and that sequence is 
sourced from a maybe-empty known prefix (e.g. replaying failures, mutating 
previous inputs) with the suffix determined by a PRNG.


I feel like I'm explaining this badly, but there isn't a simple underlying 
thing to explain - Hypothesis works well because the user API is relatively 
simple, and then under the hood we have this many-layered stack of 
implementation details.  The synergies and coverage of otherwise-pathological 
cases this delivers are excellent, but "multiple years and PhDs worth of 
detailed synergistic work" is just intrinsically complicated.


> PS. What does "shortlex" mean?

Sort by length, and among equal-length strings sort in the normal 
(alphabetical) order.  For example: A < XY < ABC, etc.
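
As an ordinary Python sort key (just an illustration, not Hypothesis code):

```
words = ["ABC", "A", "XY", "AB"]
print(sorted(words, key=lambda s: (len(s), s)))  # ['A', 'AB', 'XY', 'ABC']
```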




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-18 Thread Guido van Rossum


Guido van Rossum  added the comment:

Okay, well, I'm trying to understand minithesis.py, but I am despairing. The 
shrinking code (really everything in class TestingState) is too much to grok. 
You can tell from the amount of comments that this is the tour-de-force of the 
code, but the comments don't really help -- they seem to be mostly notes for 
people trying to reimplement this in another language. (Plus some snide remarks 
about QuickCheck.)

I suppose I have to read the paper at 
https://drmaciver.github.io/papers/reduction-via-generation-preview.pdf ... At 
least that has some drawings.

I am still unclear on some basics too, but presumably it's all related.

PS. What does "shortlex" mean?




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-14 Thread Paul Ganssle


Paul Ganssle  added the comment:

@Terry

> The problem with random input tests is not that they are 'flaky', but that 
> they are useless unless someone is going to pay attention to failures and try 
> to find the cause.  This touches on the difference between regression testing 
> and bug-finding tests.  CPython CI is the former, and marred at that by buggy 
> randomly failing tests.

> My conclusion: bug testing would likely be a good idea, but should be done 
> separate from the CI test suite.  Such testing should only be done for 
> modules with an active maintainer who would welcome failure reports.

Are you saying that random input tests are flaky, but that that is not the big 
problem? In my experience using hypothesis, in practice it is not the case that 
you get tests that fail randomly. The majority of the time, if your code does 
violate one of the properties, the tests fail the first time you run the test 
suite (this is particularly true for strategies where hypothesis deliberately 
makes it more likely that you'll get a "nasty" input by biasing the random 
selection algorithm in that direction). In a smaller number of cases, I see 
failures that happen on the second, third or fourth run.

That said, if it were a concern that every run of the tests is using different 
inputs (and thus you might see a bug that only appears once in every 20 runs), 
it is possible to run hypothesis in a configuration where you specify the seed, 
making it so that hypothesis always runs the same set of inputs for the same 
tests. We can disable that on a separate non-CI run for hypothesis "fuzzing" 
that would run the test suite for longer (or indefinitely) looking for 
long-tail violations of these properties.
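
For instance, Hypothesis's settings profiles could be used to keep CI runs 
deterministic while reserving a longer randomized run for a dedicated job (a 
sketch of one possible configuration, not a concrete proposal; the profile 
names are made up):

```
from hypothesis import settings

# Deterministic, bounded run to keep CI stable and fast.
settings.register_profile("ci", derandomize=True, max_examples=50)
# A longer randomized run for a dedicated fuzzing job or buildbot.
settings.register_profile("fuzz", max_examples=10_000)

# e.g. in the test harness: settings.load_profile("ci")
```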

I feel that if we don't at least run some form of the hypothesis tests in CI, 
there will likely be bit rot and the tests will decay in usefulness. Consider 
the case where someone accidentally breaks an edge case that makes it so that 
`json.loads(json.dumps(o))` no longer works for some obscure value of `o`. With 
hypothesis tests running in CI, we are MUCH more likely to find this bug / 
regression during the initial PR that would break the edge case than if we run 
it separately and report it later. If we run the hypothesis tests in a 
build-bot, the process would be:

1. Contributor makes PR with passing CI.
2. Core dev review passes, PR is merged.
3. Buildbot run occurs and the buildbot watch is notified.
4. Buildbot maintainers track down the PR responsible and either file a new bug 
or comment on the old bug.
5. Someone makes a NEW PR adding a regression test and the fix for the old PR.
6. Core dev review passes, second PR is merged.

If we run it in CI, the process would be:

1. Contributor makes PR, CI breaks.
2. If the contributor doesn't notice the broken CI, core dev points it out and 
it is fixed (or the PR is scrapped as unworkable).

Note that in the non-CI process, we need TWO core dev reviews, we need TWO PRs 
(people are not always super motivated to fix bugs that don't affect them, even 
ones they caused while fixing a bug that does affect them), and we need time 
and effort from the buildbot maintainers (note that the same applies even if 
the "buildbot" is actually a separate process run by Zac out of a github repo).

Even if the bug only appears in one out of every 4 CI runs, it's highly likely 
that it will be found and fixed before it makes it into production, or at least 
much more quickly, considering that most PRs go through a few edit cycles, and 
a good fraction of them are backported to 2-3 branches, all with separate CI 
runs. It's a much quicker feedback loop.

I think there's an argument to be made that incorporating more third-party 
libraries (in general) into our CI build might cause headaches, but I think 
that is not a problem specific to hypothesis, and I think it's one where we can 
find a reasonable balance that allows us to use hypothesis in one form or 
another in the standard library.




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-14 Thread Terry J. Reedy


Terry J. Reedy  added the comment:

This issue confuses two different testing issues.  The first is selection of 
test data before the test versus generation while testing (usually 'randomly').  
The second is how the function is tested.  All tests test some property of the 
inputs and function, but it is useful to distinguish between testing that a 
function produces a specific expected output versus testing something more 
abstract or defined in terms of the inputs.

Example 1: 2 + 3 = 5 (a + b = (known) c) versus 2 + 3 = 3 + 2 (a + b = b + a) 
and (2 + 3) + 4 = 2 + (3 + 4).

Example 2: sorted([3,5,1,4]) = [1,3,4,5] versus is_sorted(sorted([random 
list])).

The latter are what hypothesis people mean by 'property tests'.  Such property 
tests can be done with either fixed or random pairs in unittest module tests.  
The test that a module compiles is a weak property test, but better than 
nothing.  A test that all lines of code run at least once without an 
unexpected exception is a stronger property test.  I consider it roughly on a 
par with min(s) <= mean(s) <= max(s).
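
For example, a property test with fixed inputs and no third-party library (a 
small sketch of the idea, not code from any existing test file):

```
import unittest
from collections import Counter

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

class SortedPropertyTest(unittest.TestCase):
    def test_sorted_properties(self):
        for data in ([3, 5, 1, 4], [], [2, 2, 1], list(range(10, 0, -1))):
            with self.subTest(data=data):
                out = sorted(data)
                self.assertTrue(is_sorted(out))                # output is ordered
                self.assertEqual(Counter(out), Counter(data))  # same elements
```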

The connection between randomized input testing, such as with hypothesis, and 
property tests is that with random inputs, one cannot test f(input) against an 
'expected' value unless one has an alternate means of producing the expected 
value.  It is often easier to test a parameterized equation or other properties.

The problem with random input tests is not that they are 'flaky', but that 
they are useless unless someone is going to pay attention to failures and try 
to find the cause.  This touches on the difference between regression testing 
and bug-finding tests.  CPython CI is the former, and marred at that by buggy 
randomly failing tests.

My conclusion: bug testing would likely be a good idea, but should be done 
separate from the CI test suite.  Such testing should only be done for modules 
with an active maintainer who would welcome failure reports.




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-13 Thread Guido van Rossum


Guido van Rossum  added the comment:

Thanks, I will look into that before stirring up more dust.

--
nosy: +Guido.van.Rossum




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-13 Thread Zac Hatfield-Dodds


Zac Hatfield-Dodds  added the comment:

(I think structuring this as a high-level explanation is clearer than 
point-by-point replies - it covers most of them, but the logic is hopefully 
easier to follow)


The core idea of Hypothesis is that making random choices is equivalent to 
parsing a random bitstring:
- wrap your test function in a decorator which calls it many times with 
random inputs
- describe those inputs with strategies (which are parser combinators)
- the bytes parsed by strategies can be provided by a Random() instance, 
saved to and loaded from a database, edited and (maybe) re-parsed into related 
or simpler inputs, etc.
Minithesis [1] is a complete implementation of this core, including strategies, 
shrinking, and the database, in a few hundred lines of Python; a rough sketch 
of the choice-sequence idea follows below.  I think reading Minithesis will 
answer most of your implementation questions.
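
To make that concrete, here is a rough sketch of the choice-sequence idea (my 
own illustration, not code from Minithesis or Hypothesis):

```
import random

class Choices:
    """A test case is a replayable sequence of integer choices."""
    def __init__(self, prefix=(), rnd=None):
        self.prefix = list(prefix)   # replayed choices, e.g. a saved failure
        self.rnd = rnd or random.Random()
        self.log = []                # everything drawn, for saving/shrinking

    def draw_int(self, lo, hi):
        value = self.prefix.pop(0) if self.prefix else self.rnd.randint(lo, hi)
        self.log.append(value)
        return value

def list_of_ints(choices):
    # A "strategy" is just a parser over the choice sequence.
    return [choices.draw_int(-100, 100) for _ in range(choices.draw_int(0, 5))]

c = Choices()
value = list_of_ints(c)
assert list_of_ints(Choices(prefix=c.log)) == value  # replaying reproduces it
```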

Hypothesis is not the *only* property-based testing library for Python (there's 
pytest-quickcheck), but in the same sense that `six` is not the only py2/py3 
compatibility layer.  I attribute this to a combination of a better underlying 
model [2], and brute implementation effort to provide great error messages, 
integration with popular libraries, lots of strategies, related tooling, and so 
on.

[1] https://github.com/DRMacIver/minithesis
See also https://hypothesis.works/articles/how-hypothesis-works/ for an 
older writeup; the implementation details are a little outdated but it explains 
the motivations better.
[2] https://hypothesis.works/articles/types-and-properties/



So a "strategy" is an object (implementation detail: a parser combinator, 
instance of SearchStrategy), which tells Hypothesis how to convert a random 
bytestring into whatever object you've asked for.  This is a little unusual, 
but I haven't found a better way which still makes it easy to mix the strategy 
primitives with arbitrary user code such as "pick a username, ensure it's in 
the database, and then pass it to my test method".

I share your distaste for "Haskell in Python"; I don't think Hypothesis 
introduces more of that flavor than Numpy would introduce Fortran idioms - 
there's occasionally a resemblance, but they're both Python-first libraries and 
I'd attribute it mostly to the problem being solved.  FWIW Hypothesis 
deliberately rejected most of the design choices of QuickCheck, and 
implementation-wise has more in common with unixy fuzzing tools like AFL.

The strategies in our API can generally be divided into useful primitives (e.g. 
integers, floats, lists, tuples, sampled_from, one_of) and pre-composed 
strategies (booleans = sampled_from, dictionaries = map+lists-of-kv-tuples, 
builds = tuples+fixed_dictionaries+map).
Very specific strategies like emails() are included because they're a common 
need, and a naive implementation usually omits exactly the weirder and 
error-prone features that might reveal bugs.
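
For instance, the pre-composed strategies can be thought of as built from the 
primitives roughly like this (an illustration of the relationship, not 
Hypothesis's actual definitions):

```
from hypothesis import strategies as st

booleans = st.sampled_from([False, True])
dictionaries = st.lists(
    st.tuples(st.text(), st.integers()),   # key/value pairs...
    unique_by=lambda kv: kv[0],            # ...with distinct keys...
).map(dict)                                # ...mapped into a dict
```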

We _are_ deliberately vague about the order in which we generate test data, 
because the distribution is full of heuristics and adapts itself to the 
behaviour of the test - including over multiple runs, thanks to the database.  
The exact details are also considered an implementation detail, and subject to 
change as we discover heuristics and sampling techniques which find bugs faster.

The upside here is that writing effective strategies is straightforward:
- expressing more possible (valid) inputs is good, because e.g. if the bug 
triggers on NaN we'd have to generate some NaNs to detect it; and
- limiting use of rejection sampling (hypothesis.assume(), the .filter() 
method) is an obvious performance improvement - if only 20% of the inputs we 
try to generate are valid, the test will take five times longer.
Following that, Hypothesis' backend (the "conjecture" engine) uses a large 
helping of heuristics and fancy code to ensure that we get a nice diverse 
sample from the input space.  Unfortunately there's a direct tension between 
high performance and explainable behaviour; Minithesis is readable mostly by 
virtue of not handling the hard edge cases.
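
The .filter() point in practice (a small illustration, not library internals):

```
from hypothesis import strategies as st

# Rejection sampling: draws that fail the predicate are discarded and retried.
evens_by_filter = st.integers().filter(lambda n: n % 2 == 0)

# Constructing valid values directly avoids the rejections entirely.
evens_by_map = st.integers().map(lambda n: 2 * n)
```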



I'll take my Hypothesmith [3] project for random Python programs as an example 
of a complex strategy.  It was inspired by CSmith [4] - but is considerably 
less sophisticated.  CSmith constructs semantically valid C programs; 
Hypothesmith (for now!) constructs a subset of syntactically-valid Python.  The 
basic idea is to "invert" the grammar: start with the root node, recursively 
construct a valid parse tree, and then serialise it out to a parsable input 
string.  Doing a better job would require more time than I had for a 
proof-of-concept, but no new insights; and the same applies to defining a 
strategy for semantically-valid code.  Being based on the grammar, I'd guess 
it's unlikely to find corner cases in the grammar itself, but it might do so in 
the parser, or via e.g. `(tree := ast.parse(code)) == ast.parse(ast.unparse(tree))`.

Your distrust of the python_programs() strategy 

[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2021-05-13 Thread Guido van Rossum


Guido van Rossum  added the comment:

I would like to have a more thorough discussion about the desirability of using 
Hypothesis first, since I feel that there is a rather hard "sell" going on. I 
brought this up in the SC tracker 
(https://github.com/python/steering-council/issues/65) but I don't want to have 
the discussion there.

Copying some quoted text from there:

[me]
> > Can we perhaps have an open discussion on the merits of Hypothesis itself 
> > on python-dev before committing to this?

[Paul]
> You mean hypothesis specifically or property-based testing in general? I 
> think that if we add property-based testing, hypothesis is probably the only 
> natural choice. Of course, I can make this case on the thread if that was 
> your intention.

I don't even know what property-based testing means (for Python) other than 
Hypothesis. What other frameworks support that? In any case, nobody has been 
promoting anything else, but Zac has been on our case for over a year. :-)

[me]
> > It seems to promote a coding style that's more functional than is good for 
> > Python, and its inner workings seem too magical to me. Also the 
> > decorator-based DSL looks pretty inscrutable to me.

[Paul]
> Do you mean it's promoting a coding style within the tests, or in the public 
> API? When designing zoneinfo I didn't think about how easy it would be to 
> test with Hypothesis at all as part of the API and it was basically 
> frictionless.

I meant in the public API. I don't doubt that for zoneinfo this worked well, 
but that doesn't prove it's any better than any other framework (maybe you 
would have been ecstatic if you could have used pytest as well :-). 
Fundamentally, I am a bit suspicious of Hypothesis' origins, Haskell, which is 
a language and community that just have a different approach to programming 
than Python. Don't get me wrong, I don't think there's anything wrong with 
Haskell (except that I've noticed it's particularly popular with the IQ >= 150 
crowd :-). It's just different than Python, and just as we don't want to 
encourage writing "Fortran or Java in Python", I don't think it's a good idea 
to recommend writing "Haskell in Python".

> I think there are situations where it makes sense to write functional-style 
> strategies because it can in some situations give hypothesis more information 
> about how the strategies are transformed (and thus allow it to optimize how 
> the data is drawn rather than drawing from the full set and discarding a 
> bunch of stuff that doesn't match, or doing a bunch of extra logic on each 
> draw), but it's by no means required. I'd say most of my "complex strategies" 
> are decorated functions rather than chains of maps and filters.

Ah, here we get to the crux of the matter. What's a "strategy"? Maybe I would 
be more inclined to support Hypothesis if it were clearer what it does. I like 
libraries that do stuff for me that I know how to do myself. With Hypothesis, 
the deeper I get into the docs, the more I get the impression that there is 
deep magic going on that a poor sap like myself isn't supposed to understand. 
When I write @given(st.lists(st.integers())) what does that do? I haven't the 
foggiest idea, and from the docs I get the impression I'm not supposed to worry 
about it, and *that* worries me. Clearly it can't be generating all lists of 0 
integers followed by all lists of 1 integer followed by all lists of 2 integers 
... So in what order does it generate test data? The docs are pretty vague 
about that.

(I'm not playing dumb here. I really have no idea how to go about that, and it 
seems that this is at the heart of Hypothesis and its Haskell ancestors. 
Presumably PhD theses went into making this sort of thing work. Perhaps more 
effort could be expended explaining this part to laypeople?)

> From the stdlib-property-tests repo, this seems like the most complicated 
> strategy I'm seeing employed, and I find it fairly simple to understand, and 
> I don't know how easy to read any code would be that generates trees of 
> arbitrary depth containing objects with specific properties.

I have never tried that sort of thing, so I'll take your word for it.

But there are other strategies. It seems that a lot of the strategies are 
answers to a series of questions, along the lines of "how do I generate valid 
URLs"; and then "email addresses"; and then "IP addresses"; and then suddenly 
we find a strategy that generates *functions*, and then we get "recursive 
things", and it keeps going. Timezone keys?! No wonder you found it easy to 
use. :-)

I think I've heard that there's a strategy somewhere that generates random 
Python programs. How would that even work? Would it help me find corner cases 
in the grammar? I suppose we could actually use something like that for the 
"faster CPython" project, to validate that the optimizer doesn't break things. 
But how? Do I just write

@given(st.python_programs(), st.values())
def test_optimizer(func, 

[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2020-10-21 Thread Paul Ganssle


Change by Paul Ganssle :


--
keywords: +patch
pull_requests: +21805
pull_request: https://github.com/python/cpython/pull/22863




[issue42109] Use hypothesis for testing the standard library, falling back to stubs

2020-10-21 Thread Paul Ganssle


New submission from Paul Ganssle :

Following up on this year's language summit, I would like to propose that we 
start integrating property tests into the standard library. Zac Hatfield-Dodds 
(along with myself and Carl Friedrich Bolz-Tereick) have put together this 
repository of tests that we run against the standard library as is: 
https://github.com/Zac-HD/stdlib-property-tests

Here is the blog post covering the proposal from the language summit: 
https://pyfound.blogspot.com/2020/05/property-based-testing-for-python.html

The biggest challenges here are logistical:

1. Pulling in third party dependencies is difficult to do reliably on CI, but 
checking in hypothesis's entire dependency tree is probably not feasible.
2. We don't necessarily want to require developers running their tests locally 
to have to set up a hypothesis environment just to run the tests.
3. Hypothesis tests are not (by default) deterministic, which some are 
concerned may lead to flakiness by themselves.


To allay these concerns, I propose that we implement a compatibility interface 
for hypothesis that uses the third party module when it's installed, but 
otherwise falls back to stubs. The way I see the stubs working is that `@given` 
*without* `@example`s would simply skip the test. If you specify `@given` and 
one or more `@example`s, the test falls back to a simple parameterized test 
when hypothesis is not available.
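
A rough sketch of how such stubs might behave (my illustration of the described 
behaviour, not the actual code proposed in the PR):

```
import functools
import unittest

def example(**kwargs):
    def accumulate(test):
        # Remember each @example so the @given stub can replay them as subtests.
        test._examples = getattr(test, "_examples", []) + [kwargs]
        return test
    return accumulate

def given(**_strategies):
    def wrap(test):
        @functools.wraps(test)
        def wrapper(self):
            examples = getattr(test, "_examples", None)
            if not examples:
                raise unittest.SkipTest("hypothesis is not installed")
            for kwargs in examples:
                with self.subTest(**kwargs):
                    test(self, **kwargs)
        return wrapper
    return wrap
```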

At least at first, we won't attempt to add a mandatory PR job that runs with 
hypothesis installed. Instead, I'd like to run either an optional job on PR or 
have one or more buildbots that run the hypothesis tests.

I would also like to suggest a policy of including at least one example in each 
property test, so that on PR at least some of the inputs are tested.

--
assignee: p-ganssle
components: Tests
messages: 379226
nosy: Zac Hatfield-Dodds, p-ganssle, terry.reedy
priority: normal
severity: normal
stage: patch review
status: open
title: Use hypothesis for testing the standard library, falling back to stubs
type: enhancement
versions: Python 3.10
