[issue41972] bytes.find consistently hangs in a particular scenario

2021-07-19 Thread Łukasz Langa
Łukasz Langa added the comment: Looks like this can be closed now, the original issue is fixed, the original patch is merged for 3.10, and the improved patch is merged for 3.11. Thanks! ✨  ✨ -- resolution: -> fixed stage: patch review -> resolved status: open -> closed

[issue41972] bytes.find consistently hangs in a particular scenario

2021-07-19 Thread Łukasz Langa
Łukasz Langa added the comment: New changeset d01dceb88b2ca6def8a2284e4c90f89a4a27823f by Dennis Sweeney in branch 'main': bpo-41972: Tweak fastsearch.h string search algorithms (GH-27091) https://github.com/python/cpython/commit/d01dceb88b2ca6def8a2284e4c90f89a4a27823f --

[issue41972] bytes.find consistently hangs in a particular scenario

2021-07-19 Thread Łukasz Langa
Łukasz Langa added the comment: I checked the original example in this issue and the newest change in GH-27091 makes the `data.find(base)` case 8.2% faster compared to `main` while the `data.find(longer)` case is 10.8% faster. -- nosy: +lukasz.langa

[issue41972] bytes.find consistently hangs in a particular scenario

2021-07-11 Thread Dennis Sweeney
Dennis Sweeney added the comment: In "Fast String Searching" (1991), A. Hume and D. Sunday emphasize that most of the runtime of string-search algorithms occurs in a code path that skips over immediately-disqualified alignments. As such, that paper recommends extracting a hot "Horspool"

[issue41972] bytes.find consistently hangs in a particular scenario

2021-07-11 Thread Dennis Sweeney
Change by Dennis Sweeney : -- pull_requests: +25639 pull_request: https://github.com/python/cpython/pull/27091 ___ Python tracker ___

[issue41972] bytes.find consistently hangs in a particular scenario

2021-02-28 Thread Dennis Sweeney
Change by Dennis Sweeney : -- pull_requests: +23457 pull_request: https://github.com/python/cpython/pull/24672 ___ Python tracker ___

[issue41972] bytes.find consistently hangs in a particular scenario

2021-02-28 Thread Tim Peters
Tim Peters added the comment: New changeset 73a85c4e1da42db28e3de57c868d24a089b8d277 by Dennis Sweeney in branch 'master': bpo-41972: Use the two-way algorithm for string searching (GH-22904) https://github.com/python/cpython/commit/73a85c4e1da42db28e3de57c868d24a089b8d277 --

[issue41972] bytes.find consistently hangs in a particular scenario

2021-02-27 Thread Tim Peters
Tim Peters added the comment: I'm very sorry for not keeping up with this - my health has been poor, and I just haven't been able to make enough time. Last time I looked to a non-trivial depth, I was quite happy, and just quibbling about possible tradeoffs. I can't honestly commit to doing

[issue41972] bytes.find consistently hangs in a particular scenario

2021-02-27 Thread Guido van Rossum
Guido van Rossum added the comment: If Tim approves we might get it into alpha 6 which goes out Monday. -- ___ Python tracker ___

[issue41972] bytes.find consistently hangs in a particular scenario

2021-02-27 Thread Dennis Sweeney
Dennis Sweeney added the comment: Any chance PR 22904 can make it into 3.10 before the May 03 feature freeze? The text document in that PR has some notes on how the algorithm works, so that may be a good place to start reviewing if anyone is interested. --

[issue41972] bytes.find consistently hangs in a particular scenario

2021-02-26 Thread Carl Friedrich Bolz-Tereick
Carl Friedrich Bolz-Tereick added the comment: > BTW, this initialization in the FASTSEARCH code appears to me to be a > mistake: >skip = mlast - 1; Thanks for pointing that out Tim! Turns out PyPy had copied that mindlessly and I just fixed it. (I'm also generally following along

[issue41972] bytes.find consistently hangs in a particular scenario

2021-01-17 Thread Dennis Sweeney
Dennis Sweeney added the comment: PR 22904 now adds a text document explaining how the Two-Way algorithm works in much more detail. I was looking at more benchmarking results, and I came to a couple of conclusions about cutoffs. There's a consistent benefit to using the two-way algorithm

[issue41972] bytes.find consistently hangs in a particular scenario

2020-12-12 Thread Dennis Sweeney
Change by Dennis Sweeney : Added file: https://bugs.python.org/file49674/twoway_demo.py ___ Python tracker ___ ___ Python-bugs-list mailing

[issue41972] bytes.find consistently hangs in a particular scenario

2020-12-12 Thread Dennis Sweeney
Change by Dennis Sweeney : Removed file: https://bugs.python.org/file49672/twoway_demo.py ___ Python tracker ___ ___ Python-bugs-list

[issue41972] bytes.find consistently hangs in a particular scenario

2020-12-12 Thread Dennis Sweeney
Dennis Sweeney added the comment: For convenience, attached is a quick and dirty Tkinter GUI that lets you step through the Crochemore/Perrin Algorithm on your choice of inputs, just for play/discovery. A good illustration of the memory for periodic needles can be found by testing:

[issue41972] bytes.find consistently hangs in a particular scenario

2020-11-09 Thread STINNER Victor
Change by STINNER Victor : -- nosy: -vstinner ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue41972] bytes.find consistently hangs in a particular scenario

2020-11-06 Thread Dennis Sweeney
Dennis Sweeney added the comment: > Toward that end it would be fine to use "very high" cutoffs, and save tuning > for later. This feels reasonable to me -- I changed the cutoff to the more cautious `if (m >= 100 && n - m >= 5000)`, where the averages are very consistently faster by my

[issue41972] bytes.find consistently hangs in a particular scenario

2020-11-06 Thread Tim Peters
Tim Peters added the comment: But that also suggests a different approach: start with the current code, but add introspection to switch to your enhancement of C if the current code is drifting out of linear-time territory. -- ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-11-06 Thread Tim Peters
Tim Peters added the comment: I'm sorry I haven't been able to give more time to this. I love what's been done, but am just overwhelmed :-( The main thing here is to end quadratic-time disasters, without doing significant damage in other cases. Toward that end it would be fine to use "very

[issue41972] bytes.find consistently hangs in a particular scenario

2020-11-06 Thread Dennis Sweeney
Dennis Sweeney added the comment: I used the following script, and the raw results are attached. import pyperf runner = pyperf.Runner() ms = [12, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536] ns = [2000, 3000, 4000, 6000, 8000,

[issue41972] bytes.find consistently hangs in a particular scenario

2020-11-06 Thread Dennis Sweeney
Dennis Sweeney added the comment: I am attempting to better understand the performance characteristics to determine where a cutoff should go. Attached is a colorful table of benchmarks of the existing algorithm to the PR with the cutoff changed to `if (1)` (always two-way) or `if (0)`

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-28 Thread Tal Einat
Tal Einat added the comment: I should also mention that I configured and compiled Python for each of the above-mentioned versions using the --enable-optimizations` flag. -- ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-28 Thread Tal Einat
Tal Einat added the comment: After spending quite a while setting up my dev machine for (hopefully) reliable benchmarking, I can say with a high level of certainty that the latest PR (GH-22904, commit fe9e9d9c1f1c5f98c797d19e2214d1413701f6de) runs stringbench.py significantly faster than

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-23 Thread Dennis Sweeney
Dennis Sweeney added the comment: Here are those zipf-distributed benchmarks for PR 22904: https://pastebin.com/raw/qBaMi2dm Ignoring differences of <5%, there are 33 cases that get slower, but 477 cases that got faster. Here's a stringbench.py run for PR 22904:

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-23 Thread Dennis Sweeney
Dennis Sweeney added the comment: This is the approach in the PR: # JUMP_BOTH while ...: if haystack[j + cut] != needle[cut]: # Sunday skip j += table[haystack[j + len(needle)]] continue j += rest_of_the_two_way_algorithm() If I

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-22 Thread Tim Peters
Tim Peters added the comment: Note that Sunday doesn't care (at all) where mismatches occur. The "natural" way to add Sunday: follow pure C-P unless/until it finds a mismatching position. Pure C-P then computes a specific shift. Nothing about that changes. But something is added: also

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-22 Thread Dennis Sweeney
Dennis Sweeney added the comment: I attached a new PR, with a lot of the same ideas. The major differences between this and the last PR: * The first character to be checked at each alignment is the first character of the right half, rather than the last. * If that first character does not

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-22 Thread Dennis Sweeney
Change by Dennis Sweeney : -- pull_requests: +21836 pull_request: https://github.com/python/cpython/pull/22904 ___ Python tracker ___

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-19 Thread Dennis Sweeney
Dennis Sweeney added the comment: Unfortunately due to licensing issues, it looks like I'll have to ditch the PR and start from scratch: it was too heavily based on the glibc implementation, which has a stricter license. It may be take a good deal of time before I have a potential

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-18 Thread Tim Peters
Tim Peters added the comment: I confess I _assumed_ all along that you were generalizing the current code's Sunday trick to 7-bit equivalence classes (up from 32 bits total) and 64K possible shift counts (up from just 2 total possibilities: 1 or len(needle)+1). The Sunday trick couldn't

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-18 Thread Dennis Sweeney
Dennis Sweeney added the comment: I posted the example thinking that having a concrete walkthrough might be good for discussion, and it looks like it was. ;-) This makes me curious how a simplified-but-not-as-simplified-as-the-status-quo Sunday algorithm would fare: using the Sunday

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-18 Thread Dennis Sweeney
Dennis Sweeney added the comment: > Dennis, I think that's expected, right? Two-way on its own can exploit > nothing about individual characters - it only preprocesses the needle to > break the possibility for quadratic-time behavior due to periods in the > needle. Yes, that's correct. >

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-18 Thread Tim Peters
Tim Peters added the comment: Dennis, I think that's expected, right? Two-way on its own can exploit nothing about individual characters - it only preprocesses the needle to break the possibility for quadratic-time behavior due to periods in the needle. It sounds like you switched the

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-18 Thread Dennis Sweeney
Dennis Sweeney added the comment: FWIW, one of the "# Made the spaces line up" is actually a "skip ahead by the needle length". -- ___ Python tracker ___

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-18 Thread Dennis Sweeney
Dennis Sweeney added the comment: Below is one of the tests that got run when I happened to import something, and I thought it was a good illustration of the Boyer-Moore bad character shift table. It's worth noting in particular that the table is the dominant force for speed in some common

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-17 Thread Tim Peters
Tim Peters added the comment: Yup, they act essentially the same, but yours jumps into the quicksand earlier ;-) I'm fond of this one: """ HUGE = 10**7 BIG = 10**6 bigxs = 'x' * BIG haystack = 'x' * HUGE needle = bigxs + 'y' + bigxs """ "The problem" then is in middle of the needle, not

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-17 Thread Guido van Rossum
Guido van Rossum added the comment: Right, so IIUC the *quadratic* portion of Dennis’ original reproducer, aBaBBB with pattern BBB, except with more cowbell :-) acts the same. -- ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-17 Thread Tim Peters
Tim Peters added the comment: I don't think we have any idea how the OP stumbled into this. Looks like it "just happened". The case you construted is quadratic-time, but not quite as bad: BaB BB Fails at once, because 'a' doesn't match the trailing needle 'B'. The Bloom

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-17 Thread Guido van Rossum
Guido van Rossum added the comment: This may be irrelevant at this point, but trying to understand the original reproducer, I wanted to add my 1c worth. It seems Dennis' reproducer.py is roughly this: (I'm renaming BIG//3 to K to simplify the math.) aBaBBB

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-17 Thread Dennis Sweeney
Dennis Sweeney added the comment: > But there _also_ seem to be real (but much smaller) benefits > for the "search backward" cases, which I don't recall seeing > when I tried it. Do you have a guess as to why? I did change `skip = mlast - 1;` to `skip = mlast;` as you had pointed out. So it

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-17 Thread Tim Peters
Tim Peters added the comment: When I ran stringbench yesterday (or the day before - don't remember), almost all the benefit seemed to come from the "late match, 100 characters" tests. Seems similar for your run. Here are your results for just that batch, interleaving the two runs to make

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-17 Thread Dennis Sweeney
Dennis Sweeney added the comment: I added the cutoff for strings >= 10 characters, and I converted the PR from a draft to "Ready to Review." When running stringbench.py before and after the PR, I get these results: Summary: Unicode Before: 81.82 Bytes Before: 92.62 Unicode

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-16 Thread Tim Peters
Tim Peters added the comment: BTW, in the old post of Fredrick's I linked to, he referred to a "stringbench.py" program he used for timings, but the link is dead. I was surprised to find that it still lives on, under the Tools directory. Or did - I'm on Windows now and its Python distro

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-16 Thread Tim Peters
Tim Peters added the comment: And changed the "Type" field to "performance", because speed is the only issue here. -- type: behavior -> performance ___ Python tracker ___

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-16 Thread Tim Peters
Tim Peters added the comment: Removed 3.8 from the Versions list. The code was functioning as designed, and the O(n*m) worst case bounds were always known to be possible, so there's no actual bug here. -- versions: -Python 3.8 ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-16 Thread Tim Peters
Tim Peters added the comment: I don't think Rabin-Karp is worth trying here. The PR is provably worst-case linear time, and the constant factor is already reasonably small. Its only real weakness I can see is that it can be significantly (but seemingly not dramatically) slower than the

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-16 Thread Gregory P. Smith
Gregory P. Smith added the comment: I assume a rolling hash is linear at best, even if you do add some skip ahead checks. -- ___ Python tracker ___

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-16 Thread Dennis Sweeney
Dennis Sweeney added the comment: I'm doing a couple more timing tests to try to understand exactly when the cutoff should be applied (based on some combination of needle and haystack lengths). Can the rolling hash algorithm be made to go sublinear like O(n/m)? It looked like it was pretty

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-16 Thread Gregory P. Smith
Gregory P. Smith added the comment: Another potential algorithm to consider in large needle situations is a Rabin-Karp rolling hash string search. If used, it's the kind of algorithm that I'd probably bail out to an alternate method on if a run of Rabin-Karp started having a high percentage

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-16 Thread Gregory P. Smith
Gregory P. Smith added the comment: FWIW I think your numbers look good, a small needle cut-off is likely a good idea. -- ___ Python tracker ___

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-14 Thread Dennis Sweeney
Dennis Sweeney added the comment: The most recent batch of commits added a jump table. Between master and PR 22679 now, there are 151 cases slower than master and 463 that faster than master. The slower cases are at most twice as slow, but the faster cases are often 10-20x faster. I could

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-14 Thread Tim Peters
Tim Peters added the comment: For completeness, a link to the python-dev thread about this: https://mail.python.org/archives/list/python-...@python.org/thread/ECAZN35JCEE67ZVYHALRXDP4FILGR53Y/#4IEDAS5QAHF53IV5G3MRWPQAYBIOCWJ5 -- ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Gregory P. Smith
Change by Gregory P. Smith : -- nosy: +gregory.p.smith ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Tim Peters
Tim Peters added the comment: > There's no discomfort at all to me if, e.g., it stored > 32-bit counts and is indexed by the last 6 bits of the > character. That's a measly 256 bytes in all. Or, for the same space, 16-bit counts indexed by the last 7 bits. Then there's no aliasing for 7-bit

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Tim Peters
Tim Peters added the comment: Dennis, I'm delighted that the timing harness pointed out an actual glitch, and that it was so (seemingly) straightforward to identify the algorithmic cause. This gives me increased confidence that this project can be pushed to adoption, and your name will be

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Dennis Sweeney
Dennis Sweeney added the comment: That test needle happened to end with a G and not have another G until much earlier. The status quo took advantage of that, but the PR only takes advantage of the skip value for a certain middle character. Perhaps it could do both. --

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Dennis Sweeney
Dennis Sweeney added the comment: @Tim I got this again for that benchmark: length=3442, value=ASXABCDHAB...: Mean +- std dev: 2.39 ms +- 0.01 ms Unfortunately not a ghost. -- ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Dennis Sweeney
Dennis Sweeney added the comment: Another algorithmic possibility: Instead of the bitset, we could have a stack-allocated uint8_t jump[32]; // maybe 64? Maybe uint16_t? It would say this: If the last character lined up in the haystack is congruent to i mod (1 << 8), then jump ahead by

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Tim Peters
Tim Peters added the comment: Dennis, would it be possible to isolate some of the cases with more extreme results and run them repeatedly under the same timing framework, as a test of how trustworthy the _framework_ is? From decades of bitter experience, most benchmarking efforts end up

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Dennis Sweeney
Dennis Sweeney added the comment: bench_table.txt gives my results (`ref` is Master, `change` is with PR 22679). The change gives 342 faster cases and 275 slower cases, and 9 cases with no change. I chose a random word of length 10**6 with a zipf character distribution for the haystack,

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Tim Peters
Tim Peters added the comment: > For a bunch of cases it's slower, for some others it's faster. I have scant real idea what you're doing, but in the output you showed 4 output lines are labelled "slower" but 18 are labelled "faster". What you wrote just above appears to say the reverse (I'd

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread STINNER Victor
STINNER Victor added the comment: I compared PR 22679 using the commit 77f0a23e7a9fb247101b9b14a060c4ba1c4b87a5 as the reference using random_bench.py. For a bunch of cases it's slower, for some others it's faster. My modified pyperf computes a geometric mean of: 0.70 (faster). $

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread Dong-hee Na
Change by Dong-hee Na : -- nosy: +corona10 ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread STINNER Victor
STINNER Victor added the comment: FYI after I saw bench_results.txt, I wrote a pyperf PR to add geometric mean, to more easily summarize a benchmark suite :-D https://github.com/psf/pyperf/pull/79 -- ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread STINNER Victor
STINNER Victor added the comment: random_bench.py has really weird timings. Example: (...) Run 15: 1 warmup, 3 values, 2^15 loops - warmup 1: 3.05 us (+504%) - value 1: 42.8 ns (-92%) - value 2: 1.18 us (+133%) - value 3: 293 ns (-42%) (...) Run 21: 1 warmup, 3 values, 2^15 loops - warmup 1:

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-13 Thread STINNER Victor
STINNER Victor added the comment: > I used random_bench.py to compare PR 22679 to Master, and the results are in > bench_results.txt. Results were varied. I suppose this depends on what cases > we want to optimize for. Lazy me: can you please use render results as a table? Use something

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-12 Thread Dennis Sweeney
Dennis Sweeney added the comment: I used random_bench.py to compare PR 22679 to Master, and the results are in bench_results.txt. Results were varied. I suppose this depends on what cases we want to optimize for. -- Added file: https://bugs.python.org/file49512/random_bench.py

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-12 Thread Dennis Sweeney
Dennis Sweeney added the comment: PR 22679 is a draft that does the two-way algorithm but also adds both of the tricks from Fredrik's implementation: a bit-set "bloom filter" and remembering the skip-distance between some pair of characters. -- Added file:

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-12 Thread Dennis Sweeney
Change by Dennis Sweeney : -- pull_requests: +21650 stage: -> patch review pull_request: https://github.com/python/cpython/pull/22679 ___ Python tracker ___

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-12 Thread Dennis Sweeney
Dennis Sweeney added the comment: Here is a C implementation of the two-way algorithm that should work as a drop-in replacement for Objects/stringlib/fastsearch.h. Benchmarking so far, it looks like it is a bit slower in a lot of cases. But it's also a bit faster in a some other cases and

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-11 Thread Tim Peters
Tim Peters added the comment: Impressive, Dennis! Nice work. FYI, on the OP's original test data, your fastsearch() completes each search in under 20 seconds using CPython, and in well under 1 second using PyPy. Unfortunately, that's so promising it can't just be dismissed offhandedly ;-)

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-11 Thread Dennis Sweeney
Dennis Sweeney added the comment: > Offhand do you know what the _best_ timing for two-way search is in a > pattern-not-found case? Looking at the glibc implementation, in the top-level "else" clause (for when the string isn't completely periodic), we have: period = MAX (suffix,

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-10 Thread Tim Peters
Tim Peters added the comment: Ya, the case for the diff is at best marginal. Note that while it may be theoretically provable that the extra test would make the worst cases slower, that's almost certainly not measurable. The extra test would almost never be executed in the worst cases: in

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-10 Thread Dennis Sweeney
Dennis Sweeney added the comment: I agree that skip could could do 1 better. --- > I don't know whether it "should be" applied I don't think I'm convinced: the second check fixes only the very specific case when s[len(p):].startswith(p). Perturbations of reproducer.py are still very slow

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-09 Thread Tim Peters
Tim Peters added the comment: The attached fastsearch.diff removes the speed hit in the original test case and in the constructed one. I don't know whether it "should be" applied, and really can't make time to dig into it. The rationale: when the last characters of the pattern string and

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-08 Thread Tim Peters
Tim Peters added the comment: BTW, this initialization in the FASTSEARCH code appears to me to be a mistake: skip = mlast - 1; That's "mistake" in the sense of "not quite what was intended, and so confusing", not in the sense of "leads to a wrong result". I believe `skip` should be

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-08 Thread Tim Peters
Tim Peters added the comment: Just FYI, the original test program did get the right answer for the second search on my box - after about 3 1/2 hours :-) -- ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-08 Thread Tim Peters
Tim Peters added the comment: Good sleuthing, Dennis! Yes, Fredrik was not willing to add "potentially expensive" (in time or in space) tricks: http://effbot.org/zone/stringlib.htm So worst-case time is proportional to the product of the arguments' lengths, and the cases here appear to,

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-08 Thread Dennis Sweeney
Dennis Sweeney added the comment: Indeed, this is just a very unlucky case. >>> n = len(longer) >>> from collections import Counter >>> Counter(s[:n]) Counter({0: 9056995, 255: 6346813}) >>> s[n-30:n+30].replace(b'\x00', b'.').replace(b'\xff', b'@')

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-08 Thread STINNER Victor
STINNER Victor added the comment: Does the code really hang? Or does it complete if you wait for an infinite amount of time? I understand that your concern is that bytes.find() is inefficient for a very specific pattern. -- nosy: +vstinner ___

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-08 Thread Dennis Sweeney
Dennis Sweeney added the comment: Adding some hasty printf-debugging to fastsearch.h, I see this: >>> with open('data.bin', 'rb') as f: ... s = f.read() ... >>> base = 15403807 * b'\xff' >>> longer = base + b'\xff' >>> base in s Bloom has 0x00: 0 Bloom

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-08 Thread Ammar Askar
Ammar Askar added the comment: It's available from the Github gist that Kevin posted. Here is a direct link that you can curl/wget: https://gist.github.com/Zeturic/7d0480a94352968c1fe92aa62e8adeaf/raw/6daebaabedaa903016810c2c04d0d1f0b1af1ed3/data.bin -- nosy: +ammar2

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-08 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: What is the content of "data.bin"? How can we reproduce the issue? -- nosy: +serhiy.storchaka versions: +Python 3.10 ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-07 Thread pmp-p
Change by pmp-p : -- nosy: +pmpp ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-07 Thread Tim Peters
Tim Peters added the comment: Also reproduced on 64-bit Win10 with just-released 3.9.0. Note that string search tries to incorporate a number of tricks (pieces of Boyer-Moore, Sunday, etc) to speed searches. The "skip tables" here are probably computing a 0 by mistake. The file here

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-07 Thread Josh Rosenberg
Change by Josh Rosenberg : -- type: performance -> behavior ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-07 Thread Josh Rosenberg
Josh Rosenberg added the comment: Can reproduce on Alpine Linux, with CPython 3.8.2 (running under WSLv2), so it's not just you. CPU usage is high; seems like it must be stuck in an infinite loop. -- nosy: +josh.r ___ Python tracker

[issue41972] bytes.find consistently hangs in a particular scenario

2020-10-07 Thread Kevin Mills
New submission from Kevin Mills : Sorry for the vague title. I'm not sure how to succinctly describe this issue. The following code: ``` with open("data.bin", "rb") as f: data = f.read() base = 15403807 * b'\xff' longer = base + b'\xff' print(data.find(base)) print(data.find(longer))