> On Oct 17, 2020, at 2:40 PM, Tim Peters wrote:
>
> Still waiting for someone who thinks string search speed is critical
> in their real app to give it a try. In the absence of that, I endorse
> merging this.
Be bold. Merge it. :-)
Raymond
On 17/10/20 3:26 pm, Tim Peters wrote:
Tal Einat posted earlier that he was keen to try to work up a clear
explanation, and I look forward to that. All the expositions I've
found of this algorithm so far are hard to approach.
Maybe Mathologer or 3blue1brown could be persuaded to help?
They seem
[Tim]
> ...
> Alas, the higher preprocessing costs leave the current PR slower in "too
> many" cases too, especially when the needle is short and found early
> in the haystack. Then any preprocessing cost approaches a pure waste
> of time.
But that was this morning. Since then, Dennis changed the
[Tim]
>> Note that no "extra" storage is needed to exploit this. No character
>> lookups, no extra expenses in time or space of any kind. Just "if we
>> mismatch on the k'th try, we can jump ahead k positions".
[Antoine Pitrou ]
> Ok, so that means that on an N-character haystack, it'll always do
On Fri, 16 Oct 2020 20:24:22 -0500
Tim Peters wrote:
>
> Note that no "extra" storage is needed to exploit this. No character
> lookups, no extra expenses in time or space of any kind. Just "if we
> mismatch on the k'th try, we can jump ahead k positions".
Ok, so that means that on an N-character haystack, it'll always do
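Tim's skip rule is easiest to see in a toy setting. The sketch below is not the PR's algorithm: it handles only needles whose characters are all distinct, where a weaker form of the rule (skip past everything that matched so far) is easy to prove safe. The real algorithm's preprocessing establishes a comparable guarantee for arbitrary needles.

```python
def find_distinct(haystack, needle):
    """Toy search for a needle whose characters are all distinct.

    After matching k characters and then mismatching, no occurrence can
    begin strictly inside the matched region (any such start would need
    needle[0] to equal some later needle character), so we jump ahead k
    positions instead of retreating to i + 1.
    """
    assert len(set(needle)) == len(needle), "toy version: distinct characters only"
    n, m = len(haystack), len(needle)
    i = 0
    while i <= n - m:
        k = 0
        while k < m and haystack[i + k] == needle[k]:
            k += 1
        if k == m:
            return i
        i += max(k, 1)   # skip the k matched positions (always at least 1)
    return -1
```

On inputs it accepts, it agrees with `str.find` while never re-examining a haystack character that already matched, which is the "no extra expenses" property Tim describes.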
[Tim Peters, explaining one of the new algorithm's surprisingly
effective moving parts]
[Chris Angelico ]
> Thank you, great explanation. Can this be added to the source code
> if/when this algorithm gets implemented?
No ;-) While I enjoy trying to make hard things clear(er), I need to
understand
On Sat, Oct 17, 2020 at 12:30 PM Tim Peters wrote:
>
> I don't plan on making a series of these posts, just this one, to give
> people _some_ insight into why the new algorithm gets systematic
> benefits the current algorithm can't. It splits the needle into two
> pieces, u and v, very carefully
I don't plan on making a series of these posts, just this one, to give
people _some_ insight into why the new algorithm gets systematic
benefits the current algorithm can't. It splits the needle into two
pieces, u and v, very carefully selected by subtle linear-time needle
preprocessing (and it's
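For the curious, the classic way to choose that split, used by Crochemore-Perrin "Two-Way" implementations such as glibc's memmem (the PR's exact preprocessing may differ in detail), is two maximal-suffix computations, one under the normal alphabet order and one under its reverse. A sketch:

```python
def _maximal_suffix(s, rev):
    """Return (start index, period) of the maximal suffix of s under the
    normal alphabet order (rev=False) or the reversed order (rev=True)."""
    ms, j, k, p = -1, 0, 1, 1   # ms + 1 is the current candidate start
    while j + k < len(s):
        a, b = s[j + k], s[ms + k]
        if (a > b) if rev else (a < b):
            # The rival suffix starting at j is smaller: discard it.
            j += k
            k = 1
            p = j - ms
        elif a == b:
            if k == p:
                j += p
                k = 1
            else:
                k += 1
        else:
            # The rival suffix is bigger: it becomes the new candidate.
            ms, j, k, p = j, j + 1, 1, 1
    return ms + 1, p

def critical_factorization(needle):
    """Split needle into (u, v) at a critical position, i.e. a cut whose
    local period equals the global period of the whole needle."""
    cut1, _ = _maximal_suffix(needle, rev=False)
    cut2, _ = _maximal_suffix(needle, rev=True)
    cut = max(cut1, cut2)
    return needle[:cut], needle[cut:]
```

For example, the classic paper's needle "GCAGAGAG" splits as ("GC", "AGAGAG"), a cut whose local period (7) matches the needle's global period.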
[Marco Sulla]
> Excuse me for intruding on an algorithm that I have not understood, but
> can the new optimization be applied to regexps too?
The algorithm is limited to searching for fixed strings.
However, _part_ of our regexp implementation (the bit that looks ahead
for a fixed string) will inh
Excuse me for intruding on an algorithm that I have not understood, but
can the new optimization be applied to regexps too?
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python
[Guido]
> I am not able to dream up any hard cases -- like other posters,
> my own use of substring search is usually looking for a short
> string in a relatively short piece of text. I doubt even the current
> optimizations matter to my uses.
I should have responded to this part differently. W
[Dennis Sweeney ]
> Here's my attempt at some heuristic motivation:
Thanks, Dennis! It helps. One gloss:
>
> The key insight though is that the worst strings are still
> "periodic enough", and if we have two different patterns going on,
> then we can intentionally split them apart.
The am
Here's my attempt at some heuristic motivation:
Try to construct a needle that will perform as poorly as possible when
using the naive two-nested-for-loops algorithm. You'll find that if
there isn't some sort of vague periodicity in your needle, then you
won't ever get *that* unlucky; each particu
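Dennis's point is easy to quantify with a comparison counter for the naive two-nested-for-loops algorithm (inputs below are made up for illustration): a "periodic enough" needle forces constant re-scanning, while a needle that mismatches immediately costs about one comparison per haystack position.

```python
def naive_comparisons(haystack, needle):
    """Naive two-nested-loops search; return the total number of
    character comparisons performed across the whole haystack."""
    comps = 0
    for i in range(len(haystack) - len(needle) + 1):
        for j in range(len(needle)):
            comps += 1
            if haystack[i + j] != needle[j]:
                break
    return comps

# Periodic needle against a periodic haystack: ~quadratic.
print(naive_comparisons("a" * 1000, "a" * 99 + "b"))   # 90100
# Needle that mismatches on the first character: ~linear.
print(naive_comparisons("a" * 1000, "bc"))             # 999
```

That 90x gap is exactly the "vague periodicity" worst case; without it, each window fails after a bounded number of comparisons.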
On 15/10/20 1:45 pm, Chris Angelico wrote:
So it'd
be heuristics in the core language that choose a good default for most
situations, and then a str method that returns a preprocessed needle.
Or maybe cache the results of the preprocessing?
--
Greg
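Greg's "preprocessed needle" idea could look something like the hypothetical class below. The name `Needle`, the `compiled()` cache, and the use of a Boyer-Moore-Horspool skip table are purely illustrative stand-ins; no such API exists in CPython, and the PR's actual preprocessing is different.

```python
from functools import lru_cache

class Needle:
    """Hypothetical preprocessed-needle object (illustration only).

    A Boyer-Moore-Horspool skip table stands in for the real
    preprocessing; building it once lets repeated searches reuse it.
    """
    def __init__(self, pattern):
        self.pattern = pattern
        m = len(pattern)
        # Shift by distance from the rightmost occurrence of each
        # character (excluding the final position) to the pattern's end.
        self.skip = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}

    def search(self, haystack):
        p, m = self.pattern, len(self.pattern)
        if m == 0:
            return 0
        i = m - 1
        while i < len(haystack):
            start = i - m + 1
            if haystack[start:i + 1] == p:
                return start
            i += self.skip.get(haystack[i], m)
        return -1

@lru_cache(maxsize=None)
def compiled(pattern):
    """Cache the preprocessing across calls, per Greg's caching idea."""
    return Needle(pattern)
```

Repeated calls with the same pattern then pay the preprocessing cost only once, which is the point of exposing a separate compiled object, much as `re.compile` does for regexps.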
[Guido]
> The key seems to be:
Except none of that quoted text (which I'll skip repeating) gives the
slightest clue as to _why_ it may be an improvement. So you split the
needle into two pieces. So what? What's the _point_? Why would
someone even imagine that might help?
Why is one half then
On Wed., Oct. 14, 2020, 17:37 Tim Peters wrote:
> [Steven D'Aprano ]
> > Perhaps this is a silly suggestion, but could we offer this as an
> > external function in the stdlib rather than a string method?
> >
> > Leave it up to the user to decide whether or not their data best suits
> > the find
On Wed, Oct 14, 2020 at 9:56 AM Tim Peters wrote:
> [Guido]
> > Maybe someone reading this can finish the Wikipedia page on
> > Two-Way Search? The code example trails off with a function with
> > some incomprehensible remarks and then a TODO..
>
> Yes, the Wikipedia page is worse than useless in
On Thu, Oct 15, 2020 at 11:38 AM Tim Peters wrote:
> I think this is premature. There is almost never an optimization
> that's a pure win in all cases. For example, on some platforms
> `timsort` will never be as fast as the old samplesort in cases with a
> very large number of equal elements, and
[Steven D'Aprano ]
> Perhaps this is a silly suggestion, but could we offer this as an
> external function in the stdlib rather than a string method?
>
> Leave it up to the user to decide whether or not their data best suits
> the find method or the new search function. It sounds like we can offer
On Wed, Oct 14, 2020 at 7:45 PM Steven D'Aprano wrote:
> Perhaps this is a silly suggestion, but could we offer this as an
> external function in the stdlib rather than a string method?
>
That feels unworkable to me.
For one thing, the 'in' operator hits this same issue, doesn't it? But for
ano
Perhaps this is a silly suggestion, but could we offer this as an
external function in the stdlib rather than a string method?
Leave it up to the user to decide whether or not their data best suits
the find method or the new search function. It sounds like we can offer
some rough heuristics, bu
On Wed, Oct 14, 2020 at 7:57 PM Tim Peters wrote:
>
> [Guido]
> > Maybe someone reading this can finish the Wikipedia page on
> > Two-Way Search? The code example trails off with a function with
> > some incomprehensible remarks and then a TODO..
>
> Yes, the Wikipedia page is worse than useless i
[Guido]
> Maybe someone reading this can finish the Wikipedia page on
> Two-Way Search? The code example trails off with a function with
> some incomprehensible remarks and then a TODO..
Yes, the Wikipedia page is worse than useless in its current state,
although some of the references it lists ar
Maybe someone reading this can finish the Wikipedia page on Two-Way Search?
The code example trails off with a function with some incomprehensible
remarks and then a TODO...
On Wed, Oct 14, 2020 at 9:07 AM Tim Peters wrote:
> Rest assured that Dennis is aware that pragmatics may change for
>
Rest assured that Dennis is aware that pragmatics may change for
shorter needles.
The code has always special-cased 1-character needles, because
it's impossible "even in theory" to improve on straightforward
brute-force search then.
Say the length of the text to search is `t`, and t
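For context, the 1-character case really is just a scan. A Python-level sketch of the idea (CPython's C code actually defers to a memchr-style loop rather than anything like this):

```python
def find_char(haystack, ch):
    """One-character needle: brute force is already optimal, because in
    the worst case every haystack character must be examined anyway."""
    for i, c in enumerate(haystack):
        if c == ch:
            return i
    return -1
```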
I'm sure that the large majority of the string searches I've done are in
Larry's tiny category.
However, I also think that big data needs are increasing, and things like
FASTA files can be enormously large texts that one has good reasons to
search on.
If there is a heuristic switch between algori
On 10/13/20 9:54 AM, Tim Peters wrote:
Due to the natures of the old and new algorithms, neither will be
faster or slower in all cases; the newer one should never be
dramatically slower than the old one; the new one will be
spectacularly faster in some cases; and "in general" it seems
impossible
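Both ends of that trade-off are easy to probe yourself (the inputs below are made up, and absolute times depend on machine and Python version):

```python
import timeit

# Highly periodic inputs, where a naive search can go quadratic
# and the new algorithm stays linear.
big_hay = "ab" * 100_000 + "ac"
big_needle = "ab" * 1_000 + "ac"

# Short needle found early, where any preprocessing is pure overhead.
small_hay = "the quick brown fox" * 10
small_needle = "quick"

for label, hay, nd in [("periodic", big_hay, big_needle),
                       ("early hit", small_hay, small_needle)]:
    t = timeit.timeit(lambda: hay.find(nd), number=100)
    print(f"{label}: found at {hay.find(nd)}, 100 searches took {t:.4f}s")
```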