#266: Backward search doesn't find underlined text
--+-
Reporter: egmont | Owner:
Type: defect | Status: new
Priority: minor| Milestone: 4.7
Component: mc-core | Version: 4.6.2
Resolution: |Keywords:
Blocking: | Blockedby:
--+-
Comment(by egmont):
I was thinking on approaches to fix this issue, as well as these other
forthcoming issues:
- basic utf8 support (included in the utf8 patch)
- case insensitive search in utf8 (tricky, not included in the current
utf8 patch)
- combination with manpage formats (that is, to be hardcore, have a utf8
manpage with backspace notation for bold/underlined, and do case
insensitive forward/backward search in this). Even displaying a utf8
manpage is not implemented in mc + the utf8 patch, but I have a patch for
this. However, my patch breaks when it comes to searching highlighted
text; searching simple text is case sensitive for accented letters; and
probably horribly breaks for backward search (never tried).
Basically there are two approaches for text search. Summarizing them:
- The "always fast" one: Build up a state machine first, based on the
needle string. Then just go through the whole input once.
- The simple one: See whether it matches from the given offset. If not,
try from the next offset.
Let's see these in detail.
The "always fast" approach, with the state machine:
The advantage is: no matter what the needle is and how large it is, the
file has to be processed only once, constant amount of memory is needed
and runtime is always linear to the length of the file.
Hard to implement. By each and every newly added feature it becomes even
more complicated. Man page format, utf8, case insensitiveness, backwards
option, and all combinations thereof make it extremely complicated.
As long as the only "extra" option was utf8, backward could be implemented
by reversing both haystack and needle. However, as soon as either manpage
format, or utf8 case insensitive search is a requirement, this no longer
works, since the state machine would have to be different anyway for the
reversed string. And there's no point in reversing the whole haystack and
needle in either cases: it's way easier and faster to simply walk
downwards in memory.
The simple solution:
The fear of the simple solution could be that under some circumstances it
might be slow. E.g. if both the needle and haystack are a series of letter
A's then it will require H*N time (H and N stand for haystack and needle
size). This is worst case in algorithmical sense, but not a typical use
case.
Note that with typical user-entered search strings, the run-time is still
linear (to the filesize), does not grow as you enter longer string. Even
if the needle is a random string, the average runtime is linear.
So in real life usage I think that this algorithm is absolutely perfect
for us. If we were about to write software that was potentially DoS-able
(such as a public web-based searching library), this would be an issue.
Within mc, where the user types the short query string, this algorithm is
perfect.
Okay, let's suppose we have manpage format, utf8, case insensitiveness and
all combination of them implemented in forward only mode with this
algorithm. And we were about to add backward searching capability.
Reversing the haystack and needle raises endless numbers of headaches
(different approach for manpage format, different tricks for utf8, no out-
of-the-box utf8 library usable anymore etc.) And what would we gain with
it? Absolutely nothing! Why?
Because what's a "backward search" from the user's point of view, can
simply be implemented as a loop of forward matches. The inner loop, which
is a kind of clever strcmp (with features for manpage format, utf8, case
insensitiveness and what not) can still go forward in memory. Only one
minor thing has to be changed: the outer loop, which tries out the
different matching positions, has to go downwards. Absolutely trivial
change.
Conclusion:
We don't need any super clever algorithm. The simplest way of implementing
search is absolutely perfect.
Reversing strings is a completely wrong idea. We should get rid of it, and
simply change the outer loop of _icase_search() to go downwards (and of
course adjust parameters and such). (Note that _icase_search() is
currently implemented as a single loop, which sometimes resets the loop
variable to a previous value, but logically what it does is way easier to
imagine as two nested loops.) This would automatically fix the original
bug of this ticket.
Once it's done (shouldn't be a hard change), I see two independent issues
that would still need to be solved, both are for the utf8 patch only:
support case insensitive utf