[issue46667] SequenceMatcher & autojunk - false negative

2022-02-07 Thread Tim Peters


Tim Peters  added the comment:

We can't change defaults without superb reason - Python has millions of users, 
and changing the output of code "that works" is almost always a non-starter.

Improvements to the docs are welcome.

In your example, try running this code after using autojunk=True:

pending = ""
for ch in first:
if ch in sm.bpopular:
if pending:
print(repr(pending))
pending = ""
else:
pending += ch
print(repr(pending))

That shows how `first` is effectively broken into tiny pieces given that the 
"popular" chaaracters act like walls. Here's the start of the output:

'\nUN'
'QUESTR'
'NG\nL'
'x'
'f'
'.'
'L'
'b'
"'"
'x'
'v'
'1500'
','

and on & on. `QUESTER' is the longest common contiguous substring remaining.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46667] SequenceMatcher & autojunk - false negative

2022-02-07 Thread Jonathan


Jonathan  added the comment:

I still don't get how UNIQUESTRING is the longest even with autojunk=True, but 
that's an implementation detail and I'll trust you that it's working as 
expected.

Given this, I'd suggest the following then:

* `Autojunk=False` should be the default unless there's some reason to believe 
SequenceMatcher is mostly used for code comparisons.

* If - for whatever reason - the default can't be changed, I'd suggest a nice 
big docs "Warning" (at a minimum a "Note") saying something like "The default 
autojunk=True is not suitable for normal string comparison. See autojunk for 
more information".

* Human-friendly doc explanation for autojunk. The current explanation is only 
going to be helpful to the tiny fraction of users who understand the algorithm. 
Your explanation is a good start:
"Autojunk was introduced as a way to greatly speed comparing files of 
code, viewing them as sequences of lines. But it more often backfires when 
comparing strings (viewed as sequences of characters)"

Put simply: The current docs aren't helpful to users who don't have text 
matching expertise, nor do they emphasise the huge caveat that autojunk=True 
raises.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46667] SequenceMatcher & autojunk - false negative

2022-02-06 Thread Tim Peters


Tim Peters  added the comment:

SequenceMatcher looks for the longest _contiguous_ match. "UNIQUESTRING" isn't 
the longest by far when autojunk is False, but is the longest when autojunk is 
True. All those bpopular characters then effectively prevent finding a longer 
match than 'QUESTR' (capital 'I" is also in bpopular) directly.

The effects of autojunk can be surprising, and it would have been better if it 
were False by default. But I don't see anything unexpected here. Learn from 
experience and force it to False yourself ;-) BTW, it was introduced as a way 
to greatly speed comparing files of code, viewing them as sequences of lines. 
In that context, autojunk is rarely surprising and usually helpful. But it more 
often backfires when comparing strings (viewed as sequences of characters) :-(

--
nosy: +tim.peters

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46667] SequenceMatcher & autojunk - false negative

2022-02-06 Thread Jonathan


Jonathan  added the comment:

Gah. I mean 0.008 in both directions. I'm just going to be quiet now. :-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46667] SequenceMatcher & autojunk - false negative

2022-02-06 Thread Jonathan


Jonathan  added the comment:

(Like the idiot I am, the example code is wrong. `autojunk` parameter should 
*not* be set for either of them to get the stated wrong results).

In place of "UNIQUESTRING", any unique 3 character string triggers it (QQQ, 
EEE, ZQU...). And in those cases you get a ratio of 0.008! (and 0.993 in the 
other direction!)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46667] SequenceMatcher & autojunk - false negative

2022-02-06 Thread Jonathan


New submission from Jonathan :

The following two strings are identical other than the text "UNIQUESTRING".
UNIQUESTRING is at the start of first and at the end of second.
Running the below gives the following output:


0.99830220713073
0.99830220713073
0.023769100169779286  # ratio

0.99830220713073
0.99830220713073
0.023769100169779286  # ratio

As you can see, Ratio is basically 0. Remove either of the UNIQUESTRING pieces 
and it goes up to 0.98 (correct)... Remove both and you get 1.0 (correct)


```
from difflib import SequenceMatcher

first = """
UNIQUESTRING
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type 
specimen book. It has survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. It was popularised in 
the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, 
and more recently with desktop publishing software like Aldus PageMaker 
including versions of Lorem Ipsum
"""


second = """

Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type 
specimen book. It has survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. It was popularised in 
the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, 
and more recently with desktop publishing software like Aldus PageMaker 
including versions of Lorem Ipsum  UNIQUESTRING
"""

sm = SequenceMatcher(None, first, second, autojunk=False)
print(sm.real_quick_ratio())
print(sm.quick_ratio())
print(sm.ratio())

print()

sm2 = SequenceMatcher(None, second, first, autojunk=False)
print(sm2.real_quick_ratio())
print(sm2.quick_ratio())
print(sm2.ratio())

```

If I add `autojunk=False`, then I get a correct looking ratio (0.98...), 
however from my reading of the autojunk docs, UNIQUESTRING shouldn't be 
triggering it. Furthermore, looking in the code, as far as I can see autojunk 
is having no effect...

Autojunk considers these items to be "popular" in that string:
`{'n', 'p', 'a', 'h', 'e', 'u', 'I', 'r', 'k', 'g', 'y', 'm', 'c', 'd', 't', 
'l', 'o', 's', ' ', 'i'}`

If I remove UNIQUESTRING from `first`, this is the autojunk popular set:
`{'c', 'p', 'a', 'u', 'r', 'm', 'k', 'g', 'I', 'd', ' ', 'o', 'h', 't', 'e', 
'i', 'l', 's', 'y', 'n'}`

They're identical!

In both scenarios, `b2j` is also identical.

I don't pretend to understand what the module is doing in any detail, but this 
certainly seems like a false positive/negative.

Python 3.8.10

--
components: Library (Lib)
messages: 412673
nosy: jonathan-lp
priority: normal
severity: normal
status: open
title: SequenceMatcher & autojunk - false negative
type: behavior
versions: Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com