https://bugzilla.wikimedia.org/show_bug.cgi?id=41577
Web browser: ---
Bug #: 41577
Summary: Use normalized search key in term search index
Product: MediaWiki extensions
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: WikidataRepo
AssignedTo: [email protected]
ReportedBy: [email protected]
CC: [email protected],
[email protected]
Classification: Unclassified
Mobile Platform: ---
The term search index currently converts values to utf8 (and then to lower
case) on the fly to perform comparisons. That means a full table scan followed
by a filesort on a table that is likely to contain tens of millions of rows.
That is likely to kill the DB server.
To avoid this, there should be a dedicated search key column holding the
normalized key (similar to the way a search key column is used for category
sorting and finding external links). The same normalization shall apply to the
index term when it is inserted and to the search term when the query is
generated. In particular, the following normalization steps shall apply:
* Unicode normalization (NFC)
* trim leading and trailing whitespace (ideally, all Unicode whitespace chars)
* lower case (ideally, using the implementation from the appropriate Language
class)
* optionally, apply a configurable regular expression for stripping separators
(e.g. by default stripping all internal whitespace and hyphens, so "foobar"
will match "foo-bar" and "foo bar")
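The steps listed above could be sketched roughly as follows. This is only an
illustration in Python, not the proposed MediaWiki implementation: the names
normalize_search_key and DEFAULT_SEPARATORS are hypothetical, and Python's
built-in str.lower() stands in for the lower-casing of the appropriate Language
class.

```python
import re
import unicodedata

# Hypothetical default separator pattern: any run of whitespace or hyphens.
DEFAULT_SEPARATORS = re.compile(r"[\s\-]+")


def normalize_search_key(text, separators=DEFAULT_SEPARATORS):
    """Build a normalized search key from a term (illustrative sketch)."""
    # 1. Unicode normalization (NFC).
    key = unicodedata.normalize("NFC", text)
    # 2. Trim leading and trailing whitespace; str.strip() without
    #    arguments also removes Unicode whitespace characters.
    key = key.strip()
    # 3. Lower case (stand-in for a language-specific implementation).
    key = key.lower()
    # 4. Optionally strip separators using the configurable pattern.
    if separators is not None:
        key = separators.sub("", key)
    return key
```

The same function would have to be applied both when a term is written to the
index and when a search term is turned into a query; passing separators=None
disables the optional stripping step.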
This will provide case-insensitive matches with some flexibility regarding
whitespace, etc. If only exact matches are desired, the "soft" matches could be
filtered programmatically before they are returned to the caller.
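A minimal sketch of the lookup and the optional exact-match filter, using an
in-memory SQLite database purely for illustration. The table and column names
(terms, term_text, term_search_key) and the helpers search_key and find_terms
are assumptions made for this example, not the actual WikidataRepo schema or
API, and search_key is a simplified stand-in for the normalization above.

```python
import sqlite3
import unicodedata


def search_key(text):
    # Simplified stand-in for the normalization: NFC, trim, lower case,
    # then drop all whitespace and hyphens.
    nfc = unicodedata.normalize("NFC", text).strip().lower()
    return "".join(ch for ch in nfc if not ch.isspace() and ch != "-")


def find_terms(conn, query, exact=False):
    # Equality lookup on the precomputed key column, which can use the index.
    rows = conn.execute(
        "SELECT term_text FROM terms WHERE term_search_key = ? "
        "ORDER BY term_text",
        (search_key(query),),
    ).fetchall()
    texts = [row[0] for row in rows]
    if exact:
        # Filter the "soft" matches down to exact ones in application code.
        texts = [t for t in texts if t == query]
    return texts


# Hypothetical schema: original text plus an indexed, normalized key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE terms (term_text TEXT NOT NULL, term_search_key TEXT NOT NULL)"
)
conn.execute("CREATE INDEX terms_search_key ON terms (term_search_key)")
for text in ["foo-bar", "Foo Bar", "foobar", "baz"]:
    conn.execute("INSERT INTO terms VALUES (?, ?)", (text, search_key(text)))
```

Here find_terms(conn, "foo bar") returns all three spelling variants, while
find_terms(conn, "foobar", exact=True) returns only the row whose original text
is exactly "foobar".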