https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

       Web browser: ---
             Bug #: 39501
           Summary: Merging Unicode apostrophe-like characters in internal
                    search
           Product: MediaWiki extensions
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: Lucene Search
        AssignedTo: [email protected]
        ReportedBy: [email protected]
                CC: [email protected]
    Classification: Unclassified
   Mobile Platform: ---


When searching with U+0027 (the "apostrophe/single quote" character available
on most keyboards), results should also match other Unicode apostrophe-like
characters, such as the preferred apostrophe U+2019 and others.

In 2009 there was a discussion titled "Different apostrophe signs and MediaWiki
internal search"; see
http://www.gossamer-threads.com/lists/wiki/wikitech/169177
The proposal from that discussion doesn't seem to have been implemented.

This is related to bug 36313 for autocompletion.

Basically, both indexing and searching should convert all apostrophes to
U+0027. Articles containing U+2019, for example, would then be matched when
searching with U+0027, U+2019 or any other apostrophe.

From the 2009 discussion, the list of apostrophes was:
U+0027 APOSTROPHE 
U+2018 LEFT SINGLE QUOTATION MARK 
U+2019 RIGHT SINGLE QUOTATION MARK 
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK 
U+2032 PRIME 
U+00B4 ACUTE ACCENT 
U+0060 GRAVE ACCENT 
U+FF40 FULLWIDTH GRAVE ACCENT 
U+FF07 FULLWIDTH APOSTROPHE

I would add other characters for which U+0027 is often used as an accessible
substitute, such as some modifier letters and the saltillo:
U+02B9 MODIFIER LETTER PRIME
U+02BB MODIFIER LETTER TURNED COMMA
U+02BC MODIFIER LETTER APOSTROPHE
U+02BD MODIFIER LETTER REVERSED COMMA
U+02BE MODIFIER LETTER RIGHT HALF RING
U+02BF MODIFIER LETTER LEFT HALF RING
U+0384 GREEK TONOS
U+1FBF GREEK PSILI
U+A78B LATIN CAPITAL LETTER SALTILLO
U+A78C LATIN SMALL LETTER SALTILLO
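
To make the proposal concrete, here is a minimal sketch of folding every
character in the two lists above to U+0027 on both the indexing side and the
query side. It is written in Python purely for illustration and is not
MediaWiki or Lucene Search code; the names fold_apostrophes, build_index and
search are hypothetical, and the dictionary stands in for a real search index.

```python
# Code points taken from the two lists in this report; each one is folded
# to U+0027 APOSTROPHE. (U+0027 itself needs no mapping.)
APOSTROPHE_LIKE = (
    "\u2018\u2019\u201B\u2032\u00B4\u0060\uFF40\uFF07"            # 2009 list
    "\u02B9\u02BB\u02BC\u02BD\u02BE\u02BF\u0384\u1FBF\uA78B\uA78C"  # additions
)

# str.translate expects a mapping from code point (int) to replacement text.
FOLD_TABLE = {ord(ch): "'" for ch in APOSTROPHE_LIKE}


def fold_apostrophes(text: str) -> str:
    """Return text with every listed apostrophe-like character as U+0027."""
    return text.translate(FOLD_TABLE)


def build_index(titles):
    """Toy index (assumption): folded title -> original title."""
    return {fold_apostrophes(title): title for title in titles}


def search(index, query):
    """Fold the query the same way as the index, then look it up."""
    return index.get(fold_apostrophes(query))


if __name__ == "__main__":
    index = build_index(["L\u2019Aquila", "Hawai\u02BBi"])
    # A plain keyboard apostrophe finds the title stored with U+2019.
    print(search(index, "L'Aquila"))
    # So does a query typed with a different apostrophe-like character.
    print(search(index, "L\u2018Aquila"))
```

Because the same folding is applied at index time and at query time, the
original title can keep its preferred character while any of the listed
variants typed in the search box still matches it.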

WebKit-based browsers already do this kind of stripping and merge U+0027,
U+2018, U+2019 and U+FF07. However, there are many cases where merging all the
proposed characters would help regular keyboard input.

The proposed solution in 2009 was to use a strip function:
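function stripForSearch( $string ) {
    // Fold U+2019 (UTF-8 bytes E2 80 99) to U+0027; only this one character
    // is handled here, then the parent implementation does the rest.
    $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
    return parent::stripForSearch( $s );
}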
