Re: [Wikitech-l] Different apostrophe signs and MediaWiki internal search

Neil Harris Sat, 20 Jun 2009 03:36:59 -0700

Neil Harris wrote:
> Andrew Dunbar wrote:
>   
>> 2009/6/20 Jaska Zedlik <[email protected]>:
>>   
>>     
>>> Hello,
>>> On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <[email protected]> wrote:
>>>
>>>     
>>>       
>>>> Jaska Zedlik wrote:
>>>> <...>
>>>>       
>>>>         
>>>>> The code of the override function is the following:
>>>>>
>>>>> function stripForSearch( $string ) {
>>>>>   $s = $string;
>>>>>   $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
>>>>>   return parent::stripForSearch( $s );
>>>>> }
>>>>>         
>>>>>           
>>>> I'm not a PHP programmer, but why use the extra assignment of $s
>>>> instead of using $string directly, like so:
>>>>
>>>> function stripForSearch( $string ) {
>>>>     $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
>>>>     return parent::stripForSearch( $s );
>>>> }
>>>>
>>>>       
>>>>         
>>> You are right: in the real function all these redundant assignments
>>> should be stripped for performance reasons. I just used the skeleton
>>> of the Japanese language class, which does some Japanese-specific
>>> reduction, but I agree with your remark.
>>>     
>>>       
>> The username anti-spoofing code already knows about a lot of "looks similar"
>> characters which may be of some help.
>>
>> Andrew Dunbar (hippietrail)
>>
>>
>>   
>>     
> By itself, the username anti-spoofing code table -- which I originally
> wrote -- is rather too thorough for this purpose, since it deliberately
> errs on the side of mapping even vaguely similar-looking characters to
> one another, regardless of character type and script system, and this,
> combined with case-folding and transitivity, leads to some apparently
> bizarre mappings that are of no practical use for any other application.
>
> If you're interested, I can take a look at producing a more limited 
> punctuation-only version.
>
> -- Neil
>
>   
http://www.unicode.org/reports/tr39/data/confusables.txt is probably the 
single best source for information about visual confusables.


Staying entirely within the Latin punctuation repertoire, and avoiding 
combining characters and other exotica such as math characters and 
dingbats, you might want to consider the following characters as 
possible unintentional lookalikes for the apostrophe:

U+0027 APOSTROPHE
U+2019 RIGHT SINGLE QUOTATION MARK
U+2018 LEFT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+2032 PRIME
U+00B4 ACUTE ACCENT
U+0060 GRAVE ACCENT
U+FF40 FULLWIDTH GRAVE ACCENT
U+FF07 FULLWIDTH APOSTROPHE

There are also many characters from other scripts that look like these,
and various combining character sequences that could look the same, but
I doubt whether they would be generated in Latin text by accident.

Please check these against the actual code tables for reasonableness and 
accuracy before putting them in any code.
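
As a rough sketch only, and assuming the candidate list above survives
that checking, the characters could be folded to U+0027 with a single
strtr() call before the text is handed to the search index. The helper
name normalizeApostrophes() is made up for illustration and is not an
existing MediaWiki function; the byte strings are the UTF-8 encodings of
the code points listed above.

```php
<?php
// Hypothetical helper (not part of MediaWiki): fold apostrophe lookalikes
// to U+0027 APOSTROPHE. Keys are raw UTF-8 byte sequences.
function normalizeApostrophes( $string ) {
    $map = array(
        "\xE2\x80\x99" => "'", // U+2019 RIGHT SINGLE QUOTATION MARK
        "\xE2\x80\x98" => "'", // U+2018 LEFT SINGLE QUOTATION MARK
        "\xE2\x80\x9B" => "'", // U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
        "\xE2\x80\xB2" => "'", // U+2032 PRIME
        "\xC2\xB4"     => "'", // U+00B4 ACUTE ACCENT
        "`"            => "'", // U+0060 GRAVE ACCENT
        "\xEF\xBD\x80" => "'", // U+FF40 FULLWIDTH GRAVE ACCENT
        "\xEF\xBC\x87" => "'", // U+FF07 FULLWIDTH APOSTROPHE
    );
    // With an array argument, strtr() replaces each match once and does
    // not rescan its own output, so entry order does not matter.
    return strtr( $string, $map );
}
```

Inside the stripForSearch() override quoted earlier in the thread, this
would presumably take the place of the single preg_replace() call, i.e.
return parent::stripForSearch( normalizeApostrophes( $string ) ).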

-- Neil

