Re: [Wikitech-l] Different apostrophe signs and MediaWiki internal search

Neil Harris Sat, 20 Jun 2009 04:38:18 -0700

Andrew Dunbar wrote:
> 2009/6/20 Neil Harris <[email protected]>:
>   
>> Neil Harris wrote:
>>     
>>> Andrew Dunbar wrote:
>>>
>>>       
>>>> 2009/6/20 Jaska Zedlik <[email protected]>:
>>>>
>>>>
>>>>         
>>>>> Hello,
>>>>> On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>>> Jaska Zedlik skrev:
>>>>>> <...>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> The code of the override function is the following:
>>>>>>>
>>>>>>> function stripForSearch( $string ) {
>>>>>>>   $s = $string;
>>>>>>>   $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
>>>>>>>   return parent::stripForSearch( $s );
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> I'm not a PHP programmer, but why use the extra assignment to $s
>>>>>> instead of using $string directly, like so:
>>>>>>
>>>>>> function stripForSearch( $string ) {
>>>>>>     $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
>>>>>>     return parent::stripForSearch( $s );
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>> Really, you are right: in the real function all these redundant
>>>>> assignments should be stripped for performance reasons. I just used
>>>>> a framework from the Japanese language class, which does some
>>>>> Japanese-specific reduction, but I agree with your remark.
>>>>>
>>>>>
>>>>>           
>>>> The username anti-spoofing code already knows about a lot of
>>>> "looks similar" characters, which may be of some help.
>>>>
>>>> Andrew Dunbar (hippietrail)
>>>>
>>>>
>>>>
>>>>
>>>>         
>>> Of itself, the username anti-spoofing code table -- which I originally
>>> wrote -- is rather too thorough for this purpose, since it deliberately
>>> errs on the side of mapping even vaguely similar-looking characters to
>>> one another, regardless of character type and script system, and this,
>>> combined with case-folding and transitivity, leads to some apparently
>>> bizarre mappings that are of no practical use for any other application.
>>>
>>> If you're interested, I can take a look at producing a more limited
>>> punctuation-only version.
>>>
>>> -- Neil
>>>
>>>
>>>       
>> http://www.unicode.org/reports/tr39/data/confusables.txt is probably the
>> single best source for information about visual confusables.
>>
>> Staying entirely within the Latin punctuation repertoire, and avoiding
>> combining characters and other exotica such as math characters and
>> dingbats, you might want to consider the following characters as
>> possible unintentional lookalikes for the apostrophe:
>>
>> U+0027 APOSTROPHE
>> U+2019 RIGHT SINGLE QUOTATION MARK
>> U+2018 LEFT SINGLE QUOTATION MARK
>> U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
>> U+2032 PRIME
>> U+00B4 ACUTE ACCENT
>> U+0060 GRAVE ACCENT
>> U+FF40 FULLWIDTH GRAVE ACCENT
>> U+FF07 FULLWIDTH APOSTROPHE
>>
>> There are also lots of other characters that look like these from other
>> languages, and various combining character combinations which could also
>> look the same, but I doubt whether they would be generated in Latin text
>> by accident.
>>     
>
> I would add
> U+02BB MODIFIER LETTER TURNED COMMA (Hawaiian ʻokina)
> U+02C8 MODIFIER LETTER VERTICAL LINE (IPA primary stress mark)
>
> It might be worthwhile folding some dashes and hyphens too.
>
> Andrew Dunbar (hippietrail)
>


Interestingly, following up on the above, I've found one source
(http://snowball.tartarus.org/texts/apostrophe.html) which states that
U+201B may be used deliberately as an apostrophe variant by some
publishers in some contexts.

Regarding dashes and hyphens, I've now found my original data set, and a 
quick inspection gives this set of various similar-looking Latin 
hyphens, dashes and minus signs:

U+002D HYPHEN-MINUS
U+2010 HYPHEN
U+2011 NON-BREAKING HYPHEN
U+2012 FIGURE DASH
U+2013 EN DASH
U+2212 MINUS SIGN
U+FE58 SMALL EM DASH
U+FF0D FULLWIDTH HYPHEN-MINUS
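
As a rough sketch of how the apostrophe and dash candidates from this
thread might be folded before indexing: the helper name
foldLookalikePunctuation is made up for illustration and is not part of
MediaWiki, and the "\u{...}" escapes assume PHP 7.0 or later. Whether
every candidate (the grave accent in particular) should really be folded
is a separate judgment call.

```php
<?php
// Hypothetical helper, not part of MediaWiki: folds the lookalike
// candidates listed in this thread to ASCII ' and - respectively.
function foldLookalikePunctuation( $string ) {
    $apostrophes = array(
        "\u{2019}", // RIGHT SINGLE QUOTATION MARK
        "\u{2018}", // LEFT SINGLE QUOTATION MARK
        "\u{201B}", // SINGLE HIGH-REVERSED-9 QUOTATION MARK
        "\u{2032}", // PRIME
        "\u{00B4}", // ACUTE ACCENT
        "\u{0060}", // GRAVE ACCENT
        "\u{FF40}", // FULLWIDTH GRAVE ACCENT
        "\u{FF07}", // FULLWIDTH APOSTROPHE
        "\u{02BB}", // MODIFIER LETTER TURNED COMMA
        "\u{02C8}", // MODIFIER LETTER VERTICAL LINE
    );
    $dashes = array(
        "\u{2010}", // HYPHEN
        "\u{2011}", // NON-BREAKING HYPHEN
        "\u{2012}", // FIGURE DASH
        "\u{2013}", // EN DASH
        "\u{2212}", // MINUS SIGN
        "\u{FE58}", // SMALL EM DASH
        "\u{FF0D}", // FULLWIDTH HYPHEN-MINUS
    );

    // Build one translation table so the string is scanned only once.
    $map = array_fill_keys( $apostrophes, "'" )
         + array_fill_keys( $dashes, '-' );

    // strtr() works on bytes, which is safe for valid UTF-8 input.
    return strtr( $string, $map );
}

// prints: O'Brien 1914-1918
echo foldLookalikePunctuation( "O\u{2019}Brien 1914\u{2013}1918" ), "\n";
```

A stripForSearch override could pass its argument through such a helper
before calling the parent implementation.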

I can send the full data set of lookalikes to anyone who is interested: 
it can be quite easily extended by regarding the relation "looks like" 
as transitive, to include more distant and linguistically dubious visual 
confusables such as (just for example) U+2015 HORIZONTAL BAR, U+1173 
HANGUL JUNGSEONG EU and U+2F00 KANGXI RADICAL ONE.
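
To illustrate what treating "looks like" as transitive does, here is a
minimal union-find sketch. The function name lookalikeClasses and the
pairs fed to it are invented for this example; they are not taken from
the actual data set, and closures assume PHP 5.3 or later.

```php
<?php
// Hypothetical sketch: group code points into classes by taking the
// transitive closure of a list of "looks like" pairs (union-find).
function lookalikeClasses( array $pairs ) {
    $parent = array();

    // Find the representative of $x, registering $x on first sight.
    $find = function ( $x ) use ( &$parent ) {
        if ( !isset( $parent[$x] ) ) {
            $parent[$x] = $x;
        }
        while ( $parent[$x] !== $x ) {
            $parent[$x] = $parent[ $parent[$x] ]; // path halving
            $x = $parent[$x];
        }
        return $x;
    };

    foreach ( $pairs as $pair ) {
        list( $a, $b ) = $pair;
        $ra = $find( $a );
        $rb = $find( $b );
        if ( $ra !== $rb ) {
            // Keep the smaller code point as the representative.
            $parent[ max( $ra, $rb ) ] = min( $ra, $rb );
        }
    }

    // Collect every known code point under its representative.
    $classes = array();
    foreach ( array_keys( $parent ) as $cp ) {
        $classes[ $find( $cp ) ][] = $cp;
    }
    return $classes;
}

// Invented pairs, for illustration only.
$pairs = array(
    array( 0x002D, 0x2013 ),
    array( 0x2013, 0x2015 ),
    array( 0x2015, 0x1173 ),
    array( 0x1173, 0x2F00 ),
    array( 0x0027, 0x2019 ),
);
foreach ( lookalikeClasses( $pairs ) as $rep => $members ) {
    printf( "U+%04X: %d members\n", $rep, count( $members ) );
}
```

With these pairs, U+2F00 lands in the same class as U+002D even though
no single pair links them directly, which is how the more distant
confusables creep in.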

-- Neil




_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
