Andrew Dunbar wrote:
> 2009/6/20 Neil Harris <[email protected]>:
>
>> Neil Harris wrote:
>>
>>> Andrew Dunbar wrote:
>>>
>>>> 2009/6/20 Jaska Zedlik <[email protected]>:
>>>>
>>>>> Hello,
>>>>> On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <[email protected]> wrote:
>>>>>
>>>>>> Jaska Zedlik wrote:
>>>>>> <...>
>>>>>>
>>>>>>> The code of the override function is the following:
>>>>>>>
>>>>>>> function stripForSearch( $string ) {
>>>>>>>     $s = $string;
>>>>>>>     $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
>>>>>>>     return parent::stripForSearch( $s );
>>>>>>> }
>>>>>>>
>>>>>> I'm not a PHP programmer, but why using the extra assignment of $s
>>>>>> instead of using $string directly in the parent call, like so:
>>>>>>
>>>>>> function stripForSearch( $string ) {
>>>>>>     $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
>>>>>>     return parent::stripForSearch( $s );
>>>>>> }
>>>>>>
>>>>> Really, you are right, for the real function all these redundant
>>>>> assignments should be stripped for the productivity purposes. I just
>>>>> used a framework from the Japanese language class, which does some
>>>>> Japanese-specific reduction, but I agree with your notice.
>>>>>
>>>> The username anti-spoofing code already knows about a lot of "looks
>>>> similar" characters which may be of some help.
>>>>
>>>> Andrew Dunbar (hippietrail)
>>>>
>>> Of itself, the username anti-spoofing code table -- which I originally
>>> wrote -- is rather too thorough for this purpose, since it deliberately
>>> errs on the side of mapping even vaguely similar-looking characters to
>>> one another, regardless of character type and script system, and this,
>>> combined with case-folding and transitivity, leads to some apparently
>>> bizarre mappings that are of no practical use for any other application.
>>>
>>> If you're interested, I can take a look at producing a more limited
>>> punctuation-only version.
>>>
>>> -- Neil
>>>
>> http://www.unicode.org/reports/tr39/data/confusables.txt is probably the
>> single best source for information about visual confusables.
>>
>> Staying entirely within the Latin punctuation repertoire, and avoiding
>> combining characters and other exotica such as math characters and
>> dingbats, you might want to consider the following characters as
>> possible unintentional lookalikes for the apostrophe:
>>
>> U+0027 APOSTROPHE
>> U+2019 RIGHT SINGLE QUOTATION MARK
>> U+2018 LEFT SINGLE QUOTATION MARK
>> U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
>> U+2032 PRIME
>> U+00B4 ACUTE ACCENT
>> U+0060 GRAVE ACCENT
>> U+FF40 FULLWIDTH GRAVE ACCENT
>> U+FF07 FULLWIDTH APOSTROPHE
>>
>> There are also lots of other characters that look like these from other
>> languages, and various combining character combinations which could also
>> look the same, but I doubt whether they would be generated in Latin text
>> by accident.
>>
>
> I would add
> U+02BB MODIFIER LETTER TURNED COMMA (Hawaiian 'okina)
> U+02C8 MODIFIER LETTER VERTICAL LINE (IPA primary stress mark)
>
> It might be worthwhile folding some dashes and hyphens too.
>
> Andrew Dunbar (hippietrail)
>
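For what it's worth, here is a minimal sketch of what folding the apostrophe lookalikes listed above (plus Andrew's two additions) down to U+0027 might look like. It is written in Python rather than PHP purely to keep it short, so it would need porting before it could go anywhere near a stripForSearch() override. The names APOSTROPHE_LOOKALIKES and fold_apostrophes are made up for this example, and whether every one of these characters should really be folded for search is a judgment call, not something this sketch settles.

```python
# Sketch only: fold apostrophe lookalikes to U+0027 before indexing/searching.
# The code points are the ones quoted in this thread; the names used here
# (APOSTROPHE_LOOKALIKES, fold_apostrophes) are hypothetical.

APOSTROPHE_LOOKALIKES = [
    0x2019,  # RIGHT SINGLE QUOTATION MARK
    0x2018,  # LEFT SINGLE QUOTATION MARK
    0x201B,  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
    0x2032,  # PRIME
    0x00B4,  # ACUTE ACCENT
    0x0060,  # GRAVE ACCENT
    0xFF40,  # FULLWIDTH GRAVE ACCENT
    0xFF07,  # FULLWIDTH APOSTROPHE
    0x02BB,  # MODIFIER LETTER TURNED COMMA
    0x02C8,  # MODIFIER LETTER VERTICAL LINE
]

# str.translate() accepts a dict mapping code points to replacement strings.
_FOLD_TABLE = {cp: "'" for cp in APOSTROPHE_LOOKALIKES}


def fold_apostrophes(text: str) -> str:
    """Return text with every listed lookalike replaced by U+0027."""
    return text.translate(_FOLD_TABLE)
```

Calling fold_apostrophes() on a title containing U+2019 gives back the same title with a plain ASCII apostrophe, so both spellings would end up with the same search key. Characters outside the table are left untouched.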
Interestingly, following up the above, I've found one source
(http://snowball.tartarus.org/texts/apostrophe.html) that states that
U+201B may be deliberately used as an apostrophe variant by some
publishers in some contexts.

Regarding dashes and hyphens, I've now found my original data set, and a
quick inspection gives this set of various similar-looking Latin hyphens,
dashes and minus signs:

U+002D HYPHEN-MINUS
U+2010 HYPHEN
U+2011 NON-BREAKING HYPHEN
U+2012 FIGURE DASH
U+2013 EN DASH
U+2212 MINUS SIGN
U+FE58 SMALL EM DASH
U+FF0D FULLWIDTH HYPHEN-MINUS

I can send the full data set of lookalikes to anyone who is interested:
it can be quite easily extended by regarding the relation "looks like" as
transitive, to include more distant and linguistically dubious visual
confusables such as (just for example) U+2015 HORIZONTAL BAR, U+1173
HANGUL JUNGSEONG EU and U+2F00 KANGXI RADICAL ONE.

-- Neil

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
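P.S. To illustrate the point about treating "looks like" as transitive, here is a rough Python sketch (again, not PHP, and not the actual anti-spoofing code) that groups characters into classes with a small union-find. The individual pairwise links in ASSUMED_PAIRS are assumptions invented for the example, not entries from the real data set; only the characters themselves are taken from this thread. The function name confusable_classes is likewise hypothetical.

```python
# Sketch only: what happens when "looks like" is treated as transitive.
# The pairs below are illustrative assumptions, not the real lookalike data.

def confusable_classes(pairs):
    """Group characters into classes, treating each pair as symmetric
    and the overall relation as transitive (simple union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for ch in list(parent):
        classes.setdefault(find(ch), set()).add(ch)
    return list(classes.values())


ASSUMED_PAIRS = [
    ("\u002D", "\u2013"),  # HYPHEN-MINUS ~ EN DASH
    ("\u2013", "\u2015"),  # EN DASH ~ HORIZONTAL BAR
    ("\u2015", "\u1173"),  # HORIZONTAL BAR ~ HANGUL JUNGSEONG EU
    ("\u2015", "\u2F00"),  # HORIZONTAL BAR ~ KANGXI RADICAL ONE
    ("\u0027", "\u2019"),  # APOSTROPHE ~ RIGHT SINGLE QUOTATION MARK
]

CLASSES = confusable_classes(ASSUMED_PAIRS)
```

With these assumed pairs, HYPHEN-MINUS ends up in the same class as the Hangul and Kangxi characters even though no single pair links them directly, which is the kind of mapping that makes sense for spoof detection but not for search folding. The apostrophe pair stays in a class of its own.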
