On Sat, Jun 20, 2009 at 9:46 PM, Neil Harris<[email protected]> wrote:
>> Regarding dashes and hyphens, I've now found my original data set, and
>> a quick inspection gives this set of various similar-looking Latin
>> hyphens, dashes and minus signs:
>> U+002D HYPHEN-MINUS
>> U+2010 HYPHEN
>> U+2011 NON-BREAKING HYPHEN
>> U+2012 FIGURE DASH
>> U+2013 EN DASH
>>
> and at this point I missed out U+2014 EM DASH, which was hiding in the
> world of transitive closure mentioned below...
>> U+2212 MINUS SIGN
>> U+FE58 SMALL EM DASH
>> U+FF0D FULLWIDTH HYPHEN-MINUS

I think you have to be mindful of the original goal here: for each
character a user is likely to enter from their keyboard in the search
box, what possible range of characters would they expect to match?

So, apostrophe (U+0027) -> curved right single quote (U+2019): yes, probably.
The other way around... probably not, unless U+2019 actually exists on some keyboards.

Hyphen-minus (U+002D) -> em dash (U+2014): I would say no. If you
search for "clock-work", you probably don't want to match a sentence
like "He was building a clock—work that is never easy—at the time."
(contrived, sure)

In short, you probably don't want the full range of "lookalikes":
the left side of each mapping should be a keyboard character, and the
right side should be a character that is semantically equivalent to it,
or one that is commonly used in its place incorrectly.
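
To make the one-way idea concrete, here is a rough sketch in Python. The
table contents and the names KEYBOARD_EQUIVALENTS and query_to_regex are
my own assumptions for illustration, not anything that exists in
MediaWiki; in particular, treating U+2010 and U+2011 as equivalents of
the hyphen-minus is just my guess at "semantically equivalent".

```python
import re

# Hypothetical one-way table: keyboard character -> characters it may match.
# The left side is what the user types; the right side is what may appear
# in the text. Nothing is mapped in the reverse direction.
KEYBOARD_EQUIVALENTS = {
    "'": "'\u2019",        # apostrophe also matches RIGHT SINGLE QUOTATION MARK
    "-": "-\u2010\u2011",  # hyphen-minus also matches HYPHEN, NON-BREAKING HYPHEN
}


def query_to_regex(query):
    """Expand each keyboard character in the query into a character class."""
    parts = []
    for ch in query:
        alternatives = KEYBOARD_EQUIVALENTS.get(ch)
        if alternatives is None:
            # No mapping: the character only matches itself.
            parts.append(re.escape(ch))
        else:
            escaped = "".join(re.escape(c) for c in alternatives)
            parts.append("[" + escaped + "]")
    return re.compile("".join(parts))


sentence = "He was building a clock\u2014work that is never easy\u2014at the time."
print(query_to_regex("clock-work").search(sentence))              # em dash: no match
print(bool(query_to_regex("don't").search("I don\u2019t know")))  # curly quote: match
```

Because the table is keyed only on keyboard characters, a query that
already contains U+2019 is matched literally and will not find a plain
apostrophe, which is the asymmetry described above.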

Steve

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l