[Bug 39501] Merging Unicode similar-looking characters in internal search (apostrophes, x and ×, etc)

2014-01-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

Nik Everett neverett+bugzi...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|Lowest                      |Normal

--- Comment #11 from Nik Everett neverett+bugzi...@wikimedia.org ---
I'll have a look at this when I can.  For now I'll leave the component set to
CirrusSearch.  It looks like PHP implements the same normalization components
that I can use in Elasticsearch (http://php.net/manual/en/class.normalizer.php)
so I'll have to evaluate doing that normalization there as well.  I imagine
that if we do it in PHP it'll have to be optional, because the normalizer
requires PHP 5 >= 5.3.0 and PECL intl >= 1.0.0.
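
As a rough illustration of what the standard normalization forms would and would not cover, here is a minimal sketch in Python rather than PHP. `unicodedata.normalize` applies the same Unicode normalization forms that the PHP Normalizer class exposes. The sample strings are made up for illustration and do not come from the bug.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# FULLWIDTH APOSTROPHE (U+FF07) has a compatibility mapping to U+0027,
# so NFKC folds it into the ASCII apostrophe.
fullwidth = nfkc("O\uff07Brien")

# RIGHT SINGLE QUOTATION MARK (U+2019) and MULTIPLICATION SIGN (U+00D7)
# have no decomposition, so NFKC leaves them unchanged.
curly = nfkc("O\u2019Brien")
times = nfkc("4\u00d74")

print(fullwidth == "O'Brien")
print(curly == "O'Brien")
print(times == "4x4")
```

Normalization alone would therefore probably not merge the curly apostrophe or × with their ASCII look-alikes. An explicit mapping would still be needed on top of it.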

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 39501] Merging Unicode similar-looking characters in internal search (apostrophes, x and ×, etc)

2014-01-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

Nik Everett neverett+bugzi...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.wikimedia.org/show_bug.cgi?id=59666

--- Comment #12 from Nik Everett neverett+bugzi...@wikimedia.org ---
In case anyone comes to this from
http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx#Pic-5,
they should have a look at Bug 59666 which should plug that particular
embarrassing hole.



[Bug 39501] Merging Unicode similar-looking characters in internal search (apostrophes, x and ×, etc)

2014-01-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

MZMcBride b...@mzmcbride.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |b...@mzmcbride.com,
                   |                            |legoktm.wikipe...@gmail.com,
                   |                            |matma@gmail.com

--- Comment #9 from MZMcBride b...@mzmcbride.com ---
Looks like apostrophes came up on The Daily WTF:
http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx
(specifically http://img.thedailywtf.com/images/14/q1/e95/Pic-5.jpg).

(In reply to comment #6)
> Were you thinking this should be done in Cirrus for all languages by pushing
> analysis configuration to Elasticsearch?  Something along those lines would
> be pretty flexible, allowing, for example, us to boost perfect matches of the
> typed unicode characters above the squashed ones.

We already do some input normalization at some level of the stack (for example,
multiple underscores get squashed and input such as AbrAhAm LincoLn works if
there's a redirect at Abraham lincoln).

It's difficult to look at the provided screenshot and not think that the
software has failed our readers. Unless you think these should be MediaWiki
page redirects (#REDIRECT)? I think we should do better normalization for
search inputs.

Any rough idea how big of a project this would be to implement?



[Bug 39501] Merging Unicode similar-looking characters in internal search (apostrophes, x and ×, etc)

2014-01-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

MZMcBride b...@mzmcbride.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.wikimedia.org/show_bug.cgi?id=36313



[Bug 39501] Merging Unicode similar-looking characters in internal search (apostrophes, x and ×, etc)

2014-01-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

--- Comment #10 from MZMcBride b...@mzmcbride.com ---
(In reply to comment #9)
> We already do some input normalization at some level of the stack (for
> example, multiple underscores get squashed and input such as AbrAhAm LincoLn
> works if there's a redirect at Abraham lincoln).

To be more explicit on these points:

https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=AbrAhAm+LincoLn

https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=_AbrAhAm_LincoLn_

We may be able to implement apostrophe normalization at the same level.
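
A minimal sketch of what normalizing the search input at that level might look like, written in Python for illustration. This is not MediaWiki's actual code: the function name `normalize_query` and the table of apostrophe look-alikes are assumptions, and the real set of characters to merge would need review.

```python
import re

# Hypothetical table of apostrophe look-alikes (assumption, not MediaWiki data).
APOSTROPHE_LIKE = str.maketrans({
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
})


def normalize_query(query: str) -> str:
    """Return a lookup key for a search query (illustrative only)."""
    # Fold apostrophe look-alikes into the ASCII apostrophe.
    text = query.translate(APOSTROPHE_LIKE)
    # Treat underscores as spaces, then squash runs of whitespace.
    text = re.sub(r"[_\s]+", " ", text).strip()
    # Case-fold so "AbrAhAm LincoLn" and "Abraham lincoln" share one key.
    return text.casefold()


print(normalize_query("_AbrAhAm_LincoLn_"))
print(normalize_query("Hale\u2019s Tours"))
```

Under this sketch the two example searches above and a title typed with a curly apostrophe all reduce to keys that can be compared against similarly normalized page titles.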



[Bug 39501] Merging Unicode similar-looking characters in internal search (apostrophes, x and ×, etc)

2013-12-27 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

Nik Everett neverett+bugzi...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.wikimedia.org/show_bug.cgi?id=57242

--- Comment #8 from Nik Everett neverett+bugzi...@wikimedia.org ---
Added a See Also bug.  I think we should do this when we pull the Unicode
plugin into Elasticsearch.



[Bug 39501] Merging Unicode similar-looking characters in internal search (apostrophes, x and ×, etc)

2013-11-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

--- Comment #6 from Nik Everett neverett+bugzi...@wikimedia.org ---
Chad,

Were you thinking this should be done in Cirrus for all languages by pushing
analysis configuration to Elasticsearch?  Something along those lines would be
pretty flexible, allowing, for example, us to boost perfect matches of the
typed unicode characters above the squashed ones.  I'm not saying that is a
good idea, just something that is possible.
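
To make the idea concrete, here is a hedged sketch of such an analysis configuration, built as plain Python dictionaries that could be serialized to JSON. It is not CirrusSearch's configuration. The names `fold_lookalikes`, `folded` and `title.folded` are hypothetical, the mapping list is deliberately tiny, and the mapping syntax follows recent Elasticsearch releases, which may differ from the version in use at the time of this thread.

```python
import json

# Index settings: a "mapping" char filter folds look-alikes before tokenizing.
SETTINGS = {
    "analysis": {
        "char_filter": {
            "fold_lookalikes": {  # hypothetical name
                "type": "mapping",
                # U+2019 -> ASCII apostrophe, U+00D7 -> letter x
                "mappings": ["\u2019 => '", "\u00d7 => x"],
            }
        },
        "analyzer": {
            "folded": {  # hypothetical name
                "type": "custom",
                "char_filter": ["fold_lookalikes"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

# The title is indexed twice: as typed, and through the folding analyzer.
MAPPINGS = {
    "properties": {
        "title": {
            "type": "text",
            "fields": {"folded": {"type": "text", "analyzer": "folded"}},
        }
    }
}


def build_query(user_input: str, exact_boost: float = 2.0) -> dict:
    """Search both fields, boosting matches on the characters as typed."""
    return {
        "query": {
            "multi_match": {
                "query": user_input,
                "fields": [f"title^{exact_boost}", "title.folded"],
            }
        }
    }


print(json.dumps({"settings": SETTINGS, "mappings": MAPPINGS}, indent=2))
print(json.dumps(build_query("O\u2019Brien"), indent=2))
```

Keeping the unfolded field alongside the folded one is what allows a perfect match on the typed characters to rank above a squashed match.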



[Bug 39501] Merging Unicode similar-looking characters in internal search (apostrophes, x and ×, etc)

2013-11-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

--- Comment #7 from Chad H. innocentkil...@gmail.com ---
(In reply to comment #6)
> Chad,
>
> Were you thinking this should be done in Cirrus for all languages by pushing
> analysis configuration to Elasticsearch?  Something along those lines would
> be pretty flexible, allowing, for example, us to boost perfect matches of the
> typed unicode characters above the squashed ones.

Yeah that was pretty much my thinking.

> I'm not saying that is a good idea, just something that is possible.

I think it's a good idea, eventually. I set priority so low on purpose :)



[Bug 39501] Merging Unicode similar-looking characters in internal search (apostrophes, x and ×, etc)

2013-10-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=39501

Chad H. innocentkil...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|Normal                      |Lowest
                 CC|                            |innocentkil...@gmail.com,
                   |                            |neverett+bugzilla@wikimedia.org
          Component|lucene-search-2             |CirrusSearch
            Product|Wikimedia                   |MediaWiki extensions
   Target Milestone|---                         |Future release
            Summary|Merging Unicode             |Merging Unicode
                   |apostrophe-like characters  |similar-looking characters
                   |in internal search          |in internal search
                   |                            |(apostrophes, x and ×, etc)
           Severity|normal                      |enhancement

--- Comment #5 from Chad H. innocentkil...@gmail.com ---
Widening scope a tiny bit. If we're going to do this it should be done all at
once.

AntiSpoof's sort of the idea I'm thinking here.

Repurposing into a Cirrus bug as lsearchd has been end-of-lifed and won't be
fixed further.
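
The AntiSpoof-style approach mentioned above can be sketched as reducing each title to a "skeleton" in which look-alike characters collapse to one representative, so that similar-looking titles collide on the same key. The Python below is an illustration under stated assumptions: the `LOOKALIKES` table is a tiny hypothetical sample, not AntiSpoof's actual equivalence data, and `skeleton` and `group_by_skeleton` are made-up names.

```python
from collections import defaultdict

# Hypothetical look-alike table (assumption; a real table would be far larger).
LOOKALIKES = str.maketrans({
    "\u00d7": "x",  # multiplication sign
    "\u2019": "'",  # right single quotation mark
    "\u0430": "a",  # Cyrillic small letter a
})


def skeleton(title: str) -> str:
    """Collapse look-alike characters and case into one comparison key."""
    return title.translate(LOOKALIKES).casefold()


def group_by_skeleton(titles):
    """Group titles whose skeletons are identical."""
    groups = defaultdict(list)
    for title in titles:
        groups[skeleton(title)].append(title)
    return dict(groups)


print(group_by_skeleton(["4x4", "4\u00d74", "Paris"]))
```

Doing all the character classes through one table is in line with handling the merge "all at once" rather than one special case at a time.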
