[Bug 60484] CirrusSearch: Don't index text in visually hidden elements

bugzilla-daemon Wed, 19 Feb 2014 07:15:06 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=60484


--- Comment #7 from Nik Everett <[email protected]> ---

The biggest case I can think of for excluding text from search is the license
information on commons.  Please take that as an example.  Maybe it is the only
example, but I think it is pretty important.
1.  The license information doesn't add a whole lot to the result.  Try
searching commons with Cirrus for "distribute", "transmit", or "following" and
you'll very quickly start to see the text of the CC license.  And the searches
find 14 million results.  Heaven forbid you want to find "distributed
transmits" or something.  You'll almost exclusively get the license highlighted
and you'll still find 14 million results.  This isn't _horrible_ because the
top results all have "distribute" or "transmit" in the title but it isn't
great.
2.  Knock-on effect from #1: because relevance is calculated based on the
inverse of the number of documents that contain the word, every term in
the CC license is worth less than words not in the license.  I can't point to
any example of why that is bad but I feel it in my bones.  Feel free to ignore
this.  I'm probably paranoid.
3.  Entirely self serving: given #1, the contents of the license take up an
awful lot of space for very little benefit.  If I had more space I could make
Cirrus a beta on more wikis.  It is kind of a lame reason and I'm attacking the
space issue from other angles so maybe it'll be moot long before we get this
deployed and convince the community that it is worth doing.
4.  Really really self serving:  if .nosearch is the right solution and is
useful then it is super duper easy to implement.  Like one line of code, a few
tests, and bam.  It's already done, just waiting to be rebased and merged.  It
was so easy it would have taken longer to estimate the effort than to propose
an implementation.
I really wouldn't be surprised if someone could come up with a great reason
why #1 is silly and we just shouldn't do it.
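
To put rough numbers on point 2, here is a minimal sketch of inverse document
frequency.  It uses one common textbook form of the formula, which is an
assumption and not necessarily what the search backend computes.  The 14
million figure is from the searches above; the index size and the rare-term
count are made up for illustration.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Smoothed inverse document frequency (a textbook form, assumed here)."""
    return math.log(1 + total_docs / docs_with_term)


TOTAL_DOCS = 20_000_000          # hypothetical index size
LICENSE_TERM_DOCS = 14_000_000   # pages matching a license word, per the searches above
RARE_TERM_DOCS = 5_000           # hypothetical word that is not in the license

license_weight = idf(TOTAL_DOCS, LICENSE_TERM_DOCS)
rare_weight = idf(TOTAL_DOCS, RARE_TERM_DOCS)

# A word that appears in the license on most pages contributes far less
# to a relevance score than a word that appears on few pages.
print(f"license word: {license_weight:.2f}, rare word: {rare_weight:.2f}")
```

Under these assumptions a license word carries only a fraction of the weight
of a rare word, even on the pages where it actually describes the image.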

The big problem with the nosearch class implementation is that it'd be pretty
simple to abuse and hard to catch the abuse because the text is still on the
page.  One of the nice things about the solution is you could use a web
browser's debugger to highlight all the text excluded from search by writing a
simple CSS rule.
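
For illustration only, here is a rough sketch of what dropping .nosearch
content before indexing could look like.  This is not the actual CirrusSearch
patch; NoSearchStripper and indexable_text are hypothetical names, and the
sample markup (including the licensetpl class) is invented.  It only shows the
idea of skipping every element that carries the class, along with everything
nested inside it.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NoSearchStripper(HTMLParser):
    """Collects page text, skipping elements marked with the excluded class."""

    def __init__(self, excluded_class="nosearch"):
        super().__init__(convert_charrefs=True)
        self.excluded_class = excluded_class
        self.skip_depth = 0   # > 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1          # nested element inside excluded content
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.excluded_class in classes:
            self.skip_depth = 1           # start skipping at this element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def indexable_text(html: str) -> str:
    """Return the text that would be sent to the index, whitespace-normalized."""
    parser = NoSearchStripper()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = (
    '<p>A photo of a cat.</p>'
    '<div class="licensetpl nosearch"><p>You are free to <b>distribute</b> '
    'and transmit<br>the work.</p></div>'
    '<p>Taken in 2013.</p>'
)
print(indexable_text(page))
```

The license text stays on the rendered page and only disappears from what gets
indexed, which is exactly why abuse would be hard to spot by reading the page.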

-- 
You are receiving this mail because:
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
