Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)

James Heilman Fri, 15 Jul 2016 09:20:18 -0700

The "jurrasic world" example is a good one as it was "fixed" by User:Foxj
adding a redirect
https://en.wikipedia.org/w/index.php?title=Jurrasic_world&action=history


Agreed, we would need to be careful. The chance of many different IPs all
searching for "DF198671E" is low, but I agree it is not zero, and we would
need to have people review the results before they are displayed.

I guess the question is how much work would it take to look at this sort of
data for more examples like "jurrasic world"?
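
A minimal sketch of the filter discussed in this thread, in Python. The log
format (an iterable of (ip, query) pairs), the function name, and the constant
names are assumptions made for illustration; nothing here reflects how the
search logs are actually stored. It keeps only zero-result queries of at most
50 characters that more than 100 distinct IPs searched for, and returns the
top 100 as a shortlist for human review. It does not remove the risk raised
in the quoted reply below: a reviewer can still miss something sensitive.

```python
from collections import defaultdict

# Thresholds floated in the thread; the constant names are hypothetical.
MIN_DISTINCT_IPS = 100   # keep queries searched by MORE than this many IPs
MAX_QUERY_LENGTH = 50    # drop queries longer than this many characters
REVIEW_BATCH_SIZE = 100  # size of the shortlist handed to human reviewers


def candidates_for_review(zero_result_log,
                          min_ips=MIN_DISTINCT_IPS,
                          max_len=MAX_QUERY_LENGTH,
                          batch=REVIEW_BATCH_SIZE):
    """Return [(query, distinct_ip_count), ...] for human review.

    zero_result_log is an iterable of (ip, query) pairs. This is an
    assumed, simplified log format used only for this sketch.
    """
    ips_by_query = defaultdict(set)
    for ip, query in zero_result_log:
        # Normalise case and whitespace so trivial variants are counted together.
        normalized = " ".join(query.lower().split())
        if normalized and len(normalized) <= max_len:
            ips_by_query[normalized].add(ip)

    popular = [(q, len(ips)) for q, ips in ips_by_query.items()
               if len(ips) > min_ips]
    # Most widely searched first; ties broken alphabetically for stable output.
    popular.sort(key=lambda item: (-item[1], item[0]))
    return popular[:batch]
```

The output is only a shortlist: each entry would still need to be read by a
person before anything is published.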

James

On Fri, Jul 15, 2016 at 10:05 AM, Dan Garry <dga...@wikimedia.org> wrote:

> On 15 July 2016 at 08:44, James Heilman <jmh...@gmail.com> wrote:
> >
> > Thanks for the in-depth discussion. So if the terms people are using that
> > result in "zero search results" are typically gibberish why do we care if
> > 30% of our searches result in "zero search results"? A big deal was made
> > about this a while ago.
> >
>
> Good question! I originally used to say that it was my aspiration that
> users should never get zero results when searching Wikipedia. As a result
> of Trey's analysis, I don't say that any more. ;-) There are many
> legitimate cases where users should get zero results. However, there are
> still tons of examples of where giving users zero results is incorrect;
> "jurrasic world" was a prominent example of that.
>
> It's still not quite right to say that *all* the terms that people use to
> get zero results are gibberish. There is an extremely long tail
> <https://en.wikipedia.org/wiki/Long_tail> of zero results queries that
> aren't gibberish, it's just that the top 100 are dominated by gibberish.
> This would mean we'd have to release many, many more than the top 100,
> which significantly increases the risk of releasing personal information.
>
>
> > If one was just to look at those search terms that more than 100 IPs
> > searched for would that not remove the concerns about anonymity? One
> > could also limit the length of the searches displayed to 50 characters.
> > And just provide the first 100 with an initial human review to make sure
> > we are not missing anything.
> >
>
> The problem with this is that there are still no guarantees. What if you
> saw the search query "DF198671E"? You might not think anything of it, but I
> would recognise it as an example of a national insurance number
> <https://en.wikipedia.org/wiki/National_Insurance_number>, the British
> equivalent of a social security number [1]. There's always going to be the
> potential that we accidentally release something sensitive when we release
> arbitrary user input, even if it's manually examined by humans.
>
> So, in summary:
>
>    - The top 100 zero results queries are dominated by gibberish.
>    - There's a long tail of zero results queries, meaning we'd have to
>    release many more than the top 100.
>    - Manually examining the top zero results queries is not a foolproof way
>    of eliminating personal data since it's arbitrary user input.
>
> I'm happy to answer any questions. :-)
>
> Thanks,
> Dan
>
> [1]: Don't panic, this example national insurance number is actually
> invalid. ;-)
>
> --
> Dan Garry
> Lead Product Manager, Discovery
> Wikimedia Foundation
> _______________________________________________
> Wikimedia-l mailing list, guidelines at:
> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> New messages to: Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>




-- 
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
