Re: [Wikimedia-l] Fwd: [discovery] Fwd: Improving search (sort of)

2016-08-04 Thread James Heilman
I'm wondering whether Google would be willing to share that sort of data with us.
That would definitely be useful for certain languages.

J

On Thu, Aug 4, 2016 at 12:48 PM, WereSpielChequers <
werespielchequ...@gmail.com> wrote:

> At Wikimania in Gdansk someone from Google gave an interesting if somewhat
> controversial presentation on search improvements this way.
>
> From my memory of the presentation (it was a few years ago): for several
> languages including some Indic ones, I think Bangla and Telugu, Google had
> listed the 500 most common search terms that didn't have a Wikipedia
> article.
>
> They had then paid some translators to translate articles from English into
> those languages.
>
> This had become controversial because it resulted in a number of articles
> on Hollywood film stars, and at least one of the editors in those wikis
> didn't think that people who spoke his language were interested in
> Hollywood film stars. Also the people writing those articles didn't behave
> as if cooperating with the community was part of their remit. One language,
> it may have been Bangla, actually blocked the translators.
>
> But logically the less complete a Wikipedia the more likely it is to have
> search terms that we could create articles for.
>
> I could even buy the idea that few of the unsuccessful searches on English
> have an obvious article.
>
> But for smaller Wikipedias this would be a useful tool to promote growth
> and to be more reader focussed.
>
> If the list was only made available as a deleted list so only admins could
> read it then that should resolve the issues of some searches being terms we
> wouldn't want to publicly list.
>
> WSC
> ___
> Wikimedia-l mailing list, guidelines at:
> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> New messages to: Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> 




-- 
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com


Re: [Wikimedia-l] Fwd: [discovery] Fwd: Improving search (sort of)

2016-08-04 Thread WereSpielChequers
At Wikimania in Gdansk someone from Google gave an interesting if somewhat
controversial presentation on search improvements this way.

From my memory of the presentation (it was a few years ago): for several
languages including some Indic ones, I think Bangla and Telugu, Google had
listed the 500 most common search terms that didn't have a Wikipedia
article.

They had then paid some translators to translate articles from English into
those languages.

This had become controversial because it resulted in a number of articles
on Hollywood film stars, and at least one of the editors in those wikis
didn't think that people who spoke his language were interested in
Hollywood film stars. Also the people writing those articles didn't behave
as if cooperating with the community was part of their remit. One language,
it may have been Bangla, actually blocked the translators.

But logically the less complete a Wikipedia the more likely it is to have
search terms that we could create articles for.

I could even buy the idea that few of the unsuccessful searches on English
have an obvious article.

But for smaller Wikipedias this would be a useful tool to promote growth
and to be more reader focussed.

If the list was only made available as a deleted list so only admins could
read it then that should resolve the issues of some searches being terms we
wouldn't want to publicly list.

WSC


Re: [Wikimedia-l] Fwd: [discovery] Fwd: Improving search (sort of)

2016-08-04 Thread Gerard Meijssen
Hoi,
English Wikipedia is, to be honest, the least of my concerns. It is no
different from any of the others. As with all other Wikipedias, there are
significantly more items with a label in the language of the project in
Wikidata than the project has articles. So much so that deleted articles
are corner cases, and so are items with BLP issues. I seem to remember that
Wikidata has over 20% more items with an English label than English
Wikipedia has articles (it could be 40%, but I am not sure).

When you then analyse those items, you find professors at Harvard or any
other Ivy League university who have an article in another language, and you
find people who should be in a category (based on WD info) but are not. You
will find errors in a project because that project has it wrong. You will
find many items with no article that are clearly notable.

When a BLP issue exists, it is typically in the text of the article not in
the data at Wikidata. So it typically does not have much bearing on
Wikidata anyway. When it does, it means that the WMF has to consider BLP
issues from an all-projects point of view. So far, for good reasons, that has
not happened as far as I know.

When you consider search and improving search, one of the KPIs is labels
in a language. A Wikidata item already has some notability and when we can
grow the number of labels, we can improve search, coverage and the likelihood
of new articles in those languages. An example is that all current
parliamentarians of India have an item but they do not have labels in the
languages of India. When we actively encourage more labels, e.g. by exposing
them in search as we do on the Tamil Wikipedia [1], we serve the data we
hold in addition to what a local search provides.

It is particularly important for the small Wikipedias to have search
functionality that provides a rich result. I am afraid that the current
en.wp-centric and problem-averse attitude handicaps what we could achieve
in the small Wikipedias.
Thanks,
  GerardM


[1]
https://ta.wikipedia.org/w/index.php?search=Valerie+Sutton=%E0%AE%9A%E0%AE%BF%E0%AE%B1%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AF%81:Search=Go=8g40yt1e9f4k6j09iatmkfg64

On 3 August 2016 at 21:12, Deborah Tankersley 
wrote:

> Hi Gerard,
>
> I chatted with Trey (who did the analysis) for his opinion on your
> concerns. Here is his response:
>
> Hi Gerard,
>
> I wasn't trying to pass judgement on notability when the search referred to
> > a particular person, place, or thing, but I did take it as a sign of
> > non-notability when a page had been created and then deleted for a
> > particular person or website. Those items could become notable in the
> > future, and any of them might be notable enough for Wikidata—but the
> > original discussion seemed to be mainly about queries to English Wikipedia.
> > My conclusion, for English Wikipedia, is that there is not some gold mine
> > of super high-frequency typos or new topics that we are missing out on.
> > More importantly, there are real privacy concerns, and simple fixes—like
> > requiring some number of unique IP addresses to have searched for
> > something—are not enough.
> > I have looked at thousands of queries from about a dozen other language
> > Wikipedias—some in more depth than others, and admittedly not usually
> > sorted by frequency—but my intuition is the same as it was for English
> > Wikipedia: not enough of value there to override privacy concerns.
> > Automation is out for privacy reasons and manual review is not worth it,
> > so this isn't a priority for Discovery right now.
>
>
> I hope that helps to further explain what we found and why we're not acting
> further on this issue at this time.
>
> Cheers,
>
> Deb
>
> --
> Deb Tankersley
> Product Manager, Discovery
> IRC: debt
> Wikimedia Foundation
>
> On Sat, Jul 30, 2016 at 1:30 AM, Gerard Meijssen <
> gerard.meijs...@gmail.com>
> wrote:
>
> > Hoi,
> > So what do we have? It is the list of the most missed searches for the English
> > Wikipedia. Arguably the searches include content that is "iffy". But when
> > many people seek info on a porn site, on what basis is it not notable? This
> > is only for en.wp and the results for other languages can be quite
> > different. The problem with dismissing the need for this data in this way is
> > that it supports the status quo for all Wikipedias. It does not suggest
> > what we can do with a porn site. We could for instance have a Wikidata item
> > stating that it is a porn site and leave it at that.
> >
> > When you compare Wikidata with Wikipedia, Wikidata has significantly more
> > data about whatever than Wikipedia does. All of these subjects are notable by
> > Wikidata standards, and many are notable by English Wikipedia standards.
> > Knowing what subjects are missed in Wikipedia and what people are looking
> > for is important because they are the people Wikipedia misses.
> >
> > NB: thanks for the data and the project.
> > Thanks,
> >   GerardM
> >
> > On 

Re: [Wikimedia-l] Fwd: [discovery] Fwd: Improving search (sort of)

2016-08-03 Thread Deborah Tankersley
Hi Gerard,

I chatted with Trey (who did the analysis) for his opinion on your
concerns. Here is his response:

Hi Gerard,

I wasn't trying to pass judgement on notability when the search referred to
> a particular person, place, or thing, but I did take it as a sign of
> non-notability when a page had been created and then deleted for a
> particular person or website. Those items could become notable in the
> future, and any of them might be notable enough for Wikidata—but the
> original discussion seemed to be mainly about queries to English Wikipedia.
> My conclusion, for English Wikipedia, is that there is not some gold mine
> of super high-frequency typos or new topics that we are missing out on.
> More importantly, there are real privacy concerns, and simple fixes—like
> requiring some number of unique IP addresses to have searched for
> something—are not enough.
> I have looked at thousands of queries from about a dozen other language
> Wikipedias—some in more depth than others, and admittedly not usually
> sorted by frequency—but my intuition is the same as it was for English
> Wikipedia: not enough of value there to override privacy concerns.
> Automation is out for privacy reasons and manual review is not worth it,
> so this isn't a priority for Discovery right now.


I hope that helps to further explain what we found and why we're not acting
further on this issue at this time.

Cheers,

Deb

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation

On Sat, Jul 30, 2016 at 1:30 AM, Gerard Meijssen 
wrote:

> Hoi,
> So what do we have? It is the list of the most missed searches for the English
> Wikipedia. Arguably the searches include content that is "iffy". But when
> many people seek info on a porn site, on what basis is it not notable? This
> is only for en.wp and the results for other languages can be quite
> different. The problem with dismissing the need for this data in this way is
> that it supports the status quo for all Wikipedias. It does not suggest
> what we can do with a porn site. We could for instance have a Wikidata item
> stating that it is a porn site and leave it at that.
>
> When you compare Wikidata with Wikipedia, Wikidata has significantly more
> data about whatever than Wikipedia does. All of these subjects are notable by
> Wikidata standards, and many are notable by English Wikipedia standards.
> Knowing what subjects are missed in Wikipedia and what people are looking
> for is important because they are the people Wikipedia misses.
>
> NB: thanks for the data and the project.
> Thanks,
>   GerardM
>
> On 29 July 2016 at 23:48, Deborah Tankersley 
> wrote:
>
> > Forwarding to the Wikimedia mailing list; I'm sorry for the lateness!
> >
> >
> > --
> > Deb Tankersley
> > Product Manager, Discovery
> > IRC: debt
> > Wikimedia Foundation
> >
> > -- Forwarded message --
> > From: Trey Jones 
> > Date: Mon, Jul 25, 2016 at 11:58 AM
> > Subject: Re: [discovery] Fwd: [Wikimedia-l] Improving search (sort of)
> > To: A public mailing list about Wikimedia Search and Discovery projects <
> > discov...@lists.wikimedia.org>
> > Cc: James Heilman 
> >
> >
> > I decided to look into this as my 10% project last week. It ended up being
> > a 15% project, but I wanted to finish it up.
> >
> > I carefully reviewed and categorized the top 100 "unsuccessful" (i.e.,
> > zero-results) queries from May 2016, and skimmed the top 1,000 from May,
> > and skimmed and compared the top 100 / 1,000 for June.
> >
> > The top result (with several variants in the top 100) is a porn site that
> > has had a wiki page created and deleted several times. Various websites
> > round out the top 10. Internet personalities and websites dominate the top
> > 100 and several have had pages created and deleted over the years. There's
> > strong evidence of links being used for some queries—though I didn't try
> > to track them down. There's plenty of personally identifiable information in
> > the top 1000 most frequent queries. More than 10% of the queries (by
> > volume) get good results from the completion suggester or "did you mean"
> > spelling suggestions, and more than 10% have some results approximately two
> > months later (i.e., late last week).
> >
> > Obvious refinements to the search strategy would eliminate so many
> > high-frequency queries that any useful mining would be down to slogging
> > through the low-impact long tail.
> >
> > I don’t think there’s a lot here worth extracting, though others may
> > disagree. The privacy concerns expressed earlier are genuine, and simple
> > attempts to filter PII (using patterns, minimum IP counts, etc) are not
> > guaranteed to be effective.
> >
> > For lots more details (but no actual queries), see here:
> >
> >
> >
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Search_Queries
> >
> > —Trey
> >
> > Trey 

Re: [Wikimedia-l] Fwd: [discovery] Fwd: Improving search (sort of)

2016-07-30 Thread Gerard Meijssen
Hoi,
So what do we have? It is the list of the most missed searches for the English
Wikipedia. Arguably the searches include content that is "iffy". But when
many people seek info on a porn site, on what basis is it not notable? This
is only for en.wp and the results for other languages can be quite
different. The problem with dismissing the need for this data in this way is
that it supports the status quo for all Wikipedias. It does not suggest
what we can do with a porn site. We could for instance have a Wikidata item
stating that it is a porn site and leave it at that.

When you compare Wikidata with Wikipedia, Wikidata has significantly more
data about whatever than Wikipedia does. All of these subjects are notable by
Wikidata standards, and many are notable by English Wikipedia standards.
Knowing what subjects are missed in Wikipedia and what people are looking
for is important because they are the people Wikipedia misses.

NB: thanks for the data and the project.
Thanks,
  GerardM

On 29 July 2016 at 23:48, Deborah Tankersley 
wrote:

> Forwarding to the Wikimedia mailing list; I'm sorry for the lateness!
>
>
> --
> Deb Tankersley
> Product Manager, Discovery
> IRC: debt
> Wikimedia Foundation
>
> -- Forwarded message --
> From: Trey Jones 
> Date: Mon, Jul 25, 2016 at 11:58 AM
> Subject: Re: [discovery] Fwd: [Wikimedia-l] Improving search (sort of)
> To: A public mailing list about Wikimedia Search and Discovery projects <
> discov...@lists.wikimedia.org>
> Cc: James Heilman 
>
>
> I decided to look into this as my 10% project last week. It ended up being
> a 15% project, but I wanted to finish it up.
>
> I carefully reviewed and categorized the top 100 "unsuccessful" (i.e.,
> zero-results) queries from May 2016, and skimmed the top 1,000 from May,
> and skimmed and compared the top 100 / 1,000 for June.
>
> The top result (with several variants in the top 100) is a porn site that
> has had a wiki page created and deleted several times. Various websites
> round out the top 10. Internet personalities and websites dominate the top
> 100 and several have had pages created and deleted over the years. There's
> strong evidence of links being used for some queries—though I didn't try to
> track them down. There's plenty of personally identifiable information in
> the top 1000 most frequent queries. More than 10% of the queries (by
> volume) get good results from the completion suggester or "did you mean"
> spelling suggestions, and more than 10% have some results approximately two
> months later (i.e., late last week).
>
> Obvious refinements to the search strategy would eliminate so many
> high-frequency queries that any useful mining would be down to slogging
> through the low-impact long tail.
>
> I don’t think there’s a lot here worth extracting, though others may
> disagree. The privacy concerns expressed earlier are genuine, and simple
> attempts to filter PII (using patterns, minimum IP counts, etc) are not
> guaranteed to be effective.
>
> For lots more details (but no actual queries), see here:
>
>
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Search_Queries
>
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Fri, Jul 15, 2016 at 11:31 AM, Trey Jones  wrote:
>
> > Finally, if this is important enough and the task gets prioritized, I'd be
> > willing to dive back in and go through the process once and pull out the
> > top zero-results queries, this time with basic bot exclusion and IP
> > deduplication—which we didn't do early on because we didn't realize what
> > a mess the data was. We could process a week or a month of data and
> > categorize the top 100 to 500 results in terms of personal info, junk,
> > porn, and whatever other categories we want or that bubble up from the
> > data, and perhaps publish the non-personal-info part of the list as an
> > example, either to persuade ourselves that this is worth pursuing, or as
> > a clearer counter to future calls to do so.
> > —Trey
> >
> >>
>
> > -- Forwarded message --
> >> From: "James Heilman" 
> >> Date: Jul 15, 2016 06:33
> >> Subject: [Wikimedia-l] Improving search (sort of)
> >> To: "Wikimedia Mailing List" 
> >> Cc:
> >>
> >> A while ago I requested a list of the "most frequently searched for terms
> >> for which no Wikipedia articles are returned". This would allow the
> >> community to then create redirects or new pages as appropriate and help
> >> address the "zero results rate" of about 30%.
> >>
> >> While we are still waiting for this data I have recently come across a list
> >> of the most frequently clicked on redlinks on En WP produced by Andrew West
> >> https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks Many of
> >> these can be reasonably addressed with a redirect as 

[Wikimedia-l] Fwd: [discovery] Fwd: Improving search (sort of)

2016-07-29 Thread Deborah Tankersley
Forwarding to the Wikimedia mailing list; I'm sorry for the lateness!


--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation

-- Forwarded message --
From: Trey Jones 
Date: Mon, Jul 25, 2016 at 11:58 AM
Subject: Re: [discovery] Fwd: [Wikimedia-l] Improving search (sort of)
To: A public mailing list about Wikimedia Search and Discovery projects <
discov...@lists.wikimedia.org>
Cc: James Heilman 


I decided to look into this as my 10% project last week. It ended up being
a 15% project, but I wanted to finish it up.

I carefully reviewed and categorized the top 100 "unsuccessful" (i.e.,
zero-results) queries from May 2016, and skimmed the top 1,000 from May,
and skimmed and compared the top 100 / 1,000 for June.

The top result (with several variants in the top 100) is a porn site that
has had a wiki page created and deleted several times. Various websites
round out the top 10. Internet personalities and websites dominate the top
100 and several have had pages created and deleted over the years. There's
strong evidence of links being used for some queries—though I didn't try to
track them down. There's plenty of personally identifiable information in
the top 1000 most frequent queries. More than 10% of the queries (by
volume) get good results from the completion suggester or "did you mean"
spelling suggestions, and more than 10% have some results approximately two
months later (i.e., late last week).

Obvious refinements to the search strategy would eliminate so many
high-frequency queries that any useful mining would be down to slogging
through the low-impact long tail.

I don’t think there’s a lot here worth extracting, though others may
disagree. The privacy concerns expressed earlier are genuine, and simple
attempts to filter PII (using patterns, minimum IP counts, etc) are not
guaranteed to be effective.
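
As a rough illustration of the "minimum IP count" style of filter mentioned
above, here is a minimal Python sketch. The input format (pairs of query
string and client IP for searches that returned nothing), the function name
top_zero_result_queries, and the default threshold are all assumptions made
for illustration; this is not the Discovery team's actual pipeline. As noted
above, a filter like this is not guaranteed to remove personally identifiable
information.

```python
from collections import defaultdict


def top_zero_result_queries(log_rows, min_unique_ips=5, limit=100):
    """Rank zero-result queries by how many distinct IPs issued them.

    log_rows is a hypothetical iterable of (query, ip) pairs covering only
    searches that returned no results.
    """
    ips_by_query = defaultdict(set)
    for query, ip in log_rows:
        # Light normalization so trivial variants are counted together.
        normalized = " ".join(query.lower().split())
        if normalized:
            # Using a set deduplicates repeated searches from the same IP.
            ips_by_query[normalized].add(ip)

    counts = [
        (query, len(ips))
        for query, ips in ips_by_query.items()
        if len(ips) >= min_unique_ips  # drop queries seen from too few IPs
    ]
    # Most widely searched first; ties broken alphabetically for stable output.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:limit]
```

With min_unique_ips=2, a query typed from two different addresses is kept,
while one typed repeatedly from a single address is dropped. A threshold only
reduces exposure: a name or phone number searched from several addresses
would still pass, which is why manual review would still be needed.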

For lots more details (but no actual queries), see here:

https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Search_Queries

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Fri, Jul 15, 2016 at 11:31 AM, Trey Jones  wrote:

> Finally, if this is important enough and the task gets prioritized, I'd be
> willing to dive back in and go through the process once and pull out the
> top zero-results queries, this time with basic bot exclusion and IP
> deduplication—which we didn't do early on because we didn't realize what a
> mess the data was. We could process a week or a month of data and
> categorize the top 100 to 500 results in terms of personal info, junk,
> porn, and whatever other categories we want or that bubble up from the
> data, and perhaps publish the non-personal-info part of the list as an
> example, either to persuade ourselves that this is worth pursuing, or as a
> clearer counter to future calls to do so.
> —Trey
>
>>

> -- Forwarded message --
>> From: "James Heilman" 
>> Date: Jul 15, 2016 06:33
>> Subject: [Wikimedia-l] Improving search (sort of)
>> To: "Wikimedia Mailing List" 
>> Cc:
>>
>> A while ago I requested a list of the "most frequently searched for terms
>> for which no Wikipedia articles are returned". This would allow the
>> community to then create redirects or new pages as appropriate and help
>> address the "zero results rate" of about 30%.
>>
>> While we are still waiting for this data I have recently come across a list
>> of the most frequently clicked on redlinks on En WP produced by Andrew West:
>> https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks Many of
>> these can be reasonably addressed with a redirect as the issue is often
>> capitalization.
>>
>> Does anyone know where things are at with respect to producing the list of
>> most searched-for terms that return nothing?
>>
>> --
>> James Heilman
>> MD, CCFP-EM, Wikipedian
>>
>
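
The capitalization case James mentions in the quoted message above (redlinks
that differ from an existing title only by case) can be sketched in Python as
follows. The function name suggest_case_redirects and the plain lists of
titles are assumptions for illustration; a real run would need actual title
data and human review before any redirect is created.

```python
def suggest_case_redirects(missing_titles, existing_titles):
    """Map each missing title to existing titles that differ only by case."""
    # Index existing titles by their case-folded form.
    by_folded = {}
    for title in existing_titles:
        by_folded.setdefault(title.casefold(), []).append(title)

    suggestions = {}
    for missing in missing_titles:
        matches = [
            title
            for title in by_folded.get(missing.casefold(), [])
            if title != missing  # an exact match is not a redlink
        ]
        if matches:
            suggestions[missing] = sorted(matches)
    return suggestions
```

Each key in the result is a candidate redirect source and each value lists
the possible targets; titles with no case-insensitive match are left out.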
___
discovery mailing list
discov...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery