[Wikitech-l] [BREAKING CHANGE] New optional RevisionRecord param added to ContentHandler::getDataForSearchIndex()

2022-10-31 Thread David Causse
Hi,

The method ContentHandler::getDataForSearchIndex() is used by (non-core) MW
SearchEngine implementations and was added in MW 1.28 to gather the data to
be indexed.
In 1.40 its signature will change to let SearchEngine implementations pass
an optional RevisionRecord param
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/832957/10/includes/content/ContentHandler.php#1382>
to clarify what revision is being indexed (instead of assuming that the
"latest" was requested).
This is a breaking change: if you maintain a ContentHandler subclass and
have overridden the getDataForSearchIndex() method, you will have to change
your method signature from:

public function getDataForSearchIndex( WikiPage $page, ParserOutput $output, SearchEngine $engine )

to

public function getDataForSearchIndex( WikiPage $page, ParserOutput $output, SearchEngine $engine, RevisionRecord $revision = null )

If you own a SearchEngine implementation and rely on
ContentHandler::getDataForSearchIndex(), this is not a breaking change, but
it is advised to pass the new RevisionRecord parameter when calling it.
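For illustration, an updated override could look like this (a sketch: the class and the extra field are hypothetical; only the signature mirrors the new core method, and the null default keeps it compatible with callers that don't pass a revision yet):

```php
use MediaWiki\Revision\RevisionRecord;

class MyContentHandler extends TextContentHandler {
	public function getDataForSearchIndex(
		WikiPage $page,
		ParserOutput $output,
		SearchEngine $engine,
		RevisionRecord $revision = null
	) {
		$fields = parent::getDataForSearchIndex( $page, $output, $engine, $revision );
		// Hypothetical extra field derived from the revision actually being
		// indexed instead of assuming the latest one.
		$fields['my_indexed_timestamp'] = $revision ? $revision->getTimestamp() : null;
		return $fields;
	}
}
```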

Relatedly, the hook SearchDataForIndex
<https://www.mediawiki.org/wiki/Manual:Hooks/SearchDataForIndex> will be
deprecated (silently) and SearchDataForIndex2
<https://www.mediawiki.org/wiki/Manual:Hooks/SearchDataForIndex2> should be
used instead.
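For handlers of the new hook, the shape is roughly the following (a sketch; the parameter list is my reading of the new hook, so double-check it against the manual page linked above):

```php
// Hypothetical handler for SearchDataForIndex2; note that it receives the
// RevisionRecord, unlike the old SearchDataForIndex hook.
public function onSearchDataForIndex2(
	array &$fields,
	ContentHandler $handler,
	WikiPage $page,
	ParserOutput $output,
	SearchEngine $engine,
	RevisionRecord $revision
) {
	// Example: expose the timestamp of the revision actually being indexed.
	$fields['my_revision_timestamp'] = $revision->getTimestamp();
}
```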

Please let me know if you have questions/strong objections to this approach.

--
David Causse
___
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

[Wikitech-l] Re: Getting CirrusSearch full-text search to incorporate a new field?

2021-10-25 Thread David Causse
Hi Matto,

please see my answers inline.

On Mon, Oct 25, 2021 at 6:37 AM Matto Marjanovic  wrote:

> [...]

It would be a-ok if the 'more_file_text' could just be treated as additional
> content for the 'file_text' field.  (However, simply populating the
> existing
> 'file_text' field via the SearchDataForIndexHook does not work, because the
> FileContentHandler::getDataForSearchIndex() method runs after the hook and
> always forcefully overwrites the 'file_text' field.)
>

This should be doable by implementing the CirrusSearchBuildDocumentParse
hook, which runs very late in the process (see the Cirrus docs under
docs/hooks.txt).
CirrusSearchBuildDocumentParse alone may be enough if you have the data at
hand when that hook runs; otherwise, use a combination of
SearchDataForIndexHook to populate a "more_file_text" field (as you do now)
and CirrusSearchBuildDocumentParse to append "more_file_text" to the
existing "file_text", possibly emptying the "more_file_text" field if you
no longer need it.

There are probably more ways to achieve what you want, with greater control
over the ranking, but they would be much more involved (e.g. writing
your own search query builder).
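A minimal sketch of the second approach (the hook signature here is an assumption; the authoritative parameter list is in CirrusSearch's docs/hooks.txt):

```php
// LocalSettings.php -- hypothetical handler; verify the actual hook
// signature in CirrusSearch's docs/hooks.txt before using.
$wgHooks['CirrusSearchBuildDocumentParse'][] = static function (
	\Elastica\Document $doc
) {
	if ( $doc->has( 'more_file_text' ) ) {
		// Fold the extra text into file_text, then drop the helper field.
		$doc->set(
			'file_text',
			$doc->get( 'file_text' ) . ' ' . $doc->get( 'more_file_text' )
		);
		$doc->remove( 'more_file_text' );
	}
	return true;
};
```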

--
David Causse

Re: [Wikitech-l] How are value node IRIs generated?

2020-12-10 Thread David Causse
On Thu, Dec 10, 2020 at 10:39 AM David Causse  wrote:

>
> About 35976d7cb070b06a2dec1482aaca2982df3fedd4, which I think you obtained
> from the wbgetclaims api [0]: I think this hash identifies the Snak while
> the one you see from the query service identifies the Value. The former
> uniquely identifies the Snak, so that for another entity using the same
> value [1] the Snak hash is different
> (8eb6208639efa82b5e7e4c709b7d18cbfca67411) but the value is identical
> (+2019-12-14T00:00:00Z).
>

Please read a249e4ebdd5022be6cc7fe25feb93e0503eac6ee instead of
8eb6208639efa82b5e7e4c709b7d18cbfca67411 (wrong copy/paste).
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] How are value node IRIs generated?

2020-12-10 Thread David Causse
Hi,

On Wed, Dec 9, 2020 at 6:22 PM  wrote:

> On Wed, Dec 9, 2020 at 6:17 PM Baskauf, Steven James <
> steve.bask...@vanderbilt.edu> wrote:
>
>> When performing a SPARQL query at the WD Query Service (example:
>> https://w.wiki/ptp), these value nodes are identified by an IRI such as
>> wdv: 742521f02b14bf1a6cbf7d4bc599eb77 (
>> http://www.wikidata.org/value/742521f02b14bf1a6cbf7d4bc599eb77). The
>> local name part of this IRI seems to be a hash of something. However, when
>> I compare the hash values from the snak JSON returned from the API for the
>> same value node (see
>> https://gist.github.com/baskaufs/8c86bc5ceaae19e31fde88a2880cf0e9 for
>> the example), the hash associated with the value node
>> (35976d7cb070b06a2dec1482aaca2982df3fedd4 in this case) does not have any
>> relationship to the local name part if the IRI for that value node.
>>
>
>
> https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Value_representation
> states:
> > Full values are represented as nodes having prefix wdv: and the local
> name being the hash of the value contents (e.g.
> wdv:382603eaa501e15688076291fc47ae54). There is no guarantee of the value
> of the hash except for the fact that different values will be represented
> by different hashes, and same value mentioned in different places will have
> the same hash.
>
>
Indeed, no assumptions should be made about this hash value. The initial
goal was (I think) to let two unrelated claims that share the same complex
value reuse it, instead of reifying one value node per claim/reference. I
would strongly advise against storing this hash for later use.

About 35976d7cb070b06a2dec1482aaca2982df3fedd4, which I think you obtained
from the wbgetclaims api [0]: I think this hash identifies the Snak while
the one you see from the query service identifies the Value. The former
uniquely identifies the Snak, so that for another entity using the same
value [1] the Snak hash is different
(8eb6208639efa82b5e7e4c709b7d18cbfca67411) but the value is identical
(+2019-12-14T00:00:00Z).

I don't think you can extract the hash of the value using wbgetclaims but
it is visible using the RDF output[2].

0:
https://www.wikidata.org/w/api.php?action=wbgetclaims=Q42352198=P496=2
1:
https://www.wikidata.org/w/api.php?action=wbgetclaims=Q232113=P570=2
2:
https://www.wikidata.org/wiki/Special:EntityData/Q42352198.ttl?flavor=dump


Re: [Wikitech-l] Major trouble with Cirrus forceSearchIndex.php script

2019-11-28 Thread David Causse
On Wed, Nov 27, 2019 at 7:05 PM David Causse  wrote:

> On Sat, Nov 16, 2019 at 6:58 PM Aran via Wikitech-l <
> wikitech-l@lists.wikimedia.org> wrote:
>
>> Hi all,
>> [...]
>> Please if anyone has heard of this kind of things and could point us in
>> the right direction here that would be awesome!
>>
>>
> Hi,
> no, I've never encountered such a random scenario. If inspecting the
> various logs (MediaWiki and elasticsearch) did not provide any clues, I
> would suggest adding debug log messages to the DataSender::sendData method
> (includes/DataSender.php). This is the last method called from MediaWiki
> before reaching elasticsearch.
> If you find something interesting or something you think is broken, please
> file a task at http://phabricator.wikimedia.org/ under the tag
> CirrusSearch.
>


I forgot to mention that we host office hours on the first Wednesday of
every month; this might be a good opportunity to discuss this:
Details for our next meeting:

Date: Wednesday, Dec 6th, 2019

Time: 16:00-17:00 GMT / 08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vyc-jvgq-dww

Re: [Wikitech-l] Major trouble with Cirrus forceSearchIndex.php script

2019-11-27 Thread David Causse
On Sat, Nov 16, 2019 at 6:58 PM Aran via Wikitech-l <
wikitech-l@lists.wikimedia.org> wrote:

> Hi all,
> [...]
> Please if anyone has heard of this kind of things and could point us in
> the right direction here that would be awesome!
>
>
Hi,
no, I've never encountered such a random scenario. If inspecting the
various logs (MediaWiki and elasticsearch) did not provide any clues, I
would suggest adding debug log messages to the DataSender::sendData method
(includes/DataSender.php). This is the last method called from MediaWiki
before reaching elasticsearch.
If you find something interesting or something you think is broken, please
file a task at http://phabricator.wikimedia.org/ under the tag CirrusSearch.
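For what it's worth, such a debug statement could look like this (a sketch using MediaWiki's PSR-3 logging; the variable names are placeholders for whatever is in scope at the point you pick inside sendData):

```php
use MediaWiki\Logger\LoggerFactory;

// Inside CirrusSearch\DataSender::sendData(), just before the documents
// are shipped to elasticsearch:
LoggerFactory::getInstance( 'CirrusSearch' )->debug(
	'sendData: sending {count} documents to the {index} index',
	[
		'count' => count( $documents ), // placeholder variable
		'index' => $indexSuffix,        // placeholder variable
	]
);
```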

David.

Re: [Wikitech-l] Declaring methods final in classes

2019-08-29 Thread David Causse
On Thu, Aug 29, 2019 at 9:36 AM Daniel Kinzler 
wrote:

>
> Narrow interfaces help with that. If we had for instance a cache interface
> that
> defined just the get() and set() methods, and that's all the code needs,
> then we
> can just provide a mock for that interface, and we wouldn't have to worry
> about
> WANObjectCache or its final methods at all.
>
>
I think this is the right solution; forbidding one feature of the language
(final) because of the current design of WANObjectCache seems to go too
far, in my opinion.

--
David Causse

[Wikitech-l] Upcoming Search Platform Office Hours—July 3rd

2019-07-02 Thread David Causse
The Search Platform Team
 usually holds
office hours the first Wednesday of each month. Come talk to us about
anything related to Wikimedia search!


Feel free to add your items to the Etherpad Agenda for the next meeting.


Details for our next meeting:

Date: Wednesday, July 3rd, 2019

Time: 15:00-16:00 GMT / 08:00-9:00 PDT / 11:00-12:00 EDT / 17:00-18:00 CEST

Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours

Google Meet link: https://meet.google.com/vyc-jvgq-dww


*N.B.:* Google Meet System Requirements



Hope to talk to you in a week!

—David (Trey is out this week and will be back shortly)

Re: [Wikitech-l] Advice on Tuning Search?

2018-11-02 Thread David Causse
On Fri, Nov 2, 2018 at 3:51 AM Hogan (US), Michael C <
michael.c.hog...@boeing.com> wrote:

> Can anyone point me to a starting point for learning about how to tune
> CirrusSearch (or examples)? I found the CirrusSearchScoreBuilder page [1],
> which implies it is possible to modify how search results are ranked. But,
> the documentation page hasn't been created yet. Thank you!
>

Hi,

There are many ways to tune the ranking of search results.
The hook you mention is designed to be used by extensions that want to tune
everything related to the search query itself. I strongly discourage using
it; it is highly experimental and will be removed in the future.

To understand how Cirrus scores documents I suggest starting with this
documentation [2].
You can then tune the retrieval query using profiles and the
$wgCirrusSearchFullTextQueryBuilderProfiles config array, e.g.:
$wgCirrusSearchFullTextQueryBuilderProfiles = [
	'my_custom_profile' => [
		'builder_class' => \CirrusSearch\Query\FullTextSimpleMatchQueryBuilder::class,
		'settings' => [
			'default_min_should_match' => '1',
			'default_query_type' => 'most_fields',
			'default_stem_weight' => 3.0,
			'fields' => [
				'title' => 0.3,
				'redirect.title' => [
					'boost' => 0.27,
					'in_dismax' => 'redirects_or_shingles',
				],
				'suggest' => [
					'is_plain' => true,
					'boost' => 0.20,
					'in_dismax' => 'redirects_or_shingles',
				],
				'category' => 0.05,
				'heading' => 0.05,
				'text' => [
					'boost' => 0.6,
					'in_dismax' => 'text_and_opening_text',
				],
				'opening_text' => [
					'boost' => 0.5,
					'in_dismax' => 'text_and_opening_text',
				],
				'auxiliary_text' => 0.05,
				'file_text' => 0.5,
			],
			'phrase_rescore_fields' => [
				'all' => 0.06,
				'all.plain' => 0.1,
			],
		],
	],
];

And then activate it by default:
$wgCirrusSearchFullTextQueryBuilderProfile = 'my_custom_profile';

Please see [3] for more doc on the various settings.

To tune the query-independent signals (the rescoring part in the doc), the
process is similar: you declare a profile and activate it by default.
The config var to add a new profile is $wgCirrusSearchRescoreProfiles, and
you can add more by following these examples [4].
The config var to change the default rescore profile is
$wgCirrusSearchRescoreProfile.
Rescore profiles internally use "rescore function chains", which can be
tuned as well using $wgCirrusSearchRescoreFunctionChains [5].
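To give a rough idea of the shape of such a profile (an approximation only; the authoritative key names and values are in the linked RescoreProfiles.config.php [4], and the chain name below is assumed to exist in the default config):

```php
// Sketch of a custom rescore profile; verify key names against [4].
$wgCirrusSearchRescoreProfiles['my_rescore'] = [
	'supported_namespaces' => 'all',
	'rescore' => [
		[
			'window' => 8192,
			'query_weight' => 1.0,
			'rescore_query_weight' => 1.0,
			'function_chain' => 'classic_allinone_chain',
		],
	],
];
// Activate it by default:
$wgCirrusSearchRescoreProfile = 'my_rescore';
```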

I'm sorry if this is a bit dense, and for the lack of comprehensive
documentation. I suggest having a look at the elasticsearch documentation
as well, as many concepts here are related to elasticsearch features
(dismax, rescoring, function score, ...).
We also have some integration with the LTR plugin [6].

Please let me know if you have specific questions or specific problems; I
can help point you in a specific direction instead of leaving you to
digest all of this.

Thank you.

[2] https://www.mediawiki.org/wiki/Extension:CirrusSearch/Scoring
[3]
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CirrusSearch/+/master/profiles/FullTextQueryBuilderProfiles.config.php#39
[4]
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CirrusSearch/+/master/profiles/RescoreProfiles.config.php
[5]
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CirrusSearch/+/master/profiles/RescoreFunctionChains.config.php
[6] https://github.com/o19s/elasticsearch-learning-to-rank

Re: [Wikitech-l] PHP7 expectations (zero-cost assertions)

2018-03-15 Thread David Causse
On Thu, Mar 15, 2018 at 3:57 PM, Brad Jorsch (Anomie)  wrote:
>
>
> PHP7's expectations seem like they started fixing those issues, although
> eval()-like use is still an option and exception-throwing seems to not be
> the default.
>

Indeed, I must admit that it's rather concerning that assert( 'code' ) will
try to evaluate the string arg; I completely overlooked this, as I was
focused on the perf arguments.

But what I understood from the discussion is that even if PHP resolves
these issues in the future, it's unlikely that we will disable assertions
in production mode for WMF appservers.
So assert() will never be an option for performance hotspots.

Thanks for your feedback!

Re: [Wikitech-l] PHP7 expectations (zero-cost assertions)

2018-03-15 Thread David Causse
The biggest take-away (for me) from the discussion is:

Pros:
- perf: zero-cost assertions

Cons:
- the benefits of zero-cost assertions are not worth the risk in a moving
code-base like MW.

The argument is that, even when assert is used properly (to express strong
expectations that cannot be unmet), problems are found in production too
frequently; benefiting from zero-cost assertions would therefore be a risk,
in the sense that we would not be able to detect these errors.


On Thu, Mar 15, 2018 at 3:30 PM, Cormac Parle <cpa...@wikimedia.org> wrote:

> Was the conclusion “don’t use assert()”? It’s not really that clear to me
>
> (fwiw I've always felt a bit squiffy about assert()s in production code,
> because it’s easy to make a php config mistake and get errors happening all
> over the place)
>
> > On 15 Mar 2018, at 14:17, David Causse <dcau...@wikimedia.org> wrote:
> >
> > Replying to myself:
> > I just found some discussions here:
> > https://lists.gt.net/wiki/wikitech/378676
> > I bet that the new assert features in PHP7 don't change the conclusions
> > here, so please ignore my e-mail and sorry for the noise.
> >
> > On Thu, Mar 15, 2018 at 2:42 PM, David Causse <dcau...@wikimedia.org>
> wrote:
> >
> >> Hi,
> >>
> >> Sometimes I find adding assert() calls in my code very handy for various
> >> reasons:
> >> - failures in development mode on some complex code where exposing all
> the
> >> details to unit tests is sometimes hard and/or pointless
> >> - readability of the code
> >>
> >> But I worry about the perf implications of these lines of code. I don't
> >> want these assertions to be used to track errors in production mode.
> >>
> >> PHP7 introduced expectations which permit to have zero-cost assert() [1]
> >> Looking at the MW codebase we don't seem to use assert frequently (only
> 26
> >> files [2] ).
> >>
> >> Are there some discussions about this?
> >> Is assert() a good practice for the MW code base?
> >> If yes would it make sense to benefit from zero-cost assertions in WMF
> >> appservers?
> >>
> >> Thanks!
> >>
> >> [1] http://php.net/manual/en/function.assert.php#function.
> >> assert.expectations
> >> [2] https://codesearch.wmflabs.org/search/?q=%5Ctassert%5C(&
> i=nope=
> >> php%24=
> >>

Re: [Wikitech-l] PHP7 expectations (zero-cost assertions)

2018-03-15 Thread David Causse
Replying to myself:
I just found some discussions here:
https://lists.gt.net/wiki/wikitech/378676
I bet that the new assert features in PHP7 don't change the conclusions
here, so please ignore my e-mail and sorry for the noise.

On Thu, Mar 15, 2018 at 2:42 PM, David Causse <dcau...@wikimedia.org> wrote:

> Hi,
>
> Sometimes I find adding assert() calls in my code very handy for various
> reasons:
> - failures in development mode on some complex code where exposing all the
> details to unit tests is sometimes hard and/or pointless
> - readability of the code
>
> But I worry about the perf implications of these lines of code. I don't
> want these assertions to be used to track errors in production mode.
>
> PHP7 introduced expectations which permit to have zero-cost assert() [1]
> Looking at the MW codebase we don't seem to use assert frequently (only 26
> files [2] ).
>
> Are there some discussions about this?
> Is assert() a good practice for the MW code base?
> If yes would it make sense to benefit from zero-cost assertions in WMF
> appservers?
>
> Thanks!
>
> [1] http://php.net/manual/en/function.assert.php#function.
> assert.expectations
> [2] https://codesearch.wmflabs.org/search/?q=%5Ctassert%5C(=nope=
> php%24=
>

[Wikitech-l] PHP7 expectations (zero-cost assertions)

2018-03-15 Thread David Causse
Hi,

Sometimes I find adding assert() calls in my code very handy for various
reasons:
- failures in development mode on some complex code where exposing all the
details to unit tests is sometimes hard and/or pointless
- readability of the code

But I worry about the perf implications of these lines of code. I don't
want these assertions to be used to track errors in production mode.

PHP7 introduced expectations, which make zero-cost assert() possible [1].
Looking at the MW codebase we don't seem to use assert frequently (only 26
files [2]).

Are there some discussions about this?
Is assert() a good practice for the MW code base?
If yes would it make sense to benefit from zero-cost assertions in WMF
appservers?

Thanks!

[1]
http://php.net/manual/en/function.assert.php#function.assert.expectations
[2]
https://codesearch.wmflabs.org/search/?q=%5Ctassert%5C(=nope=php%24=
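For reference, the PHP 7 mechanism in a nutshell (the ini values below are from the PHP manual; the function itself is just an illustration):

```php
// php.ini settings controlling PHP 7 expectations:
//   zend.assertions = 1    generate and execute assertion code (development)
//   zend.assertions = 0    generate assertion code but skip it at runtime
//   zend.assertions = -1   do not even compile assertions (zero-cost, production)
//   assert.exception = 1   throw AssertionError on failure instead of a warning

function positiveSum( array $values ): int {
	// A real expression, not a string, so nothing is eval()'d; with
	// zend.assertions = -1 this line has no runtime cost at all.
	assert( count( $values ) > 0, 'expected a non-empty array' );
	return array_sum( $values );
}
```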

Re: [Wikitech-l] AdvancedSearch beta feature now on Mediawiki

2017-11-27 Thread David Causse
Thanks for the hard work.
Very nice to see that advanced search can now be simple given how complex
(and sometimes messy) the search syntax can be.
Congrats to the team!

On Wed, Nov 22, 2017 at 10:50 PM, יגאל חיטרון 
wrote:

> Hi. It's me again. Now I can check this out, and it's very good, indeed.
> Igal
>
>
> 2017-11-22 21:37 GMT+02:00 Stas Malyshev :
>
> > Hi!
> >
> > > haha, awesome!
> > >
> > > thanks a lot :-)
> >
> > Confirming, looks great for me now. And congratulations to the team on
> > the release of this excellent feature!
> >
> > --
> > Stas Malyshev
> > smalys...@wikimedia.org
> >
>

Re: [Wikitech-l] CirrusSearch forceSearchIndex.php issue

2017-04-03 Thread David Causse
Hi Aran,

Without any error messages it's hard to guess the cause of this problem.
This script has been used to index millions of pages, but there must be
some configuration that triggers the strange behavior you are experiencing.
It may take time to investigate, so in order not to lose track of your
issue, could you file a task at https://phabricator.wikimedia.org/ and add
all the details you can extract from your installation?

Thanks!

On Sun, Apr 2, 2017 at 9:40 PM, Aran  wrote:

> Hello,
>
> I've found that running forceSearchIndex.php with --skipLinks
> --indexOnSkip options is only processing about 1000-1500 pages and then
> stopping. It's not due to error or anything because I can set --fromId
> and run it again and it does another batch. I've tried setting --limit
> and --batchSize and all sorts but nothing allows it to do more than this
> amount at a time anyone have any idea what might be going on here?
>
> (It's happening on both MW 1.27 with Elastic Search 1.75 and MW 1.28
> with elastic 2.4.4)
>
> Thanks,
> Aran
>
>

Re: [Wikitech-l] ORES article quality data as a database table

2016-09-22 Thread David Causse

Thanks Amir!

It's been a long time since I first wanted to include wp10 in our search
indices to experiment with this data as a relevance signal.

This is now possible with your dataset, and I've built a test index [1]
which uses the following signals to rank results:

- incoming links
- weekly pageviews
- wp10

The weights for these signals have not been properly tuned yet, but they
can be adjusted at query time with URI query params:

- cirrusIncLinksW: weight for a value that ranges from 0 to 1
- cirrusPageViewsW: weight for a value that ranges from 0 to 1
- cirrusWP10W: weight for a value that ranges from 0 to 5

Examples:

- articles in category 'History_of_Essex' sorted by WP10, best first [2]
- articles in category 'History_of_Essex' sorted by WP10, worst first [3]

I'd love to make this data available in a more convenient way with query
keywords like wp10:0, and then allow playing with other signals like
pageviews.

Concerning internal search ranking, we will soon evaluate how wp10
compares with existing signals (inclinks/pageviews), and I'd like to use
it as a replacement for the naive scoring method we use for autocomplete
searches.

Well... everything is at an early stage, but I believe we can do very
interesting things with wp10 and search; I still don't know exactly what,
nor how :)

Thanks!

[1] http://en-wp-bm25-wp10-relforge.wmflabs.org/wiki/Special:Search

[2]
http://en-wp-bm25-wp10-relforge.wmflabs.org/w/index.php?search=incategory%3A%22History_of_Essex%22=Special:Search=Go=10=0=0

[3]
http://en-wp-bm25-wp10-relforge.wmflabs.org/w/index.php?search=incategory%3A%22History_of_Essex%22=Special:Search=Go=-10=0=0


On 21/09/2016 at 11:11, Amir Ladsgroup wrote:

One of ORES [1] applications is determining article quality. For example,
What would be the best assessment of an article in the given revision.
Users in wikiprojects use ORES data to check if articles need
re-assessment. e.g. if an article is in "Start" level and now good it's
enough to be a "B" article.

As part of Q4 goals, we made a dataset of article quality scores of all
articles in English Wikipedia [2] (Here's the link to download the dataset
[3]) and we are publishing it in figshare as something you can cite [4]
also we are working on publishing monthly data for researchers to track
article quality data change over time. [5]

As a pet project of mine, I always wanted to put these data in a database.
So we can query the database and get much more useful data. For example
quality of articles in category 'History_of_Essex' [6] [7]. The weighed sum
is a measure of quality which is a decimal number between 0 (really stub)
to 5 (a definitely featured article). We have also prediction column which
is a number in this map [8] for example if prediction is 5, it means ORES
thinks it should be a featured article.

I leave more use cases to your imagination :)

I'm looking for a more permanent place to put these data, please tell me if
it's useful for you.
[1] ORES is not a anti-vandalism tool, it's an infrastructure to use AI in
Wikipedia.
[2] https://phabricator.wikimedia.org/T135684
[3] (117 MBs)
https://datasets.wikimedia.org/public-datasets/enwiki/article_quality/wp10-scores-enwiki-20160820.tsv.bz2
[4] https://phabricator.wikimedia.org/T145332
[5] https://phabricator.wikimedia.org/T145655
[6] https://quarry.wmflabs.org/query/12647
[7] https://quarry.wmflabs.org/query/12662
[8]
https://github.com/wiki-ai/wikiclass/blob/3ff2f6c44c52905c7202515c5c8b525fb1ceb291/wikiclass/utilities/extract_scores.py#L37

Have fun!
Amir




Re: [Wikitech-l] [discovery] Weekly update for the week starting 2016-09-05

2016-09-12 Thread David Causse

Thanks Eran,

I completely forgot to mention this feature in the status update.

I hope to be able to build a test index very soon so we can start 
evaluating this new feature.


David.

On 10/09/2016 at 11:23, Eran Rosenthal wrote:

Thanks Deborah for the update.

Just to mention another interesting feature (implemented but not yet
evaluated/activated): dcausse has implemented the ability to show search
results also based on the DEFAULTSORT (T134978).
E.g. when you search for Putin (and not Vladimir Putin) you will get a
suggestion for Vladimir Putin (even if there is no redirect from
Putin), as its defaultsort is Putin, Vladimir.

This feature may have high impact once (and if) it is activated.


On Sat, Sep 10, 2016 at 1:44 AM, Deborah Tankersley 
> wrote:


Hello,

Here is the week's update from the Discovery department - enjoy
the read and your weekend!

== Discussions ==
* Trey completed the analysis for optimizing language
identification for the Dutch Wikipedia (nlwiki). The results were
good (F0.5 = 82.3%) but not great. The small proportions of
queries in the Romance languages and in German led to many more
false positives than true positives and so they had to be
excluded. Future work on improving confidence may help. [1]
* We could use help translating (via translatewiki) the relevant
"showing results from" messages into Dutch. We'll need English,
Chinese, Arabic, Korean, Greek, Hebrew, Japanese, and Russian
translations. [2]
* The Analysis team had a discussion on how to use better wording
for phrases like "users were 1.07 times more likely to do X" and
decided on using phrases similar to "we can expect 2-9 more
sessions to click on a search result when they have the new
feature" [3]
* The Search team wrapped up research into the ElasticSearch
instabilities on the eqiad search cluster that occurred on Aug 6,
2016; nothing conclusive was found. [4]


== Events and News ==

=== Interactive ===
*  has been enabled on all wikis (announced via email to
wikitech-l) [5]
* Geoshapes data service is now integrated into all maps [6]

=== Search ===
* Turned off BM25 A/B test, awaiting analysis [7]
* Pushed into production a change that implemented ascii-folding
for French [8]
* Improved balance of nodes across rows for ElasticSearch eqiad
cluster [9]

=== Portal ===
* Currently blocked on this check-in to gerrit [10]


== Other Noteworthy Stuff' ==
* Our elasticsearch clusters now have "row aware shard
allocation". This means that we can theoretically lose one row of
servers in our datacenter and still serve search traffic. [11]
* The Search team sent out a request for comment article that was
posted to various Village Pumps asking for it to be translated. [12]
** This was in reference to the cross-wiki search results new
functionality and design articles on MediaWiki. [13], [14]


== Did you know? ==
* A study came out yesterday showing that giraffes are actually
four distinct species, rather than one (article and BBC report).
[15], [16]
** Of course, the English and German Wikipedia pages on giraffes
have already been updated! [17], [18]


[1]

https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_plwiki_arwiki_zhwiki_and_nlwiki


[2] https://phabricator.wikimedia.org/T143354

[3] https://phabricator.wikimedia.org/T140187

[4] https://phabricator.wikimedia.org/T142506

[5]
https://lists.wikimedia.org/pipermail/wikitech-l/2016-September/086490.html


[6]

https://www.mediawiki.org/wiki/Help:Extension:Kartographer#GeoShapes_external_data


[7] https://phabricator.wikimedia.org/T143588

[8] https://phabricator.wikimedia.org/T144429

[9] https://phabricator.wikimedia.org/T143685

[10] https://gerrit.wikimedia.org/r/#/c/306241/

[11] https://phabricator.wikimedia.org/T143571

[12]

https://meta.wikimedia.org/wiki/User:DTankersley_(WMF)/translation_request_for_cross-wiki_search_results


Re: [Wikitech-l] How to find pages using the Score extension?

2016-03-15 Thread David Causse

Hi,

It's maybe not the best solution, but you could use the insource [1]
search syntax, i.e. on English Wikipedia: insource:/\<score\>/ [2].


[1] https://en.wikipedia.org/wiki/Help:Searching#Parameters
[2] 
https://en.wikipedia.org/w/index.php?title=Special%3ASearch=default=insource%3A%2F%5C%3Cscore%5C%3E%2F+insource%3A%22score%22=Search



On 16/03/2016 at 00:31, Daniel Mietchen wrote:

Hi,

I want to get an idea how many times the Score extension is invoked
but could find no way to go about that, nor to get a list of pages
using <score> or variants thereof on a per-wiki or per-category basis.

Any pointers?

Thanks,
d.





Re: [Wikitech-l] [Engineering] Looking for more additions to SWAT (deploy) team

2016-03-03 Thread David Causse

Hi Greg,

I'm willing to help (mostly morning SWAT).
I've never deployed anything so I'll need some help for the first patches.

David.

On 02/03/2016 at 23:21, Greg Grossmeier wrote:

As you know, one of the quickest ways of getting your fix backported to
production (from master) is to use a SWAT window:
https://wikitech.wikimedia.org/wiki/SWAT_deploys

And now, I ask for new volunteers!

Would you like to help make developers happy and be a part of the crew
of people who deploy during these windows? Of course you would!

These happen twice a day every work day (except Friday, naturally) at
8:00 SF (16:00 UTC currently) and 16:00 SF (00:00 UTC).

We're open to both those already familiar with deploying to Wikimedia
servers and those who are not; we're friendly and willing to teach :)

Requirements are:
* A shell account on production
** See: https://wikitech.wikimedia.org/wiki/Requesting_shell_access
* Availability during one of the two windows on a regular basis
* A willingness to learn, and comfort with asking for help/advice when
   you aren't sure, especially about what a particular patch will
   actually do in production.


What you'll get:
* Fancy access to deploy to "A Top 10 Web Property"(TM)(C)
* Support from current SWATers to get started
* A t-shirt and/or sticker if you end up breaking, AND FIXING,
   production

Let me know if you're interested!

Greg





Re: [Wikitech-l] New Beta Feature: completion suggester

2015-12-21 Thread David Causse

On 20/12/2015 21:55, Sage Ross wrote:

If I'm, say, building a web app that could benefit from that kind of
search suggestion tool, is there an API I can use?


The API endpoint is action=cirrus-suggest[1] and accepts two parameters:
text for the user input and limit (5 by default).


Example:
/w/api.php?action=cirrus-suggest&format=json&text=albert%20einstein&limit=5
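
A minimal client could build that request from the two documented
parameters; this is a sketch only (the endpoint is experimental, as
noted below, and the base URL here is just an assumed example wiki):

```python
from urllib.parse import urlencode

# Assumed example endpoint; any wiki running CirrusSearch with the
# experimental module enabled would work the same way.
API = "https://en.wikipedia.org/w/api.php"

def cirrus_suggest_url(text, limit=5):
    """Build a request URL for the experimental cirrus-suggest
    module: text is the user input, limit the number of results."""
    return API + "?" + urlencode({
        "action": "cirrus-suggest",
        "format": "json",
        "text": text,
        "limit": limit,
    })
```

The returned URL can then be fetched with any HTTP client and the JSON
body decoded; since the module is internal, the response shape should
not be relied upon.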


Note that this API is highly experimental and subject to change. I'd 
suggest using it only for evaluation purposes at this point. We may 
provide better integration with the MediaWiki API ecosystem (i.e. 
generators[2]) in the coming weeks.


[1] 
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=cirrus-suggest=Main%20Page=revisions=content=jsonfm

[2] https://www.mediawiki.org/wiki/API:Query#Generators

David


Re: [Wikitech-l] New Beta Feature: completion suggester

2015-12-21 Thread David Causse

On 20/12/2015 22:19, John Erling Blad wrote:

I tried this on a search for "Sør-Aurdal" (a municipality in Norway),
dropped the dash and wrote "sørau", and got a hit on "Søraust-Svalbard
naturreservat" among other things. The topmost hit was "søraurdøl", which
is a demonym for someone from Sør-Aurdal. It seems to me that a spelling
error is compensated with a fuzzy search for the long(est?) words, but
that implies nearly completing the word if there is a spelling error.


Thank you, this is exactly the kind of feedback we were looking for when 
we deployed this feature as a beta feature.


In this case, the first thing to note is that "søraurdøl" [1] is a 
redirect to "Sør-Aurdal" [2]. The completion suggester won't display 
multiple suggestions that share the same target page. Here it internally 
receives both "søraurdøl" and "Sør-Aurdal", but because both resolve to 
the page "Sør-Aurdal" it has to decide which one to display, and it 
chooses "søraurdøl" because the query "sørau" is a perfect prefix hit.
You can see when the algorithm will prefer "Sør-Aurdal" by continuing 
to type:

"søraud" => "søraurdøl" (still a perfect prefix)
"sørauda" => "Sør-Aurdal" (here neither is a perfect prefix, so it 
displays the canonical page "Sør-Aurdal")
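
That query-time choice can be condensed into a toy decision rule; this
is my own illustrative reconstruction, not the actual CirrusSearch code,
and it ignores the fuzzy matching the real suggester also performs:

```python
def pick_display_title(query, canonical, redirects):
    """For a group of suggestions that all resolve to the same page,
    choose which title to display: prefer a perfect prefix hit
    (checking the canonical title first), and fall back to the
    canonical title when nothing matches the prefix exactly."""
    q = query.lower()
    for title in [canonical] + redirects:
        if title.lower().startswith(q):
            return title
    # Neither title is a perfect prefix: display the canonical page.
    return canonical
```

Under this rule "sørau" selects the redirect "søraurdøl" (a perfect
prefix) while "sørauda" falls back to the canonical "Sør-Aurdal",
matching the behaviour described above for those two inputs.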


There are many knobs we could adjust to display better suggestions. Here 
I can see two of them:


1. At index time the suggester groups redirects that are very similar 
to the canonical title:
On enwiki the redirect "Albert Enstein" is grouped with its canonical 
page "Albert Einstein", so "Albert Enstein" will never be proposed on 
its own and the suggester won't have to choose between "Albert Enstein" 
and "Albert Einstein": it will always display "Albert Einstein". This 
technique allows us to display proper suggestions even if the user types 
something as far off as "alberensten". Here the suggester can benefit 
from popular pages whose common typos have been manually curated by 
editors.
Unfortunately such arbitrary decisions also have drawbacks; a 
counterexample is "life a": on enwiki this query will suggest "Life 
insurance" instead of "Life assurance" because the redirect "Life 
assurance" has been wrongly grouped with "Life insurance". This is not 
completely wrong, since both suggestions lead to the same page, but it's 
not perfect...
So we could fix the "sørau" problem by increasing the tolerance of this 
grouping step, but unfortunately that would also increase the number of 
cases like "life assurance".
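
The grouping step can likewise be sketched with a plain string-similarity
test; difflib and the 0.85 threshold here are my stand-ins, not the real
implementation, but they reproduce the trade-off described above:

```python
import difflib

def group_redirects(canonical, redirects, threshold=0.85):
    """Split redirects into those folded into the canonical title
    (too similar to ever be shown on their own) and those kept as
    independent suggestions. The threshold is an illustrative guess."""
    folded, kept = [], []
    for r in redirects:
        ratio = difflib.SequenceMatcher(
            None, canonical.lower(), r.lower()).ratio()
        (folded if ratio >= threshold else kept).append(r)
    return folded, kept
```

With this threshold "Albert Enstein" folds into "Albert Einstein" (the
desired case) but "Life assurance" also folds into "Life insurance" (the
drawback), while "søraurdøl" stays separate from "Sør-Aurdal"; raising
the tolerance simply trades one kind of mistake for the other.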


2. Change the decision at query time
We could also always prefer canonical pages over redirects, even if the 
canonical page is not a perfect prefix hit. I'm not aware of a 
counterexample here, but since our ranking algorithm is far from perfect 
we preferred to stick with perfect prefix hits for now.
In the coming months we should be able to include pageview statistics 
in the formula; we hope such metrics bring positive improvements and 
allow us to revisit this decision.


As you can see, the suggester makes arbitrary (sometimes hazardous) 
decisions that can be wrong, and that is the whole purpose of having 
this feature in beta: depending on feedback like yours we may review 
and adjust various parameters of the algorithm.


Thank you!

David.

[1] ("Redirected from Søraurdøl"): 
https://no.wikipedia.org/w/index.php?title=S%C3%B8raurd%C3%B8l=no
[2] 
https://no.wikipedia.org/w/api.php?action=query&list=backlinks&bltitle=S%C3%B8r-Aurdal&blfilterredir=redirects


Re: [Wikitech-l] New Beta Feature: completion suggester

2015-12-21 Thread David Causse

On 21/12/2015 16:12, Brad Jorsch (Anomie) wrote:


You should have implemented isInternal() to return true in your module, so
the auto-generated documentation would properly reflect that status.


I'll fix it, thanks for the advice.


I'd suggest using it only for evaluation purposes at this point. We may
provide better integration with the MediaWiki API ecosystem (i.e.
generators[2]) in the coming weeks.


Does your plan for "better integration" include making it the backend for
action=opensearch when CirrusSearch is installed? That would allow
browsers' search bars to benefit too.


It was the initial plan, but for simplicity reasons I preferred to bind 
the MW JS searchSuggest API to the internal cirrus-suggest API.
If the completion suggester proves successful and useful, it will be a 
nice candidate to replace TitlePrefixSearch in opensearch.



I'd recommend against a non-beta CirrusSearch module for suggestions,
versus something in core that Cirrus provides the backend for. That
something is probably the existing list=prefixsearch.[1]


I agree. On this point I will follow any recommendations from the API 
maintainers; my knowledge of the current API ecosystem is too limited 
to make a good decision here.
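
For comparison, a request against the existing core module is already
easy to build; this sketch uses the documented list=prefixsearch
parameter names, with the base URL as an assumed example:

```python
from urllib.parse import urlencode

def prefixsearch_url(search, limit=5,
                     base="https://en.wikipedia.org/w/api.php"):
    """Build a query against the core list=prefixsearch module,
    which CirrusSearch can provide the backend for when installed."""
    return base + "?" + urlencode({
        "action": "query",
        "list": "prefixsearch",
        "pssearch": search,
        "pslimit": limit,
        "format": "json",
    })
```

Routing the suggester behind this module (and opensearch) rather than a
Cirrus-specific action would let existing clients benefit unchanged.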


Thanks!

David.
