On Thu, Oct 29, 2015 at 8:47 AM, Strainu <[email protected]> wrote:

> Hi,
>
> I've been reading the mw.org and wikitech pages on CirrusSearch (and
> the code) in the hope of understanding how the page content is
> transformed before being sent to ES and how it is kept in ES, and I
> have a few questions:
>
> 1. Is the documentation available anywhere? I don't see it on
> https://doc.wikimedia.org/
>
>
Feature documentation is at https://www.mediawiki.org/wiki/Help:CirrusSearch
and operational documentation is at https://wikitech.wikimedia.org/wiki/Search


> 2. What part of the whole ecosystem transforms the wikitext into
> indexable text? Where can I find it? It should be somewhere downstream
> from CirrusSearch\Updater::updateFromTitle(), but I can't figure out
> where exactly.
>
>
The documents are built using the classes in
https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/tree/master/includes/BuildDocument


> If this transformation doesn't happen, from where is the searchable
> text obtained?
>
> 3. Where can I find the ES schema used for wikipages? Is it different
> for images/categories?
>
>
The ES schema is the same everywhere. The easiest way to see what the data
looks like is to request a dump for a particular page. This outputs
JSON; I use a Chrome extension called JsonView to make it look nice:
https://wikitech.wikimedia.org/wiki/Search?action=cirrusdump
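
If you would rather not install a browser extension, a few lines of Python
can do roughly the same thing. This is only a sketch: the helper names
(cirrusdump_url, pretty, fetch_dump) and the User-Agent string are made up
for illustration, and it assumes the dump is served as UTF-8 JSON at the
URL pattern shown above.

```python
import json
import urllib.parse
import urllib.request


def cirrusdump_url(wiki_base: str, title: str) -> str:
    """Build the dump URL for a page, e.g. .../wiki/Search?action=cirrusdump."""
    # Page titles use underscores for spaces; percent-encode anything else unsafe.
    path = urllib.parse.quote(title.replace(" ", "_"), safe=":/")
    return f"{wiki_base.rstrip('/')}/{path}?action=cirrusdump"


def pretty(json_text: str) -> str:
    """Re-indent raw JSON so nested fields are easy to read."""
    return json.dumps(json.loads(json_text), indent=2, sort_keys=True, ensure_ascii=False)


def fetch_dump(wiki_base: str, title: str) -> str:
    """Fetch the dump over HTTP and return it pretty-printed (needs network access)."""
    request = urllib.request.Request(
        cirrusdump_url(wiki_base, title),
        headers={"User-Agent": "cirrusdump-example/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return pretty(response.read().decode("utf-8"))
```

Calling print(fetch_dump("https://wikitech.wikimedia.org/wiki", "Search"))
should show the same document as the URL above, just indented.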


> Thanks,
>    Strainu
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
