On Thu, Oct 29, 2015 at 8:47 AM, Strainu <[email protected]> wrote:
> Hi,
>
> I've been reading the mw.org and wikitech pages on CirrusSearch (and
> the code) in the hope of understanding how the page content is
> transformed before being sent to ES and how it is kept in ES, and I
> have a few questions:
>
> 1. Is the documentation available anywhere? I don't see it on
> https://doc.wikimedia.org/

Feature documentation is at
https://www.mediawiki.org/wiki/Help:CirrusSearch
and operational documentation is at
https://wikitech.wikimedia.org/wiki/Search

> 2. What part of the whole ecosystem transforms the wikitext into
> indexable text? Where can I find it? It should be somewhere downstream
> from CirrusSearch\Updater::updateFromTitle(), but I can't figure out
> where exactly.

The documents are built using the classes in
https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/tree/master/includes/BuildDocument

> If this transformation doesn't happen, from where is the searchable
> text obtained?
>
> 3. Where can I find the ES schema used for wikipages? Is it different
> for images/categories?

The ES schema is the same everywhere. The easiest way to see what the
data looks like is to request a dump for a particular page. This
outputs JSON; I use a Chrome extension called JSONView to make it look
nice:
https://wikitech.wikimedia.org/wiki/Search?action=cirrusdump

> Thanks,
> Strainu
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
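
For anyone who would rather script the dump request mentioned in the reply than use a browser, here is a minimal Python sketch. The only thing taken from the thread is the `action=cirrusdump` query parameter. The helper names (`cirrusdump_url`, `fetch_cirrusdump`) and the User-Agent string are made up for illustration, and the layout of the returned JSON is not described in the thread, so the sketch only parses it and hands it back.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def cirrusdump_url(page_url: str) -> str:
    """Return page_url with action=cirrusdump appended to its query string.

    Hypothetical helper: it only handles plain page URLs (no #fragment).
    """
    separator = "&" if "?" in page_url else "?"
    return page_url + separator + urlencode({"action": "cirrusdump"})


def fetch_cirrusdump(page_url: str, timeout: float = 10.0):
    """Fetch the dump for one page and return the parsed JSON.

    The User-Agent value is an assumption for this example; the shape of
    the returned data is whatever the wiki sends back.
    """
    request = Request(
        cirrusdump_url(page_url),
        headers={"User-Agent": "cirrusdump-example/0.1"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs network access, so it is left commented out):
# dump = fetch_cirrusdump("https://wikitech.wikimedia.org/wiki/Search")
# print(json.dumps(dump, indent=2))
```

Pretty-printing with `json.dumps(..., indent=2)` gives roughly the same readability as the browser extension.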
