Thanks for the response, Erik; it's been very informative. I have a few follow-up questions (inline).

On 29 October 2015 17:56:25 EET, Erik Bernhardson <[email protected]> wrote:
>On Thu, Oct 29, 2015 at 8:47 AM, Strainu <[email protected]> wrote:
>
>> Hi,
>>
>> I've been reading the mw.org and wikitech pages on Cirrussearch (and
>> the code) in the hope that I will be able to understand how is the
>> page content transformed before being sent to ES and how is it kept in
>> ES and I have a few questions:
>>
>> 1. Is the documentation available anywhere? I don't see it on
>> https://doc.wikimedia.org/
>>
>
>Feature documentation is at
>https://www.mediawiki.org/wiki/Help:CirrusSearch,
>operational documentation is at
>https://wikitech.wikimedia.org/wiki/Search

I was referring to the code docs; they make it easier to follow the class hierarchy.

>> 2. What part of the whole ecosystem transforms the wikitext into
>> indexable text? Where can I find it? It should be somewhere downstream
>> from CirrusSearch\Updater::updateFromTitle(), but I can't figure out
>> where exactly.
>>
>
>The documents are built using the classes in
>https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/tree/master/includes/BuildDocument

I see you use already-parsed text. I'm wondering whether using the output of mwparserfromhell would work: I have some wikitext that is not in a MediaWiki database that I would like to index. I'm guessing I'll have to write some code, but the idea would be the same.

>> If this transformation doesn't happen, from where is the searchable
>> text obtained?
>>
>> 3. Where can I find the ES schema used for wikipages? Is it different
>> for images/categories?
>>
>
>ES schema is the same everywhere, the easiest way to see what the data
>looks like is just request a dump for a particular page. This will output
>json, i use a chrome extension called JsonView to make this look nice:
>https://wikitech.wikimedia.org/wiki/Search?action=cirrusdump

That is very cool indeed.
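
For reference, the dump request described above can be sketched in a few lines of Python. This is a minimal sketch, not part of CirrusSearch: the helper names cirrusdump_url and fetch_cirrusdump are hypothetical, and the only thing taken from the thread is that appending action=cirrusdump to a page URL returns JSON.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def cirrusdump_url(page_url):
    """Return page_url with action=cirrusdump set in its query string.

    Hypothetical helper: any existing "action" parameter is replaced,
    other query parameters are kept in their original order.
    """
    parts = urlsplit(page_url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "action"]
    query.append(("action", "cirrusdump"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_cirrusdump(page_url, timeout=10):
    """Download and decode the JSON dump for one page.

    Needs network access. Assumption: the server accepts urllib's default
    request headers; some sites may require a descriptive User-Agent.
    """
    with urlopen(cirrusdump_url(page_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling cirrusdump_url("https://wikitech.wikimedia.org/wiki/Search") yields the same URL quoted above; fetch_cirrusdump then returns the decoded JSON, whose exact layout is not assumed here.
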

Thanks again,
Strainu

>> Thanks,
>> Strainu
>>
>> _______________________________________________
>> Wikitech-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
