https://bugzilla.wikimedia.org/show_bug.cgi?id=45983

       Web browser: ---
            Bug ID: 45983
           Summary: Enable creation of dumps dedicated to feeding a search
                    index
           Product: MediaWiki
           Version: 1.21-git
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: ContentHandler
          Assignee: wikidata-b...@lists.wikimedia.org
          Reporter: daniel.kinz...@wikimedia.de
                CC: wikidata-b...@lists.wikimedia.org
    Classification: Unclassified
   Mobile Platform: ---

Some search backends, like LuceneSearch, rely on XML dumps to build the search
index. The indexer has no knowledge of content models, so it will index
everything in the dump as-is. For non-text content models, this means it will
index the serialized form, which will often lead to bad results (see bug
42234).

To solve this, a brief discussion on wikitech-l suggested adding an option to
the dump creation process that would write generated text into the dumps
instead of the raw serialized data. This option could then be used to create
dumps specifically for rebuilding a search index. See
http://www.gossamer-threads.com/lists/wiki/wikitech/340638

The Content interface already defines the function getTextForSearchIndex for
generating such pseudo-content. It only needs to be hooked up to dump
generation.
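
A rough sketch of the idea, written in Python purely for illustration (it is
not MediaWiki code). The classes, the for_search_index flag and the
dump_text_element helper are all hypothetical; only the notion of a
getTextForSearchIndex-style method on content objects comes from this report.

```python
import json
from xml.sax.saxutils import escape


class TextContent:
    """Hypothetical stand-in for a plain text content model."""

    def __init__(self, text):
        self.text = text

    def serialize(self):
        return self.text

    def get_text_for_search_index(self):
        # For plain text, the searchable text is the text itself.
        return self.text


class JsonContent:
    """Hypothetical stand-in for a non-text content model stored as JSON."""

    def __init__(self, data):
        self.data = data

    def serialize(self):
        return json.dumps(self.data, sort_keys=True)

    def get_text_for_search_index(self):
        # Only the values are worth indexing, not the JSON syntax.
        return " ".join(str(value) for value in self.data.values())


def dump_text_element(content, for_search_index=False):
    """Render the <text> element of a dump entry.

    With for_search_index=True (an assumed option), the generated search
    text is written instead of the raw serialized form.
    """
    if for_search_index:
        body = content.get_text_for_search_index()
    else:
        body = content.serialize()
    return "<text>" + escape(body) + "</text>"


item = JsonContent({"label": "Berlin", "description": "capital of Germany"})
print(dump_text_element(item))
print(dump_text_element(item, for_search_index=True))
```

With the flag off, the indexer would see the JSON syntax as-is; with it on,
it would see only the words a user might actually search for.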

-- 
You are receiving this mail because:
You are watching all bug changes.
_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l