Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-08 Thread oren bochman
-----Original Message----- From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Brion Vibber Sent: Thursday, March 7, 2013 9:59 PM To: Wikimedia developers Subject: Re: [Wikitech-l] Indexing non-text content in LuceneSearch On Thu, Mar 7, 2013

[Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Daniel Kinzler
Hi all! I would like to ask for your input on the question of how non-wikitext content can be indexed by LuceneSearch. The background is that full text search (Special:Search) is nearly useless on wikidata.org at the moment, see https://bugzilla.wikimedia.org/show_bug.cgi?id=42234. The reason

Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Brion Vibber
On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler dan...@brightbyte.de wrote: 1) create a specialized XML dump that contains the text generated by getTextForSearchIndex() instead of actual page content. That probably makes the most sense; alternately, make a dump that includes both raw data and

Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Daniel Kinzler
On 07.03.2013 20:58, Brion Vibber wrote: 3) The indexer code (without plugins) should not know about Wikibase, but it may have hard coded knowledge about JSON. It could have a special indexing mode for JSON, in which the structure is deserialized and traversed, and any values are added
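
A minimal sketch of what such a generic JSON indexing mode might look like, written in Python for brevity rather than in the indexer's own language. The message above is cut off, so where the values end up is an assumption: here they are simply joined into one plain-text string. The function name collect_index_text and the sample document are hypothetical and are not taken from LuceneSearch or Wikibase.

```python
import json


def collect_index_text(node):
    """Walk a deserialized JSON structure and yield its scalar values as text.

    Hypothetical helper: keys are ignored, strings and numbers are kept,
    booleans and nulls are skipped. Nothing here knows about Wikibase.
    """
    if isinstance(node, dict):
        for value in node.values():
            yield from collect_index_text(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_index_text(item)
    elif isinstance(node, str):
        if node:
            yield node
    elif isinstance(node, bool) or node is None:
        # bool is a subclass of int in Python, so it is filtered out first.
        return
    elif isinstance(node, (int, float)):
        yield str(node)


# Made-up sample document, only loosely shaped like an entity with labels.
raw = '{"label": {"en": "Berlin"}, "aliases": ["Berlin, Germany"], "population": 3500000, "flag": true}'
index_text = " ".join(collect_index_text(json.loads(raw)))
print(index_text)
```

Because the traversal only looks at JSON structure, it would work for any JSON content model, which is the point of keeping Wikibase-specific knowledge out of the indexer.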

Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Munagala Ramanath
(1) seems like the right way to go to me too. There may be other ways but puppet/files/lucene/lucene.jobs.sh has a function called import-db() which creates a dump like this: php $MWinstall/common/multiversion/MWScript.php dumpBackup.php $dbname --current $dumpfile Ram On Thu, Mar 7, 2013