Re: [Wikitech-l] Need a way to modify text before indexing (was SearchUpdate)

2015-10-14 Thread vitalif
FWIW, we do index the full text of (PDF and?) DjVu files on Commons (because it's stored in img_metadata). It's probably the biggest improvement CirrusSearch brought for Commons. And we also index office documents via Tika (*.doc and similar). And I think it should not be a feature of the

[Wikitech-l] Need a way to modify text before indexing (was SearchUpdate)

2015-10-14 Thread vitalif
I wrote about my problem ~2 years ago: http://wikitech-l.wikimedia.narkive.com/6G0YPmWQ/need-a-way-to-modify-text-before-indexing-was-searchupdate It seems I missed the latest message, so I want to reply to it now: With lsearchd and Elasticsearch, we absolutely wouldn't want to munge

Re: [Wikitech-l] Need a way to modify text before indexing (was SearchUpdate)

2015-10-14 Thread Federico Leva (Nemo)
FWIW, we do index the full text of (PDF and?) DjVu files on Commons (because it's stored in img_metadata). It's probably the biggest improvement CirrusSearch brought for Commons. Nemo

Re: [Wikitech-l] Need a way to modify text before indexing (was SearchUpdate)

2014-01-15 Thread Vitaliy Filippov
SearchEngine subclasses can implement getTextFromContent() if they want to override the normal text fetching behavior. I can't put it into a SearchEngine subclass because Tika isn't a search engine; it's rather a Java application that runs separately and extracts text from binary files like

Re: [Wikitech-l] Need a way to modify text before indexing (was SearchUpdate)

2014-01-15 Thread Chad
On Wed, Jan 15, 2014 at 12:07 AM, Vitaliy Filippov vita...@yourcmc.ru wrote: SearchEngine subclasses can implement getTextFromContent() if they want to override the normal text fetching behavior. I can't put it into SearchEngine subclass because Tika isn't a search engine, it's rather a java

[Wikitech-l] Need a way to modify text before indexing (was SearchUpdate)

2014-01-14 Thread vitalif
Hi! Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22 breaks my TikaMW extension - I used that hook to extract contents from binary files so the user can then search on it. Maybe you can add some other hook for this purpose? See also

Re: [Wikitech-l] Need a way to modify text before indexing (was SearchUpdate)

2014-01-14 Thread Chad
On Tue, Jan 14, 2014 at 2:33 PM, vita...@yourcmc.ru wrote: Hi! Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22 breaks my TikaMW extension - I used that hook to extract contents from binary files so the user can then search on it. Maybe you can add some other hook