FWIW, we do index the full text of (PDF and?) DjVu files on Commons
(because it's stored in img_metadata). It's probably the biggest
improvement CirrusSearch brought for Commons.
And we also index office documents via Tika (*.doc and similar).
And I think it should not be a feature of the
I wrote about this problem ~2 years ago:
http://wikitech-l.wikimedia.narkive.com/6G0YPmWQ/need-a-way-to-modify-text-before-indexing-was-searchupdate
It seems I missed the latest message, so I want to reply to it now:
With lsearchd and Elasticsearch, we absolutely wouldn't want to munge
FWIW, we do index the full text of (PDF and?) DjVu files on Commons
(because it's stored in img_metadata). It's probably the biggest
improvement CirrusSearch brought for Commons.
Nemo
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
SearchEngine subclasses can implement getTextFromContent() if they want
to override the normal text fetching behavior.
I can't put it into a SearchEngine subclass because Tika isn't a search
engine; it's a Java application that runs separately and extracts
text from binary files like
On Wed, Jan 15, 2014 at 12:07 AM, Vitaliy Filippov vita...@yourcmc.ru wrote:
SearchEngine subclasses can implement getTextFromContent() if they want to
override the normal text fetching behavior.
I can't put it into a SearchEngine subclass because Tika isn't a search
engine, it's rather a Java
Hi!
Change https://gerrit.wikimedia.org/r/#/c/79025/, which was merged into 1.22,
breaks my TikaMW extension: I used that hook to extract the contents of
binary files so that users can then search them.
Could you add some other hook for this purpose?
See also
On Tue, Jan 14, 2014 at 2:33 PM, vita...@yourcmc.ru wrote:
Hi!
Change https://gerrit.wikimedia.org/r/#/c/79025/, which was merged into 1.22,
breaks my TikaMW extension: I used that hook to extract the contents of
binary files so that users can then search them.
Maybe you can add some other hook