FWIW, we do index the full text of (PDF and?) DjVu files on Commons
(because it's stored in img_metadata). It's probably the biggest
improvement CirrusSearch brought for Commons.
And we also index office documents via Tika (*.doc and similar).
And I think it should not be a feature of the
FWIW, we do index the full text of (PDF and?) DjVu files on Commons
(because it's stored in img_metadata). It's probably the biggest
improvement CirrusSearch brought for Commons.
Nemo
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
SearchEngine subclasses can implement getTextFromContent() if they want
to override the normal text fetching behavior.
I can't put it into SearchEngine subclass because Tika isn't a search
engine; it's rather a Java application that runs separately and extracts
text from binary files like
On Wed, Jan 15, 2014 at 12:07 AM, Vitaliy Filippov vita...@yourcmc.ru wrote:
SearchEngine subclasses can implement getTextFromContent() if they want to
override the normal text fetching behavior.
I can't put it into SearchEngine subclass because Tika isn't a search
engine; it's rather a Java
On Tue, Jan 14, 2014 at 2:33 PM, vita...@yourcmc.ru wrote:
Hi!
Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22
breaks my TikaMW extension - I used that hook to extract contents from
binary files so the user can then search them.
Maybe you can add some other hook