I you to get their hands dirty

2009/11/24 Paco Avila <[email protected]>:
> Thanks, this is the expected answer :(
>
> Anyway, is there any way to detect a failed text extraction? I know
> I can look at the log, but the failure is not associated with a file
> or path.
>
> Sometimes when I upload a document (Word, PDF, etc.) to my DMS built
> on Jackrabbit, it is not indexed. Office documents seem to be
> especially problematic due to their proprietary format. And the
> problem is that I don't know which document had problems in its text
> extraction, especially if I use extractorPoolSize > 1.
>
> Perhaps this question should be sent to the development list? I think
> this could be a very useful improvement to Jackrabbit.
>
> On Tue, Nov 24, 2009 at 5:50 PM, Jukka Zitting <[email protected]>
> wrote:
>> Hi,
>>
>> On Tue, Nov 24, 2009 at 5:37 PM, Paco Avila <[email protected]> wrote:
>>> I wonder if I can access the text produced by the TextExtractor from a
>>> document file (like a PDF, for example)
>>
>> Jackrabbit doesn't store the extracted text anywhere, it is just used
>> to add the document to the inverted Lucene index.
>>
>> You can always use the text extractor directly to get the text
>> content. Check out http://lucene.apache.org/tika/ for more details
>> about the Tika toolkit that we nowadays use for text extraction.
>>
>> BR,
>>
>> Jukka Zitting
>>
>
> --
> Paco Avila
> OpenKM
> http://www.openkm.com
> http://www.guia-ubuntu.org
>
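
On the question quoted above (tying a failed extraction to a file or path), one option is to run the extraction yourself, per document, outside the indexer. What follows is only a minimal sketch under that assumption, not something Jackrabbit provides. The names ExtractionAudit, Extractor and findFailures are made up for illustration; Extractor is a stand-in for whatever does the real work, such as a call into Tika.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExtractionAudit {

    /** Hypothetical stand-in for the real extraction call (for example Tika). */
    @FunctionalInterface
    public interface Extractor {
        String extract(String path) throws Exception;
    }

    /**
     * Runs the extractor on each path, one at a time, and returns a map of
     * path -> failure reason for every document that threw an exception or
     * produced no text. Successful documents are not included.
     */
    public static Map<String, String> findFailures(List<String> paths, Extractor extractor) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String path : paths) {
            try {
                String text = extractor.extract(path);
                if (text == null || text.trim().isEmpty()) {
                    // Extraction "worked" but there is nothing to index.
                    failures.put(path, "no text extracted");
                }
            } catch (Exception e) {
                // Keep the path next to the error, which the log does not do.
                failures.put(path, e.getClass().getSimpleName() + ": " + e.getMessage());
            }
        }
        return failures;
    }
}
```

Because each document is processed on its own, every failure stays attached to its path regardless of how the indexer's extractor pool is sized. With Tika on the classpath, the Extractor could probably be something like path -> new Tika().parseToString(new File(path)), assuming your Tika version ships the Tika facade class.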
-- Sébastien Launay
