Re: How can I access to the TextExtractor result?

Sébastien Launay Tue, 24 Nov 2009 10:57:14 -0800

I you  to get their hands dirty

2009/11/24 Paco Avila <[email protected]>:
> Thanks, this is the expected answer :(
>
> Anyway, is there any way to detect a failed text extraction? I know
> I can see the log, but the failure is not associated with a file or path.
>
> Sometimes when I upload a document (Word, PDF, etc.) to my DMS built
> on Jackrabbit, it is not indexed. Office documents seem to be
> especially problematic due to their proprietary format. And the problem is
> that I don't know which documents had problems in their text
> extraction, especially if I use extractorPoolSize > 1.
>
> Perhaps this question should be sent to the development list? I think
> this could be a very useful improvement to Jackrabbit.
>
> On Tue, Nov 24, 2009 at 5:50 PM, Jukka Zitting <[email protected]> wrote:
>> Hi,
>>
>> On Tue, Nov 24, 2009 at 5:37 PM, Paco Avila <[email protected]> wrote:
>>> I wonder if I can access the text produced by the TextExtractor from a
>>> document file (like a PDF, for example)
>>
>> Jackrabbit doesn't store the extracted text anywhere; it is just used
>> to add the document to the inverted Lucene index.
>>
>> You can always use the text extractor directly to get the text
>> content. Check out http://lucene.apache.org/tika/ for more details
>> about the Tika toolkit that we nowadays use for text extraction.
>>
>> BR,
>>
>> Jukka Zitting
>>
>
>
>
> --
> Paco Avila
> OpenKM
> http://www.openkm.com
> http://www.guia-ubuntu.org
>
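
On the question of tying a failed extraction to a path: a minimal sketch,
assuming you run the extractor yourself on each document (for example
right before or after the upload) rather than relying on Jackrabbit's
internal indexing. The ExtractionAudit class, its Extractor interface and
findFailures method are hypothetical names made up for illustration; they
are not part of Jackrabbit or Tika. With Tika, the extractor body could
delegate to something like Tika.parseToString.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExtractionAudit {

    /** Hypothetical stand-in for whatever extractor you call directly. */
    @FunctionalInterface
    public interface Extractor {
        String extract(String path) throws Exception;
    }

    /**
     * Runs the extractor on each path and returns a map of
     * path -> reason for every document that threw an exception
     * or produced no text. Insertion order is preserved.
     */
    public static Map<String, String> findFailures(List<String> paths, Extractor extractor) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String path : paths) {
            try {
                String text = extractor.extract(path);
                if (text == null || text.trim().isEmpty()) {
                    failures.put(path, "no text extracted");
                }
            } catch (Exception e) {
                // Record the failure against the path instead of only logging it.
                failures.put(path, e.getClass().getSimpleName() + ": " + e.getMessage());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Fake extractor for demonstration only: fails on .doc files.
        Extractor fake = path -> {
            if (path.endsWith(".doc")) {
                throw new IOException("unsupported format");
            }
            return "some text";
        };
        Map<String, String> failures = findFailures(Arrays.asList("/a.pdf", "/b.doc"), fake);
        failures.forEach((path, reason) -> System.out.println(path + " -> " + reason));
    }
}
```

Because each call is keyed by its own path, the result stays unambiguous
even if the calls are spread over several threads. This only reports on
extractions you trigger yourself; it does not hook into the extraction
Jackrabbit performs while indexing.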




-- 
Sébastien Launay
