Re: Extracting content from document

LukashP Sun, 15 Nov 2009 12:14:06 -0800

Is there a direct way - that actually is the question ;)

The problem with my own text extractors is that I would have to override
every single one I use. That is not a problem technically, but I find that
solution somewhat ugly ;). What is more, the Javadoc here:
http://jackrabbit.apache.org/api/1.4/org/apache/jackrabbit/extractor/MsWordTextExtractor.html
says that this method should only throw an exception on transient errors.


I thought about hacking
org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor (maybe there
would be only one class to change), but then I think I would be forced to do
some hacking in Jackrabbit jar files.

So, all in all I think I will stick with overriding every extractor I use.
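
For illustration, here is a minimal sketch of what such a reporting wrapper
could look like. It is not Jackrabbit code: the TextExtractor interface below
is a local stand-in for org.apache.jackrabbit.extractor.TextExtractor so the
example compiles on its own, and ExtractionFailureListener and
ReportingTextExtractor are hypothetical names. It uses composition rather than
subclassing MsWordTextExtractor, and it assumes that a failing extractor either
throws a runtime exception or returns empty text after logging (the warning in
the quoted log comes from inside extractText(), which suggests the latter).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

// Local stand-in for the Jackrabbit TextExtractor interface (assumption:
// the real one has the same two methods).
interface TextExtractor {
    String[] getContentTypes();
    Reader extractText(InputStream stream, String type, String encoding)
            throws IOException;
}

// Hypothetical callback, not part of Jackrabbit.
interface ExtractionFailureListener {
    void extractionFailed(String type, String reason);
}

// Hypothetical wrapper: delegates to another extractor and reports failures.
class ReportingTextExtractor implements TextExtractor {
    private final TextExtractor delegate;
    private final ExtractionFailureListener listener;

    ReportingTextExtractor(TextExtractor delegate,
                           ExtractionFailureListener listener) {
        this.delegate = delegate;
        this.listener = listener;
    }

    public String[] getContentTypes() {
        return delegate.getContentTypes();
    }

    public Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        String text;
        try {
            // Buffer the delegate's output so it can be inspected.
            text = readFully(delegate.extractText(stream, type, encoding));
        } catch (RuntimeException e) {
            // The delegate let a runtime exception escape: report it and
            // index nothing. IOExceptions still propagate to the caller.
            listener.extractionFailed(type, e.toString());
            return new StringReader("");
        }
        if (text.trim().isEmpty()) {
            // Empty output is treated as a suspected (swallowed) failure.
            listener.extractionFailed(type, "extractor returned no text");
        }
        return new StringReader(text);
    }

    private static String readFully(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }
}
```

Two caveats on this sketch: it buffers the whole extracted text in memory, and
an empty result is only a hint, since a genuinely empty document would be
reported as well. An extractor named in repository.xml would presumably also
need a no-argument constructor, so a real version would likely still end up as
one small subclass per wrapped extractor.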

Thank you for your reply.

Regards, Luke


Dave Brosius-2 wrote:
> 
> If there's no direct way...   :)
> 
> I suppose you could create your own text extractor that derives from
> MsWordTextExtractor, overrides extractText, and delegates to super in a
> try/catch block.
> 
> Then specify this extractor in your repository.xml file.
> 
> LukashP wrote:
>> Hi,
>> It's my first post here, so please be tolerant of any mistakes :).
>> I'm importing a large group of Word (*.doc) files into a Jackrabbit
>> repository (batch operation). I've set up Jackrabbit so that content is
>> extracted immediately on import (on committing the transaction, to be
>> strict). Most of the files are fine, and MsWordTextExtractor can
>> successfully extract their text content (which allows me to use full-text
>> search later).
>> However, for some of them I have a problem: the content can't be
>> extracted, for whatever reason. That's OK, some of them may be in the
>> wrong format or so, but I would like to know about such a problem
>> immediately.
>> The problem is that when MsWordTextExtractor is not able to extract the
>> content, it only logs a warning about it (and I think that's all; the log
>> is below, I've shown only the significant lines). Is there any way I could
>> know about an extraction failure immediately, when importing?
>>
>> [15:27:50,699] [WARN ] [http-8080-3][PzuSA,demu,BRAK][MsWordTextExtractor.extractText()] Failed to extract Word text content
>> java.lang.ArrayIndexOutOfBoundsException: 59730
>>      at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:475)
>> ...
>> org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
>> ...
>> org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:701)
>> ...
>> org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
>>      at xxxDocumentRepository.addAsImported(xxxDocumentRepository.java:288)
>>
>> I would be thankful for any help.
>>
>> Regards, 
>> Luke
>>
>>   
> 
> 
> 

-- 
View this message in context: 
http://n4.nabble.com/Extracting-content-from-document-tp621776p621866.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.
