Re: Extracting content from document

LukashP Wed, 18 Nov 2009 02:30:48 -0800

Quick update:
None of these solutions turned out to be workable.
As it turned out, the classes that call the extractors (higher up in the call
hierarchy) catch everything, even Errors :/
All I got was a different warning.
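
To make the failure mode concrete, here is a minimal sketch of why a delegating wrapper does not help when the caller catches Throwable. It is not Jackrabbit code: TextExtractor below is a local stand-in with the same shape as Jackrabbit's extractText method, FailingWordExtractor simply throws where the real extractor would fail, Indexer is a hypothetical caller, and the wrapper uses composition rather than subclassing MsWordTextExtractor so that it compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ExtractorDemo {

    // Local stand-in shaped like Jackrabbit's TextExtractor.extractText,
    // declared here so the sketch compiles without Jackrabbit on the classpath.
    interface TextExtractor {
        Reader extractText(InputStream stream, String type, String encoding)
                throws IOException;
    }

    // Assumption: plays the role of a Word extractor hitting a corrupt file.
    static class FailingWordExtractor implements TextExtractor {
        public Reader extractText(InputStream stream, String type, String encoding) {
            throw new ArrayIndexOutOfBoundsException("59730");
        }
    }

    // The wrapper idea: delegate inside try/catch and escalate the failure.
    static class StrictExtractor implements TextExtractor {
        private final TextExtractor delegate;

        StrictExtractor(TextExtractor delegate) {
            this.delegate = delegate;
        }

        public Reader extractText(InputStream stream, String type, String encoding)
                throws IOException {
            try {
                return delegate.extractText(stream, type, encoding);
            } catch (RuntimeException e) {
                // Escalate to an Error, hoping it reaches the importing code.
                throw new Error("Text extraction failed", e);
            }
        }
    }

    // Hypothetical caller higher in the hierarchy that catches everything.
    static class Indexer {
        final List<String> warnings = new ArrayList<String>();

        Reader index(TextExtractor extractor, InputStream stream) {
            try {
                return extractor.extractText(stream, "application/msword", null);
            } catch (Throwable t) {
                // Even Errors end up here, so the importer only gets a warning.
                warnings.add("Failed to extract text: " + t.getMessage());
                return new StringReader("");
            }
        }
    }

    public static void main(String[] args) {
        Indexer indexer = new Indexer();
        TextExtractor strict = new StrictExtractor(new FailingWordExtractor());
        // No exception propagates out of index(); the failure is only recorded.
        indexer.index(strict, new ByteArrayInputStream(new byte[0]));
        System.out.println(indexer.warnings);
    }
}
```

In this sketch the call to index() returns normally and the only trace of the failure is the recorded warning, which matches what I saw: a different warning, but nothing the importing code can react to.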


Regards



LukashP wrote:
> 
> Is there a direct way - that actually is the question ;)
> 
> The problem with writing my own text extractors is that I would have to
> override every single one I use. That is not a problem technically, but I
> find that solution somewhat ugly ;). What is more, the javadoc here:
> http://jackrabbit.apache.org/api/1.4/org/apache/jackrabbit/extractor/MsWordTextExtractor.html
> says that this method should only throw an exception on transient errors.
> 
> I thought about hacking
> org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor (maybe
> there would be only one class to change), but then I think I would be
> forced to do some hacking in Jackrabbit jar files.
> 
> So, all in all I think I will stick with overriding every extractor I use.
> 
> Thank you for your reply.
> 
> Regards, Luke
> 
> 
> Dave Brosius-2 wrote:
>> 
>> If there's no direct way...   :)
>> 
>> I suppose you could create your own text extractor that derives from
>> MsWordTextExtractor, overrides extractText, and delegates to super in a
>> try/catch block.
>> 
>> Then specify this extractor in your repository.xml file.
>> 
>> LukashP wrote:
>>> Hi,
>>> It's my first post here, so please be tolerant of any mistakes :).
>>> I'm importing a large group of Word (*.doc) files into a Jackrabbit
>>> repository (batch operation). I've set up Jackrabbit so that content is
>>> extracted immediately on import (on committing the transaction, to be
>>> strict).
>>> Most of the files are fine, and MsWordTextExtractor successfully extracts
>>> their text content (which allows me to use full-text search later).
>>> However, for some of them the content can't be extracted for whatever
>>> reason. That's OK, some of them may be in the wrong format or similar,
>>> but I would like to know about such a problem immediately.
>>> The problem is that when MsWordTextExtractor is not able to extract
>>> content, it only logs a warning about it (and I think that's all; the log
>>> is below, and I've shown only the significant lines). Is there any way I
>>> could know about an extraction failure immediately, when importing?
>>>
>>> [15:27:50,699] [WARN ] [http-8080-3][PzuSA,demu,BRAK][MsWordTextExtractor.extractText()] Failed to extract Word text content
>>> java.lang.ArrayIndexOutOfBoundsException: 59730
>>>     at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:475)
>>> ...
>>> org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
>>> ...
>>> org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:701)
>>> ...
>>> org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
>>>     at xxxDocumentRepository.addAsImported(xxxDocumentRepository.java:288)
>>>
>>> I would be thankful for any help.
>>>
>>> Regards, 
>>> Luke
>>>
>>>   
>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://n4.nabble.com/Extracting-content-from-document-tp621776p623316.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.
