Quick update: none of these solutions turned out to be possible to implement. As it turned out, the classes that call the extractors (higher up in the call hierarchy) catch everything, even Errors :/ All I got was a different warning.
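To illustrate the point, here is a minimal, self-contained Java sketch of why escalating the failure from a wrapping extractor does not reach the importing code when the calling layer catches Throwable. Every type in it (Extractor, BrokenExtractor, StrictExtractor, Indexer) is a hypothetical stand-in written for this example; none of them are Jackrabbit classes, and the real calling code may behave differently in detail.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Simulation only: all types below are hypothetical stand-ins, not Jackrabbit classes.
final class ExtractionFailureDemo {

    /** Hypothetical extractor contract: turn a binary stream into text. */
    interface Extractor {
        Reader extractText(InputStream stream) throws IOException;
    }

    /** Stand-in for an extractor that hits a corrupt document. */
    static final class BrokenExtractor implements Extractor {
        @Override
        public Reader extractText(InputStream stream) {
            throw new ArrayIndexOutOfBoundsException("simulated parse failure");
        }
    }

    /** The wrapper idea: delegate inside try/catch and escalate any failure. */
    static final class StrictExtractor implements Extractor {
        private final Extractor delegate;

        StrictExtractor(Extractor delegate) {
            this.delegate = delegate;
        }

        @Override
        public Reader extractText(InputStream stream) throws IOException {
            try {
                return delegate.extractText(stream);
            } catch (RuntimeException e) {
                // Escalate to an Error in the hope that nobody upstream swallows it.
                throw new Error("extraction failed", e);
            }
        }
    }

    /** Stand-in for the calling layer, assumed here to catch Throwable. */
    static final class Indexer {
        final List<String> warnings = new ArrayList<>();

        String index(Extractor extractor, InputStream stream) {
            try (Reader reader = extractor.extractText(stream)) {
                StringBuilder text = new StringBuilder();
                for (int c = reader.read(); c >= 0; c = reader.read()) {
                    text.append((char) c);
                }
                return text.toString();
            } catch (Throwable t) {
                // The failure ends here as a warning; the importer never sees it.
                warnings.add("Failed to extract text: " + t);
                return "";
            }
        }
    }

    public static void main(String[] args) {
        Indexer indexer = new Indexer();
        String text = indexer.index(
                new StrictExtractor(new BrokenExtractor()),
                new ByteArrayInputStream(new byte[0]));
        System.out.println("indexed text: '" + text + "'");
        System.out.println("warnings: " + indexer.warnings);
    }
}
```

In this sketch the call to index() returns normally with empty text, and the only trace of the failure is the entry in the warnings list, which matches the "different warning" outcome described above.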
Regards

LukashP wrote:
>
> Is there a direct way - that actually is the question ;)
>
> The problem with my own text extractors is that I would have to override
> every single one I use. That is not a problem technically, but I find that
> solution somewhat ugly ;). What is more, one can read in the javadoc here:
> http://jackrabbit.apache.org/api/1.4/org/apache/jackrabbit/extractor/MsWordTextExtractor.html
> that this method should only throw Exception on transient errors.
>
> I thought about hacking
> org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor (maybe
> there would be only one class to change), but then I think I would be
> forced to do some hacking in the Jackrabbit jar files.
>
> So, all in all, I think I will stick with overriding every extractor I use.
>
> Thank you for your reply.
>
> Regards, Luke
>
> Dave Brosius-2 wrote:
>>
>> If there's no direct way... :)
>>
>> I suppose you could create your own text extractor that derives from
>> MsWordTextExtractor, overrides extractText and delegates to super in a
>> try/catch block.
>>
>> Then specify this extractor in your repository.xml file.
>>
>> LukashP wrote:
>>> Hi,
>>> It's my first post here, so please be tolerant of any mistakes :).
>>> I'm importing a large group of Word (*.doc) files into a Jackrabbit
>>> repository (batch operation). I've set up Jackrabbit so that content is
>>> extracted immediately during import (on committing the transaction, to
>>> be strict).
>>> Most of them are fine, and MsWordTextExtractor can successfully extract
>>> the text content (which allows me to use full text search later).
>>> However, for some of them I have a problem: the content can't be
>>> extracted for whatever reason. That's OK, some of them may be in the
>>> wrong format or so, but I would like to know about such a problem
>>> immediately.
>>> The problem is that when MsWordTextExtractor is not able to extract
>>> content, it only logs a warning about it (and I think that's all - log
>>> below, I've shown only the significant entries). Is there any way I
>>> could know about the extraction failure immediately, when importing?
>>>
>>> [15:27:50,699] [WARN ] [http-8080-3][PzuSA,demu,BRAK][MsWordTextExtractor.extractText()] Failed to extract Word text content
>>> java.lang.ArrayIndexOutOfBoundsException: 59730
>>> at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:475)
>>> ...
>>> org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
>>> ...
>>> org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:701)
>>> ...
>>> org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
>>> at xxxDocumentRepository.addAsImported(xxxDocumentRepository.java:288)
>>>
>>> I would be thankful for any help.
>>>
>>> Regards,
>>> Luke
>>>
>>
>

--
View this message in context: http://n4.nabble.com/Extracting-content-from-document-tp621776p623316.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.
