Is there a direct way? That actually is the question ;) The problem with writing my own text extractors is that I would have to override every single one I use. That is not a problem technically, but I find that solution somewhat ugly ;). What is more, the Javadoc at http://jackrabbit.apache.org/api/1.4/org/apache/jackrabbit/extractor/MsWordTextExtractor.html says that this method should only throw an exception on transient errors.
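As a rough illustration of how the per-extractor overrides could collapse into one class, the sketch below wraps any extractor and records a suspected failure instead of throwing, which stays within the Javadoc contract mentioned above. `Extractor` is a hypothetical stand-in for the `extractText` method of Jackrabbit's `TextExtractor` interface, so the code compiles without Jackrabbit; `ReportingExtractor` and `countFailures` are illustrative names as well. It assumes a failed extraction comes back as empty text, which may not hold for every extractor and will also flag documents that really are empty.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReportingExtractorDemo {

    /** Hypothetical stand-in for the extractText method of Jackrabbit's TextExtractor. */
    interface Extractor {
        Reader extractText(InputStream stream, String type, String encoding) throws IOException;
    }

    /** Wraps any extractor and records suspected failures instead of throwing. */
    static class ReportingExtractor implements Extractor {
        private final Extractor delegate;
        private final List<String> failedTypes = new ArrayList<>();

        ReportingExtractor(Extractor delegate) {
            this.delegate = delegate;
        }

        @Override
        public Reader extractText(InputStream stream, String type, String encoding) throws IOException {
            // Buffer the delegate's output so we can check whether anything was extracted.
            StringBuilder text = new StringBuilder();
            try (Reader reader = delegate.extractText(stream, type, encoding)) {
                char[] buffer = new char[4096];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    text.append(buffer, 0, n);
                }
            }
            // Assumption: an extractor that swallowed an error returns empty text.
            if (text.length() == 0) {
                failedTypes.add(type);
            }
            return new StringReader(text.toString());
        }

        List<String> getFailedTypes() {
            return failedTypes;
        }
    }

    /** Runs one extraction through the wrapper and returns the number of recorded failures. */
    static int countFailures(Extractor delegate) {
        ReportingExtractor extractor = new ReportingExtractor(delegate);
        try {
            extractor.extractText(new ByteArrayInputStream(new byte[] {1, 2, 3}), "application/msword", null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extractor.getFailedTypes().size();
    }

    public static void main(String[] args) {
        // A delegate that extracts text: nothing is recorded.
        System.out.println(countFailures((in, type, enc) -> new StringReader("some text"))); // prints 0
        // A delegate that swallowed an error and returned empty text: one failure is recorded.
        System.out.println(countFailures((in, type, enc) -> new StringReader("")));          // prints 1
    }
}
```

The importing code could then check the recorded failures after each commit. How such a wrapper would be registered in repository.xml is not shown here, and the buffering keeps the whole extracted text in memory, which may matter for large documents.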
I thought about hacking org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor (maybe there would be only one class to change), but then I think I would be forced to do some hacking in Jackrabbit jar files. So, all in all I think I will stick with overriding every extractor I use.

Thank you for your reply.

Regards,
Luke

Dave Brosius-2 wrote:
>
> If there's no direct way... :)
>
> I suppose you could create your own text extractor that derived from
> MsWordTextExtractor, overrides extractText and delegate to super in a
> try/catch block.
>
> Then specify this extractor in your repository.xml file.
>
> LukashP wrote:
>> Hi,
>> It's my first post here, so please, be tolerant of any mistakes :).
>> I'm importing into Jackrabbit repository a large group of word (*.doc) files
>> (batch operation). I've setup Jackrabbit in a way, that content is extracted
>> immediately along with importing (commiting transaction to be strict).
>> Most of them are fine, and also MsWordExtractor can successfully extract
>> text content (that allows me to use full text search later).
>> However, for some of them I have a problem : The content can't be extracted
>> of whatever reason. That's ok, some of them can be in wrong format or so,
>> but I would like to know about such problem immediately.
>> The problem is, that when MsWordExtractor is not able to extract content, is
>> only logs a warning about it (and i think that's all - log below, i've shown
>> only the significant logs). Is there any way I could know about failure of
>> extraction immediately, when importing ?
>>
>> [15:27:50,699] [WARN ] [http-8080-3][PzuSA,demu,BRAK][MsWordTextExtractor.extractText()] Failed to extract Word text content
>> java.lang.ArrayIndexOutOfBoundsException: 59730
>>     at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:475)
>>     ...
>>     org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
>>     ...
>>     org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:701)
>>     ...
>>     org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
>>     at xxxDocumentRepository.addAsImported(xxxDocumentRepository.java:288)
>>
>> I would be thankful for any help.
>>
>> Regards,
>> Luke
>>
>

--
View this message in context: http://n4.nabble.com/Extracting-content-from-document-tp621776p621866.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.
