I'm using Jackrabbit as a repository for PDF documents and I have some questions regarding text extraction. I'm using the repository locally, not remotely (RMI, DAV). This is Model 1 as shown at http://jackrabbit.apache.org/deployment-models.html
At http://wiki.apache.org/jackrabbit/Search you can read that: "*Text extraction is done asynchronously in a background thread. That means changed or added text is not available immediately...*". I've also seen the configuration parameters, but I'd like to know a little more about how this thread is started and what is responsible for starting it. Can I keep it from running (for example, during a batch upload of documents)? Can I start it myself? Can anyone give me a hint about this?

Also, I've been getting these two warnings after uploading some PDFs. How can I find out which documents (binary properties) were causing them? Is there a way I can handle these warnings with some sort of listener class?

*WARN * PDFStreamEngine: java.io.IOException: Error: expected hex character and not :32 (PDFStreamEngine.java, line 529)
java.io.IOException: Error: expected hex character and not :32
    at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:316)
    at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:138)
    at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:488)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:363)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:50)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:516)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:229)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
    at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)

*WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 165)
java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
    at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1108)
    at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:573)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:235)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
    at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

Thanks,
Miguel Prieto
