Thanks Nick, but I have already implemented the try/catch loop; once those files occur, the loop aborts. By the way, I receive a "Java heap space" error when processing one of the .doc files. The file is not large, nor does it contain much content. How can I avoid similar errors?
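For reference, my loop follows Nick's second example below. One variant I am considering (only a rough sketch; getFiles() and logger are the placeholders from that example) catches Throwable instead of Exception, since java.lang.OutOfMemoryError is an Error and slips straight past catch (Exception e):

    Tika tika = new Tika();
    for (File file : getFiles()) {
        try {
            String text = tika.parseToString(file);
            // Use text
        } catch (Throwable t) {
            // Throwable also covers OutOfMemoryError; recovering from one is not
            // guaranteed to be safe, but it at least lets the batch keep going
            logger.log(Logger.ERROR, "Can't process " + file, t);
        }
    }

The full trace from that .doc file is: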
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at com.sun.imageio.plugins.png.PNGImageReader.readMetadata(PNGImageReader.java:745)
    at com.sun.imageio.plugins.png.PNGImageReader.getImageMetadata(PNGImageReader.java:1567)
    at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:97)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:126)
    at org.apache.tika.parser.microsoft.WordExtractor.handlePictureCharacterRun(WordExtractor.java:500)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:270)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:197)
    at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:170)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:100)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:465)
    at org.apache.tika.Tika.parseToString(Tika.java:577)
    at org.apache.tika.Tika.parseToString(Tika.java:557)

On 20 May 2015 at 20:09, "Nick Burch" <[email protected]> wrote:

On Wed, 20 May 2015, YukaChan wrote:
> Currently I am processing a bunch of documents such as MS Office and
> PDF files, and I intend to extract only the text out of every document
> for further analysis. When Tika meets an encrypted document it gets
> stuck and the whole extraction is aborted. It would be fine for me to
> extract the contents of most documents and skip the encrypted ones;
> what am I supposed to do?

I'd suggest you wrap each call to Tika in a try/catch block. Assuming
you're iterating over lots of files to process, you'd change code like

    Tika tika = new Tika();
    for (File file : getFiles()) {
        String text = tika.parseToString(file);
        // Use text
    }

to instead be more like

    Tika tika = new Tika();
    for (File file : getFiles()) {
        try {
            String text = tika.parseToString(file);
            // Use text
        } catch (Exception e) {
            // Log the problem with this one file
            logger.log(Logger.ERROR, "Can't process " + file, e);
        }
    }

Nick
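Since the trace shows the OutOfMemoryError happening while Tika parses a PNG embedded in that .doc, another option I am looking at is to skip embedded resources entirely by using the parser API instead of the Tika facade. The sketch below is only an assumption on my part (SkipEmbeddedExtractor is my own name for the helper, and it relies on the Word parser honouring an EmbeddedDocumentExtractor registered in the ParseContext); raising the JVM heap with -Xmx would be the simpler alternative:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.extractor.EmbeddedDocumentExtractor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public class SkipEmbeddedExtractor implements EmbeddedDocumentExtractor {

        // Never descend into embedded resources (pictures, OLE objects, ...)
        public boolean shouldParseEmbedded(Metadata metadata) {
            return false;
        }

        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // Intentionally empty: the embedded stream is ignored
        }

        public static String parseToStringSkippingEmbedded(File file) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            context.set(EmbeddedDocumentExtractor.class, new SkipEmbeddedExtractor());
            InputStream stream = new FileInputStream(file);
            try {
                parser.parse(stream, handler, metadata, context);
            } finally {
                stream.close();
            }
            return handler.toString();
        }
    }

Since I only need the text for analysis, losing the embedded images seems like an acceptable trade-off.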
