On Wed, 20 May 2015, YukaChan wrote:
I am currently processing a batch of documents, such as MS Office and PDF files, and I want to extract only the text from each one for further analysis. When Tika encounters an encrypted document it gets stuck and the whole extraction is aborted. It is fine for me to extract the contents of most documents and skip the encrypted ones. What am I supposed to do?

I'd suggest you wrap each call to Tika in a try/catch block, so that an exception from one file doesn't abort the whole run. Assuming you're iterating over lots of files to process, you'd change code like

Tika tika = new Tika();
for (File file : getFiles()) {
   String text = tika.parseToString(file);
   // Use text
}

To instead be more like

Tika tika = new Tika();
for (File file : getFiles()) {
   try {
      String text = tika.parseToString(file);
      // Use text
   } catch (Exception e) {
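      // Log the problem with this one file, then carry on with the next
      // (assumes logger is a java.util.logging.Logger; adapt to your logging framework)
      logger.log(Level.SEVERE, "Can't process " + file, e);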
   }
}
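
If you also want to know afterwards which files were skipped, something along these lines might help. It is only a rough sketch of the same per-file try/catch idea: the SkipFailures class, the Extractor interface and the extractAll method are names made up for illustration, not part of Tika, and the fake extractor in main just pretends that files named "secret*" are encrypted. In real code you would presumably pass something like tika::parseToString as the extractor.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

class SkipFailures {
    private static final Logger LOGGER = Logger.getLogger(SkipFailures.class.getName());

    /** Hypothetical stand-in for a call such as tika.parseToString(file). */
    interface Extractor {
        String extract(File file) throws Exception;
    }

    /** Extracts text from every file, recording failures instead of aborting the batch. */
    static Map<File, String> extractAll(List<File> files, Extractor extractor, List<File> failed) {
        Map<File, String> texts = new LinkedHashMap<>();
        for (File file : files) {
            try {
                texts.put(file, extractor.extract(file));
            } catch (Exception e) {
                // One bad file (for example an encrypted one) must not stop the loop
                LOGGER.log(Level.WARNING, "Can't process " + file, e);
                failed.add(file);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Fake extractor (assumption for the demo): "secret*" files count as encrypted
        Extractor fake = file -> {
            if (file.getName().startsWith("secret")) {
                throw new IOException("document is encrypted");
            }
            return "text of " + file.getName();
        };
        List<File> failed = new ArrayList<>();
        List<File> files = Arrays.asList(
                new File("a.pdf"), new File("secret.docx"), new File("b.doc"));
        Map<File, String> texts = extractAll(files, fake, failed);
        System.out.println(texts.size() + " extracted, " + failed.size() + " skipped");
    }
}
```

Collecting the failed files in a list, rather than only logging them, makes it easy to report or retry them once the rest of the batch is done.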

Nick
