On Wed, 20 May 2015, YukaChan wrote:
Currently I am processing a bunch of documents such as MS Office and
PDF files, and I intend to extract only the text from every document for
further analysis. When Tika meets an encrypted document it gets stuck and
the whole extraction is aborted. It is okay for me to extract the
contents of most documents while skipping the encrypted ones. What am I
supposed to do?
I'd suggest you wrap each call to Tika in a try/catch block. Assuming
you're iterating over lots of files to process, you'd change code like this:
    Tika tika = new Tika();
    for (File file : getFiles()) {
        String text = tika.parseToString(file);
        // Use text
    }
to instead be more like this:
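    Tika tika = new Tika();
    for (File file : getFiles()) {
        try {
            String text = tika.parseToString(file);
            // Use text
        } catch (Exception e) {
            // Log the problem with this one file, then carry on with the next.
            // Assumes "logger" is a java.util.logging.Logger; adapt this
            // line to whatever logging framework you use.
            logger.log(Level.SEVERE, "Can't process " + file, e);
        }
    }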
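
If you also want a list of the files that were skipped, the same idea
can be sketched as below. This is only an illustration, not anything
Tika provides: the class SkipFailingDocuments, the TextExtractor
interface, the extractAll method and the ".enc" check in main are all
made-up names so the example runs without Tika on the classpath. With
Tika available, passing tika::parseToString as the extractor should
work in place of the dummy lambda.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SkipFailingDocuments {

    /** Hypothetical stand-in for a call such as tika.parseToString(file). */
    @FunctionalInterface
    interface TextExtractor {
        String extract(File file) throws Exception;
    }

    /**
     * Extracts text from every file. A file whose extraction throws is
     * added to "failed" and skipped, so one bad document (encrypted,
     * corrupt, ...) does not abort the whole run.
     */
    static Map<File, String> extractAll(List<File> files,
                                        TextExtractor extractor,
                                        List<File> failed) {
        Map<File, String> texts = new LinkedHashMap<>();
        for (File file : files) {
            try {
                texts.put(file, extractor.extract(file));
            } catch (Exception e) {
                // Record the problem file and move on to the next one
                failed.add(file);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        List<File> files = new ArrayList<>();
        for (String arg : args) {
            files.add(new File(arg));
        }

        List<File> failed = new ArrayList<>();
        // Dummy extractor for demonstration only: it pretends that any
        // file whose name ends in ".enc" is encrypted and cannot be read.
        Map<File, String> texts = extractAll(files, f -> {
            if (f.getName().endsWith(".enc")) {
                throw new IOException("encrypted: " + f);
            }
            return "text of " + f.getName();
        }, failed);

        System.out.println("Extracted: " + texts.keySet());
        System.out.println("Skipped:   " + failed);
    }
}
```

Keeping the skipped files in a separate list means you can report them
at the end, or retry them later with a password, without losing the
text from everything else.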
Nick