PDFs and detectAngles

Tim Allison Tue, 12 Jan 2021 08:59:15 -0800

This is a follow-up to an earlier discussion.  I compared running our
PDFParser with and without "detectAngles" on the 10k set of PDFs that I've
been using recently.


The "detectAngles" parameter is not related to image processing or OCR.
Rather, when it is set to "true", the PDFParser relies on information about
the orientation of the text runs to stitch the runs back into more accurate
words/lines/sentences.

The results are here:
https://corpora.tika.apache.org/base/reports/detect_angles.tgz

Running with "detectAngles" takes roughly 3.4x as long on the test set of
10k PDFs (10 threads, wall clock: 41 seconds without vs. 141 seconds with).
There was a 0.6% increase in common tokens.  For a few files, the
improvement was _dramatic_.  I also suspect that our unigram/bag-of-words
approach does not capture improvements in multi-word text runs/sentences.
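
To illustrate why a bag-of-words comparison can miss this kind of
improvement, here is a minimal Python sketch.  It is not the code used to
produce the report; the token lists and helper names are made up for the
example.  It shows two extractions with identical unigram counts where only
one preserves the word order, a difference that shows up in bigram overlap.

```python
from collections import Counter


def unigram_counts(tokens):
    """Bag-of-words view: token frequencies, ignoring order."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Order-sensitive view: frequencies of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical tokens: the intended reading order of a rotated text run,
# and the same tokens as they might come out when runs are stitched badly.
reference = ["total", "revenue", "for", "fiscal", "year"]
scrambled = ["fiscal", "total", "year", "for", "revenue"]

# Unigram counts cannot tell the two extractions apart.
same_bag = unigram_counts(reference) == unigram_counts(scrambled)

# Bigram overlap with the reference exposes the broken word order.
shared_bigrams = sum((bigram_counts(reference) & bigram_counts(scrambled)).values())
total_bigrams = sum(bigram_counts(reference).values())

print(f"identical bag of words: {same_bag}")
print(f"bigrams shared with reference: {shared_bigrams} of {total_bigrams}")
```

A metric based on n-grams with n > 1 would be one possible way to capture
improvements in word order that the common-token count hides.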

Given the cost in processing time, I'm slightly inclined not to change our
default of "false" for Tika 2.0.0.  If anyone disagrees, please open an issue.

Cheers,

              Tim
