Re: How to automatically evaluate the quality of the text extraction result by PDFBox?

2014-05-10 Thread Peter Murray-Rust
There is a great deal of formal activity in this area: see TREC (http://en.wikipedia.org/wiki/Text_Retrieval_Conference), which runs competitions and provides metrics. Formally, a lot of effort is required to produce a precise, reproducible number. In simple terms, you need a corpus which has alrea
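
The message above is cut off in this archive, so the following is only a sketch of one common way to turn a reference corpus into a number, not necessarily what the author went on to describe. It assumes you have, for each PDF, a hand-checked reference text, and it scores the extracted text by character-level edit distance after whitespace normalization. The class name `ExtractionScore` and its methods are hypothetical names chosen for illustration; nothing here is part of PDFBox.

```java
public class ExtractionScore {

    // Levenshtein edit distance between two strings, using two rolling rows.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Collapse whitespace runs so layout differences are not counted as errors.
    // (An assumption: for some corpora, line breaks may matter and should be kept.)
    public static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Score in [0, 1]: 1.0 means the extracted text matches the reference exactly.
    public static double accuracy(String extracted, String reference) {
        String e = normalize(extracted);
        String r = normalize(reference);
        int longest = Math.max(e.length(), r.length());
        if (longest == 0) {
            return 1.0; // both empty: nothing to get wrong
        }
        return 1.0 - (double) editDistance(e, r) / longest;
    }

    public static void main(String[] args) {
        // Hypothetical sample strings; in practice the first would come from the extractor.
        String extracted = "Text  extracton quality";
        String reference = "Text extraction quality";
        System.out.println(accuracy(extracted, reference));
    }
}
```

In practice the `extracted` string would be whatever `PDFTextStripper.getText(document)` returns for a given file, and the score would be averaged over the whole corpus. Note that this only measures agreement with a known reference; it does not give a confidence value for a PDF that has no reference text.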

How to automatically evaluate the quality of the text extraction result by PDFBox?

2014-05-10 Thread Qingchao Kong
Hi, I am using PDFBox to extract text from PDF files. For various reasons, PDFBox may produce errors when extracting text from some PDF files. My question is: is there a way to automatically evaluate the quality of the text extraction result? Or can PDFBox offer a conf