Re: PDFBox - Extracting information about keyword/word count and search from multiple PDF files.

Tilman Hausherr Tue, 22 Oct 2019 11:21:25 -0700

On 22.10.2019 at 15:31, alist...@seznam.cz wrote:

Hello,

I was looking at your tool for managing and extracting data from .pdf
documents and I’d like to ask you the following. Does your library allow keyword
search/count across multiple .pdf files, as well as counting all words (also,
is it possible to make an exception for prepositions, conjunctions and such)?

Could you please provide me with a small code hint for implementing this
task, in case it's possible?

I've read it's been written and designed for Java. Are any APIs compatible
(or can be used) with C++?

Have a nice day.

Thank you!

Mark

There is no direct API for what you want to do. And none for C++ either, only Java (and maybe some other JVM-based languages).

Counting words from a text isn't really related to PDF itself, more to text analysis in general. You might want to look at tika-eval, which is part of Apache Tika, which uses PDFBox (and much more).


https://cwiki.apache.org/confluence/display/tika/TikaEval

"The token processor runs language id against content and then selects the appropriate set of common words for its counts. If there is no common words file for a language, then it backs off to the default list, which is currently hardcoded to 'en'."

Even if it may not be 100% of what you want, parts of the source code may be of use to you.
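
As a rough illustration of the text-analysis part only, here is a minimal sketch in plain Java. It assumes the text of each PDF has already been extracted into a String (with PDFBox that would typically be done via PDFTextStripper.getText). The class name WordCounter, its methods, and the tiny stop-word lists are hypothetical and made up for this example; they are not part of PDFBox or tika-eval. The fallback to "en" only loosely mirrors the back-off idea quoted above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not a PDFBox or Tika class.
class WordCounter {
    // Deliberately tiny, made-up stop-word lists keyed by language code.
    static final Map<String, Set<String>> STOP_WORDS = Map.of(
        "en", Set.of("a", "an", "and", "in", "of", "on", "or", "the", "to"),
        "de", Set.of("der", "die", "das", "und", "oder", "in", "zu"));

    // Falls back to the English list when the language has no list of its own.
    static Set<String> stopWordsFor(String lang) {
        return STOP_WORDS.getOrDefault(lang, STOP_WORDS.get("en"));
    }

    // Lower-cases the text and splits on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Counts every word except the stop words of the given language.
    static Map<String, Integer> countWords(String text, String lang) {
        Set<String> stop = stopWordsFor(lang);
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokenize(text)) {
            if (token.isEmpty() || stop.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Counts case-insensitive whole-word occurrences of a single keyword.
    static int countKeyword(String text, String keyword) {
        String wanted = keyword.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String token : tokenize(text)) {
            if (token.equals(wanted)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // In real use this string would be the text extracted from one PDF.
        String text = "The cat and the dog sat on the mat. The cat slept.";
        System.out.println(countWords(text, "en"));
        System.out.println(countKeyword(text, "cat"));
    }
}
```

For several PDF files, one would call these methods once per extracted text and add up the per-file maps. Real stop-word lists are much longer than the ones shown here.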


Tilman


