On 22.10.2019 at 15:31, alist...@seznam.cz wrote:
Hello,




I was looking at your tool for managing and extracting data from .pdf
documents and I'd like to ask the following. Does your library allow keyword
search/count across multiple .pdf files, as well as counting all words? Also,
is it possible to exclude prepositions, conjunctions and the like?




Could you please give me a short code hint for implementing this task, if
it is possible?




I've read that it was written and designed for Java. Are any of the APIs
compatible with (or usable from) C++?




Have a nice day.




Thank You!

Mark


There is no direct API for what you want to do, and there is none for C++ either: PDFBox is Java only (and maybe usable from some other JVM-based languages).

Counting words in a text isn't really related to PDF itself; it is a matter of text analysis in general. You might want to look at tika-eval, which is part of Apache Tika, which in turn uses PDFBox (and much more).

https://cwiki.apache.org/confluence/display/tika/TikaEval

"The token processor runs language id against content and then selects the appropriate set of common words for its counts. If there is no common words file for a language, then it backs off to the default list, which is currently hardcoded to 'en'."

Even if it isn't 100% what you want, parts of the source code may be useful to you.
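
As a rough hint for the counting part, here is a minimal sketch in plain Java. It is not PDFBox or Tika code: the class WordCounter, its methods and the tiny stop-word list are all hypothetical names made up for illustration. It assumes you have already extracted the text of each PDF into a String (in PDFBox that is typically done with PDFTextStripper.getText on a loaded document, which is not shown here), and that splitting on non-letter, non-digit characters is good enough for your languages.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
final class WordCounter {

    // Illustrative stop-word list (an assumption); a real list would be
    // larger and chosen per language.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "in", "on", "or", "to", "for", "with");

    // Lower-cases the text and splits it on runs of characters that are
    // neither letters nor digits. May return empty tokens at the start.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Counts every word in the text, skipping stop words.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokenize(text)) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Counts whole-word, case-insensitive occurrences of one keyword.
    // Stop words are not filtered here, so any keyword can be searched.
    static int countKeyword(String text, String keyword) {
        String wanted = keyword.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String token : tokenize(text)) {
            if (token.equals(wanted)) {
                hits++;
            }
        }
        return hits;
    }
}
```

For several PDF files you would call countWords (or countKeyword) once per extracted text and add up the per-file results. Keeping the counting separate from the extraction also means the same code works on text from any source.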

Tilman


