On 22.10.2019 at 15:31, alist...@seznam.cz wrote:
Hello,
I was looking at your tool for managing and extracting data from .pdf
documents and I'd like to ask the following. Does your library support keyword
search/count across multiple .pdf files, as well as counting all words (also,
is it possible to exclude prepositions, conjunctions and the like)?
Could you please give me a short code hint for implementing these tasks,
if that is possible?
I've read that it was written and designed for Java. Are any of the APIs
compatible with (or usable from) C++?
Have a nice day.
Thank you!
Mark
There is no direct API for what you want to do, and none for C++ either,
only for Java (and possibly some other JVM-based languages).
Counting words in a text isn't really related to PDF itself; it belongs to
text analysis in general. You might want to look at tika-eval, which is
part of Apache Tika, which in turn uses PDFBox (and much more).
https://cwiki.apache.org/confluence/display/tika/TikaEval
"The token processor runs language id against content and then selects
the appropriate set of common words for its counts. If there is no
common words file for a language, then it backs off to the default list,
which is currently hardcoded to 'en'."
Even if it is not 100% what you want, parts of its source code may be of
use to you.
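
To illustrate the point that the counting itself is plain text analysis, here
is a minimal sketch in plain Java. It assumes the text of each PDF has already
been extracted to a String (for example with PDFBox's
PDFTextStripper.getText(PDDocument)), so it does not touch PDFBox at all. The
class name WordCountSketch, its methods, and the tiny stop word list are all
hypothetical and only for illustration; they are not part of PDFBox or Tika.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
class WordCountSketch {

    // Small, illustrative English stop word list (an assumption, extend as needed).
    static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "and", "at", "but", "by", "for", "from", "in",
            "of", "on", "or", "the", "to", "with");

    // Counts all words across the given texts (one String per document),
    // ignoring case and skipping anything in stopWords.
    static Map<String, Integer> countWords(List<String> texts, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            // Split on any run of characters that is neither a letter nor a digit.
            String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String token : tokens) {
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Looks up a single keyword in the counts, ignoring case.
    static int countKeyword(Map<String, Integer> counts, String keyword) {
        return counts.getOrDefault(keyword.toLowerCase(Locale.ROOT), 0);
    }

    public static void main(String[] args) {
        // Stand-ins for text extracted from two PDF files.
        List<String> texts = List.of(
                "The cat sat on the mat.",
                "A cat and a dog.");
        Map<String, Integer> counts = countWords(texts, STOP_WORDS);
        System.out.println(counts);
        System.out.println(countKeyword(counts, "Cat"));
    }
}
```

Note that a keyword which is itself in the stop word list will always be
reported as 0 here, and that splitting on non-letter characters is a crude
tokenizer; tika-eval's language-aware common word lists go well beyond this.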
Tilman