Hi,
No, there isn't. You would have to look at the logic used in
PDFStreamEngine.showText to convert the raw bytes into readable strings.
The result also depends on the current font.
Another problem is that a word is often split across several tokens.
See the section "Why was the ReplaceText example removed?" in
https://pdfbox.apache.org/2.0/migration.html
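To illustrate the splitting problem, here is a minimal sketch in plain Java. It does not use PDFBox: the class TjOperandText and its methods are hypothetical names, and the operand of a TJ operator is modelled as a list of String fragments and Number kerning offsets (in PDFBox this would be a COSArray of COSString and COSNumber elements). It assumes the fragments are already decoded to Unicode, which in a real PDF requires the current font, and it treats "a sequence of dashes" as two or more consecutive dashes.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class TjOperandText {
    // Assumption: a "sequence of dashes" means two or more consecutive '-'.
    private static final Pattern DASHES = Pattern.compile("-{2,}");

    // Joins the string fragments of one TJ operand and skips the numeric
    // position adjustments that sit between them.
    // Assumption: each fragment is already decoded text; real content
    // stream bytes must first be decoded with the current font.
    static String join(List<?> operand) {
        StringBuilder sb = new StringBuilder();
        for (Object element : operand) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    // True if the joined text of this single operand contains a dash run.
    static boolean containsDashRun(List<?> operand) {
        return DASHES.matcher(join(operand)).find();
    }
}
```

For an operand such as ["-", -120, "-", 35.5, "-"], no single fragment matches the pattern, but the joined text does. Note that the sketch only looks at one operand; a run of dashes can also be spread over several Tj/TJ operators, which it does not handle.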
Tilman
On 02.12.2018 at 00:03, Nick Westerly wrote:
I'm using the method here to remove text from a document:
http://www.docjar.com/html/api/org/apache/pdfbox/examples/util/RemoveAllText.java.html
And then rendering the page to an image.
I'd like to do exactly as I'm doing, except leave certain pieces of text if
they match a regex pattern (I'm looking for sequences of dashes).
For this part of the parsing, I'd like to implement a method that checks
the textual representations of the prevToken, and only removes it if it
doesn't match my string. Are there any helper methods to get the text here
given an element like this (possibly in PDFTextStripper or elsewhere)? Or
do I have to parse the text manually?
for (Object token : tokens) {
    if (token instanceof Operator) {
        Operator op = (Operator) token;
        if (op.getName().equals("TJ") || op.getName().equals("Tj")) {
            // the single operand of this operator is the previous token
            Object prevToken = newTokens.get(newTokens.size() - 1);
            if (!matchesMyString(prevToken)) {
                // remove the operand and skip the operator as well
                newTokens.remove(newTokens.size() - 1);
                continue;
            }
            // otherwise fall through so the operator is kept with its operand
        }
    }
    newTokens.add(token);
}
Thanks
Nick