Re: Removing ALMOST all text from a pdf

Tilman Hausherr Sun, 02 Dec 2018 01:17:11 -0800

Hi,

No, there isn't. You'd have to look at the logic used in PDFStreamEngine.showText to convert the raw string bytes into readable strings. It also depends on the current font.


And the problem is that a word will often be split across several tokens.

See the section "Why was the ReplaceText example removed?" in
https://pdfbox.apache.org/2.0/migration.html
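
To illustrate the splitting problem, here is a minimal sketch in plain Java. It is not PDFBox code: a List of String and Number objects stands in for the operand of a TJ operator (in PDFBox that would be a COSArray holding COSString and COSNumber elements), and the names TjArrayText, joinFragments and DASHES are made up for this example. It assumes the string fragments are already readable characters, which in a real PDF is only true for simple font encodings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Stand-in sketch: plain Java objects play the role of a TJ operand.
// String = text fragment, Number = kerning adjustment between fragments.
// Assumption: fragments are already decoded text; real PDF strings are
// raw character codes whose meaning depends on the current font.
class TjArrayText {
    // Hypothetical pattern: two or more consecutive dashes.
    static final Pattern DASHES = Pattern.compile("-{2,}");

    // Concatenate the string fragments and skip the kerning numbers.
    static String joinFragments(List<?> tjOperand) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjOperand) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    // Match against the joined text, not against each fragment.
    static boolean matchesMyString(List<?> tjOperand) {
        return DASHES.matcher(joinFragments(tjOperand)).find();
    }

    public static void main(String[] args) {
        // One run of dashes, split into three fragments by kerning values.
        List<Object> dashes = Arrays.asList("--", -20, "-", -20, "--");
        // An ordinary word, also split by a kerning value.
        List<Object> word = Arrays.asList("Hel", -15, "lo");
        System.out.println(joinFragments(dashes) + " " + matchesMyString(dashes));
        System.out.println(joinFragments(word) + " " + matchesMyString(word));
    }
}
```

Joining the fragments only helps within one TJ array. Text that is spread over several Tj/TJ operators would still need to be collected across operators, which this sketch does not attempt.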

Tilman

On 02.12.2018 at 00:03, Nick Westerly wrote:

I'm using the method here to remove text from a document:

http://www.docjar.com/html/api/org/apache/pdfbox/examples/util/RemoveAllText.java.html

And then rendering the page to an image.

I'd like to do exactly as I'm doing, except leave certain pieces of text if
they match a regex pattern (I'm looking for sequences of dashes).

For this part of the parsing, I'd like to implement a method that checks
the textual representation of prevToken, and only removes it if it
doesn't match my string. Are there any helper methods to get the text here
given an element like this (possibly in PDFTextStripper or otherwise)? Or
do I have to manually parse the text?

for (Object token : tokens) {
     if (token instanceof Operator) {
         Operator op = (Operator) token;
         if (op.getName().equals("TJ") || op.getName().equals("Tj")) {
             // the single operand of this operator is the previous token
             Object prevToken = newTokens.get(newTokens.size() - 1);
             if (!matchesMyString(prevToken)) {
                 // drop the operand and skip the operator as well
                 newTokens.remove(newTokens.size() - 1);
                 continue;
             }
             // matching text: keep the operand and fall through so the
             // operator is kept too
         }
     }
     newTokens.add(token);
}

Thanks

Nick
