Hi,

Here's how I do it:

1. I use the following method to encode the text:

    // Needs org.apache.pdfbox.pdmodel.font.PDFont
    String encode(String text, PDFont font) throws Exception {
        StringBuilder builder = new StringBuilder();
        // Note: getBytes() uses the platform default charset.
        byte[] stringBytes = text.getBytes();
        int codeLength = 1;
        for (int i = 0; i < stringBytes.length; i += codeLength) {
            // Start every character with a one-byte code.
            codeLength = 1;
            String c = font.encode(stringBytes, i, codeLength);
            // No mapping for one byte: retry as a two-byte code.
            if (c == null && (i + 1 < stringBytes.length)) {
                codeLength++;
                c = font.encode(stringBytes, i, codeLength);
            }
            // Skip unmapped codes instead of appending "null".
            if (c != null) {
                builder.append(c);
            }
        }
        return builder.toString();
    }

2. Iterating through the tokens, I find the text, which is either a COSString ("Tj" operator) or a COSArray ("TJ" operator), and then check whether it is the text I want to remove, as follows:

    // Needs org.apache.pdfbox.cos.COSString and org.apache.pdfbox.cos.COSArray
    if (op.getOperation().equals("Tj")) {
        // "Tj" shows a single string: the operand is the previous token.
        COSString previous = (COSString) tokens.get(j - 1);
        String string = previous.getString();
        String encodedString = encode(string, font);
        if (encodedString.contains("To Be Approved")) {
            previous.reset();
        }
    } else if (op.getOperation().equals("TJ")) {
        // "TJ" shows an array of strings mixed with positioning numbers.
        COSArray previous = (COSArray) tokens.get(j - 1);
        StringBuilder stringBuilder = new StringBuilder();
        for (int k = 0; k < previous.size(); k++) {
            Object arrElement = previous.getObject(k);
            if (arrElement instanceof COSString) {
                COSString cosString = (COSString) arrElement;
                stringBuilder.append(cosString.getString());
            }
        }
        String string = stringBuilder.toString();
        String encodedString = encode(string, font);
        if (encodedString.contains("To Be Approved")) {
            previous.clear();
        }
    }

Note: For a COSArray, I first iterate over the whole array to build the complete string before encoding and comparing it, and this works.

Best Regards,
a7mad

On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:

> Hi,
>
> your text is encoded so within the show text operator Tj the string is
>
> 7R %H $SSURYHG
>
> You wrote that you encode your string to find it - what do you get?
>
> BR
> Maruan
>
> > On 23.03.2015 at 22:01, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >
> > Hi Maruan,
> >
> > Here's a link from where you can download the PDF.
> >
> > https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >
> > Kind Regards,
> > a7mad
> >
> > On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >
> >> Hi,
> >>
> >> you need to upload it to a public location as the mailing list doesn't
> >> support attachments.
> >>
> >> BR
> >> Maruan
> >>
> >>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>
> >>> Dear Maruan,
> >>>
> >>> Thank you very much for the information. Please find herewith attached the PDF to reproduce the problem.
> >>> The text to remove is: "To Be Approved". The text has a multi-byte encoding, so I first encode it in order to find it and then remove it.
> >>>
> >>> Best Regards,
> >>> a7mad
> >>>
> >>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>> Dear a7mad,
> >>>>
> >>>> removing text from a PDF is not an easy task as
> >>>> - text which might visually appear as a single item might consist of individual parts within the PDF itself, e.g. each character or group of characters is placed individually in different COSStrings
> >>>> - text might be drawn using graphics commands
> >>>> - text can appear within different parts of the PDF (e.g. the text might be the content of a form field AND of the annotation representing the form field visually)
> >>>> - you need to look up the encoding information to get from the characters in the PDF "string" to the ones you are looking for
> >>>> ….
> >>>>
> >>>> If you can post a specific PDF to a public location and describe in detail which string should have been replaced but wasn't, I will be able to tell you why that might have happened.
> >>>>
> >>>> Maruan
> >>>>
> >>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Currently I am facing a strange problem removing text from some PDFs.
> >>>>> My program is able to find the text and "remove it" by calling the COSString.reset() method.
> >>>>> The problem is, when I open the output PDF file, I still see the text, but it is not selectable (when I try to highlight it with the mouse to copy it, nothing is selected). When I print the content (tokens) of the output file, I DO NOT find the text at all!!
> >>>>>
> >>>>> I am currently stuck in the PDF specification 1.5 and really running out of time.
> >>>>>
> >>>>> I'd so much appreciate any help or any idea on what's going on.
> >>>>>
> >>>>> Notes:
> >>>>> 1. I use PDFBox 1.7.1
> >>>>> 2. This problem does not occur with all PDFs; only some PDFs cause it.
> >>>>>
> >>>>> Thank you very much.
> >>>>> a7mad
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
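
As a side illustration of the one-byte/two-byte lookup loop in the reply at the top of this thread, here is a minimal, self-contained sketch. It does not use PDFBox: the class name CodeLengthSketch, the code table and the lookup() helper are hypothetical stand-ins for a font's mapping, chosen only to show the idea of trying a one-byte code first, falling back to two bytes, and starting again at one byte for the next character.

```java
import java.util.HashMap;
import java.util.Map;

class CodeLengthSketch {
    // Hypothetical code table standing in for a font's mapping.
    // Keys are the code bytes written as upper-case hex.
    static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("41", "A");    // a one-byte code
        TABLE.put("8001", "X");  // a two-byte code
    }

    // Returns the text for the code at bytes[offset .. offset+length),
    // or null when the table has no entry (or the range is out of bounds).
    static String lookup(byte[] bytes, int offset, int length) {
        if (offset + length > bytes.length) {
            return null;
        }
        StringBuilder key = new StringBuilder();
        for (int i = offset; i < offset + length; i++) {
            key.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return TABLE.get(key.toString());
    }

    static String decode(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        int codeLength = 1;
        for (int i = 0; i < bytes.length; i += codeLength) {
            codeLength = 1;                       // every character starts as one byte
            String c = lookup(bytes, i, codeLength);
            if (c == null && i + 1 < bytes.length) {
                codeLength = 2;                   // fall back to a two-byte code
                c = lookup(bytes, i, codeLength);
            }
            if (c != null) {                      // unmapped codes are skipped
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {0x41, (byte) 0x80, 0x01, 0x41};
        System.out.println(decode(bytes)); // prints "AXA"
    }
}
```

If the code length were not set back to 1 at the top of the loop, the final 0x41 in this input would be looked up as the start of a two-byte code and would not be found in the table.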