The question is: how does the text still show up in the output file? I assume the text must be cached somewhere else in the PDF, but I don't know whether that assumption is correct. Do you have an explanation for that?
On Tue, Mar 24, 2015 at 10:46 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>
> > On 24.03.2015 at 10:43, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >
> > I mean how to find them in the PDF while iterating over the tokens;
> > what is the operator?
> >
> > On Tue, Mar 24, 2015 at 10:40 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >
> >>> On 24.03.2015 at 10:36, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>
> >>> What are the drawing commands? I'd then investigate how to identify
> >>> the text ones.
> >>
> >> 738.7469 167.1278 m
>
> MoveTo
>
> >> 733.8743 167.1278 l
>
> LineTo
>
> >>> On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>
> >>>>> On 24.03.2015 at 10:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>
> >>>>> That's true. I even tried changing the text rendering mode to the
> >>>>> other values listed in the PDF specs 1.5, table 5.3, before removing
> >>>>> it, and that didn't work either.
> >>>>> So how do I remove the graphics content then?
> >>>>
> >>>> The simple answer: remove the drawing commands.
> >>>>
> >>>> The longer answer: as you obviously don't want to remove all drawing
> >>>> commands, you'd need to find which ones draw the text. As you would
> >>>> like to remove certain vectors that match a certain character/glyph,
> >>>> you first need to find out which ones draw e.g. the letter 'T'. I
> >>>> don't think this is doable in a reasonable amount of time for
> >>>> arbitrary text.
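
As a rough illustration of telling the two kinds of content apart while iterating over tokens, here is a minimal Java sketch. It does not use PDFBox: it assumes the content stream has already been decoded to plain text and simply splits it on whitespace, which only works when string operands contain no spaces. The class OperatorSurvey, its fields, and the sample stream are hypothetical; the operator names (m, l, re, S, f, Tj, TJ, Tr and so on) are the ones defined in the PDF specification. It also ignores that q/Q save and restore the rendering mode.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: counts path operators and text-showing operators
// in an already-decoded content stream (simplified whitespace tokenizer).
final class OperatorSurvey {
    // Path construction and path painting operators.
    static final Set<String> PATH_OPS = new HashSet<>(Arrays.asList(
            "m", "l", "c", "v", "y", "h", "re",
            "S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"));
    // Text-showing operators.
    static final Set<String> TEXT_SHOW_OPS = new HashSet<>(Arrays.asList(
            "Tj", "TJ", "'", "\""));

    int pathOps;              // vector drawing commands seen
    int textShowOps;          // text-showing commands seen
    int invisibleTextShowOps; // text shown while rendering mode 3 is active

    static OperatorSurvey scan(String content) {
        OperatorSurvey result = new OperatorSurvey();
        int renderingMode = 0; // default mode: fill text
        String previous = null;
        for (String token : content.trim().split("\\s+")) {
            if (token.equals("Tr") && previous != null) {
                // The operand of Tr is the token right before it.
                renderingMode = Integer.parseInt(previous);
            } else if (TEXT_SHOW_OPS.contains(token)) {
                result.textShowOps++;
                if (renderingMode == 3) {
                    result.invisibleTextShowOps++;
                }
            } else if (PATH_OPS.contains(token)) {
                result.pathOps++;
            }
            previous = token;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up stream: invisible text plus the two path commands quoted above.
        String content = "BT 3 Tr (7R) Tj ET 738.7469 167.1278 m 733.8743 167.1278 l S";
        OperatorSurvey s = scan(content);
        System.out.println("text-showing operators: " + s.textShowOps);
        System.out.println("  of which invisible:   " + s.invisibleTextShowOps);
        System.out.println("path operators:         " + s.pathOps);
    }
}
```

For the sample stream this reports one text-showing operator, drawn in mode 3, and three path operators, which is the pattern described in this thread: an invisible text layer plus vector outlines. It cannot tell which path operators draw a given letter.
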
> >>>>
> >>>> Maruan
> >>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> You can download it from here:
> >>>>>>>
> >>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
> >>>>>>
> >>>>>> Looking more closely, you correctly replaced the text, but that text
> >>>>>> was only there for searching within the PDF, as it used text
> >>>>>> rendering mode 3 (invisible). The 'text' you are still seeing is
> >>>>>> drawn using vector commands, so it's graphics content.
> >>>>>>
> >>>>>> BR
> >>>>>> Maruan
> >>>>>>
> >>>>>>> Best Regards,
> >>>>>>>
> >>>>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>
> >>>>>>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG"
> >>>>>>>>> to "To Be Approved" "encoding". Anyway, whether it's encoding or
> >>>>>>>>> decoding, I thought it was easier to transform "7R %H $SSURYHG" to
> >>>>>>>>> "To Be Approved" than the opposite (or at least I don't know how to
> >>>>>>>>> do the opposite). I spent quite a long time trying to find out how
> >>>>>>>>> to get the character codes for the glyphs in the currently used
> >>>>>>>>> font, and found that it's not an easy task. By the way, if you know
> >>>>>>>>> how to do that, I'd very much appreciate it, because I need it for
> >>>>>>>>> replacing text with other text, and for that the new text must be
> >>>>>>>>> encoded the same way as the original!
> >>>>>>>>>
> >>>>>>>>> Back to the text removal: I am able to find the text and also remove
> >>>>>>>>> it by calling reset. As I mentioned in my first email, when I print
> >>>>>>>>> the output content I don't find the text anymore, but I still see it
> >>>>>>>>> when I open the file. My first assumption was that there must be some
> >>>>>>>>> other way to remove the text than the one I am using, and that's what
> >>>>>>>>> you've actually confirmed in your reply, so could you please tell me
> >>>>>>>>> what is still missing?
> >>>>>>>>
> >>>>>>>> Could you upload the PDF with the reset text too?
> >>>>>>>>
> >>>>>>>> BR
> >>>>>>>> Maruan
> >>>>>>>>
> >>>>>>>>> Thanks and regards,
> >>>>>>>>> a7mad
> >>>>>>>>>
> >>>>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> Here's how I do it:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. I use the following method to encode the text:
> >>>>>>>>>>>
> >>>>>>>>>>> String encode(String text, PDFont font) throws Exception {
> >>>>>>>>>>>     StringBuilder builder = new StringBuilder();
> >>>>>>>>>>>     byte[] stringBytes = text.getBytes();
> >>>>>>>>>>>     int codeLength = 1;
> >>>>>>>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
> >>>>>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
> >>>>>>>>>>>             codeLength++;
> >>>>>>>>>>>             c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>>>         }
> >>>>>>>>>>>         builder.append(c);
> >>>>>>>>>>>     }
> >>>>>>>>>>>     return builder.toString();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> 2. Iterating through the tokens, I find the text, which is either a
> >>>>>>>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then check
> >>>>>>>>>>> whether it's the text I'm looking to remove, as follows:
> >>>>>>>>>>>
> >>>>>>>>>>> if (op.getOperation().equals("Tj")) {
> >>>>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
> >>>>>>>>>>>     String string = previous.getString();
> >>>>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>>>
> >>>>>>>>>> That string is already encoded. So you'd need to encode "To Be
> >>>>>>>>>> Approved" and compare whether that matches the string you are
> >>>>>>>>>> reading from the PDF.
> >>>>>>>>>>
> >>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>>>>>         previous.reset();
> >>>>>>>>>>>     }
> >>>>>>>>>>> } else if (op.getOperation().equals("TJ")) {
> >>>>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
> >>>>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
> >>>>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
> >>>>>>>>>>>         Object arrElement = previous.getObject(k);
> >>>>>>>>>>>         if (arrElement instanceof COSString) {
> >>>>>>>>>>>             COSString cosString = (COSString) arrElement;
> >>>>>>>>>>>             stringBuilder.append(cosString.getString());
> >>>>>>>>>>>         }
> >>>>>>>>>>>     }
> >>>>>>>>>>>     String string = stringBuilder.toString();
> >>>>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>>>>>         previous.clear();
> >>>>>>>>>>>     }
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> Note:
> >>>>>>>>>>> In the COSArray case, I first iterate through the whole array to get
> >>>>>>>>>>> the whole string before encoding and comparison, and this works.
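
The advice quoted above (encode the search text, then compare it with the raw string from the PDF) could look roughly like the following sketch. It does not call PDFBox. The lookup table is a hypothetical stand-in for the font's real character-code mapping and is built only from the one pair of strings quoted in this thread, so it knows just the letters in "To Be Approved". EncodedSearch, tableFromSample, encode and matches are illustrative names, not library API. The sketch also treats each code as a single char, whereas the real codes in this PDF are multi-byte.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: search for text by encoding the search term
// instead of decoding the string read from the content stream.
final class EncodedSearch {

    // Pairs each character of a known decoded string with the character
    // at the same position in the raw string (stand-in for the font mapping).
    static Map<Character, Character> tableFromSample(String decoded, String raw) {
        if (decoded.length() != raw.length()) {
            throw new IllegalArgumentException("sample strings differ in length");
        }
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < decoded.length(); i++) {
            table.put(decoded.charAt(i), raw.charAt(i));
        }
        return table;
    }

    // Returns the encoded form, or null if a character has no known code
    // (for example because the subset font does not contain that glyph).
    static String encode(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            Character code = table.get(c);
            if (code == null) {
                return null;
            }
            out.append(code);
        }
        return out.toString();
    }

    // True if the raw string from the PDF contains the encoded search text.
    static boolean matches(String rawPdfString, String searchText,
                           Map<Character, Character> table) {
        String encoded = encode(searchText, table);
        return encoded != null && rawPdfString.contains(encoded);
    }

    public static void main(String[] args) {
        Map<Character, Character> table =
                tableFromSample("To Be Approved", "7R %H $SSURYHG");
        System.out.println(encode("Approved", table));
        System.out.println(matches("7R %H $SSURYHG", "To Be Approved", table));
        System.out.println(encode("Rejected", table)); // null: 'R' is not in the table
    }
}
```

The same table would also be what replacement text has to go through before it is written back, which is why a word containing letters missing from the table (such as "Rejected" here) cannot be encoded.
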
> >>>>>>>>>>>
> >>>>>>>>>>> Best Regards,
> >>>>>>>>>>> a7mad
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> your text is encoded, so within the show text operator Tj the string is
> >>>>>>>>>>>>
> >>>>>>>>>>>> 7R %H $SSURYHG
> >>>>>>>>>>>>
> >>>>>>>>>>>> You wrote that you encode your string to find it - what do you get?
> >>>>>>>>>>>>
> >>>>>>>>>>>> BR
> >>>>>>>>>>>> Maruan
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Maruan,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here's a link from where you can download the PDF.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Kind Regards,
> >>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> you need to upload it to a public location as the mailing list
> >>>>>>>>>>>>>> doesn't support attachments.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> BR
> >>>>>>>>>>>>>> Maruan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Dear Maruan,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you very much for the information. Please find attached the
> >>>>>>>>>>>>>>> PDF to reproduce the problem.
> >>>>>>>>>>>>>>> The text to remove is: "To Be Approved".
> >>>>>>>>>>>>>>> The text has a multi-byte encoding, so I first encode it in order
> >>>>>>>>>>>>>>> to find it, then remove it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>>>>>>>> Dear a7mad,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> removing text from a PDF is not an easy task as
> >>>>>>>>>>>>>>>> - text which might visually appear as a single item might consist
> >>>>>>>>>>>>>>>>   of individual parts within the PDF itself, e.g. each character
> >>>>>>>>>>>>>>>>   or group of characters is placed individually in different
> >>>>>>>>>>>>>>>>   COSStrings
> >>>>>>>>>>>>>>>> - text might be drawn using graphics commands
> >>>>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
> >>>>>>>>>>>>>>>>   might be content of a form field AND the annotation representing
> >>>>>>>>>>>>>>>>   the form field visually)
> >>>>>>>>>>>>>>>> - you need to look up the encoding information to get from the
> >>>>>>>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
> >>>>>>>>>>>>>>>> ….
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If you can post a specific PDF to a public location and describe
> >>>>>>>>>>>>>>>> in detail which string should have been replaced but wasn't, I
> >>>>>>>>>>>>>>>> will be able to tell you why that might have happened.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Maruan
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Currently I am facing a strange problem removing text from some
> >>>>>>>>>>>>>>>>> PDFs. My program is able to find the text and "remove it" by
> >>>>>>>>>>>>>>>>> calling the COSString.reset() method.
> >>>>>>>>>>>>>>>>> The problem is, when I open the output PDF file, I still see the
> >>>>>>>>>>>>>>>>> text, but it is not selectable (I mean when I try to highlight
> >>>>>>>>>>>>>>>>> it with the mouse to copy it, it's not selectable!). When I
> >>>>>>>>>>>>>>>>> print the content (tokens) of the output file, I DO NOT find the
> >>>>>>>>>>>>>>>>> text at all!!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I am currently stuck in the PDF specifications 1.5 and really
> >>>>>>>>>>>>>>>>> running out of time.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'd so much appreciate any help or any idea on what's going on.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Notes:
> >>>>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs, only some PDFs
> >>>>>>>>>>>>>>>>>    cause this problem.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thank you very much.
> >>>>>>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>>>>>>>>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org