Hi,

I'd just like to thank you all, Maruan, Andreas and Peter, for your help and support. Unfortunately, it seems I have to put this issue aside for a while, as it's quite complex, especially since I don't yet have the knowledge to do it.

Thank you once more, and have a nice weekend :-)
a7mad

On Tue, Mar 24, 2015 at 1:33 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:

> > On 24.03.2015 at 12:49, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >
> > The question here is how does the text still show up in the output file???
>
> As written earlier, the 'text' is a drawing, i.e. vector graphics, drawn the same way as the ellipses.
>
> > I assume the text should have been cached somewhere else in the PDF! I don't know if my assumption is correct, do you have any explanation for that?
> >
> > On Tue, Mar 24, 2015 at 10:46 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >
> >>> On 24.03.2015 at 10:43, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>
> >>> I mean how to find them in the PDF while iterating over the tokens, what is the operator?
> >>>
> >>> On Tue, Mar 24, 2015 at 10:40 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>
> >>>>> On 24.03.2015 at 10:36, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>
> >>>>> What are the drawing commands? I'd then investigate how to identify the text ones.
> >>>>
> >>>> 738.7469 167.1278 m
> >>
> >> MoveTo
> >>
> >>>> 733.8743 167.1278 l
> >>
> >> LineTo
> >>
> >>>>> On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>
> >>>>>>> On 24.03.2015 at 10:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> That's true, I've even tried changing the text rendering mode to other values, as described in the PDF spec 1.5, Table 5.3, before removing it; that didn't work either.
> >>>>>>> So how to remove the graphics content then?
> >>>>>>
> >>>>>> the simple answer - remove the drawing commands.
> >>>>>>
> >>>>>> The longer answer: as you obviously don't want to remove all drawing commands, you'd need to find the ones that draw the text. As you would like to remove certain vectors that match a certain character/glyph, you first need to find out which ones draw e.g. the letter 'T'. I don't think that this is doable in a reasonable amount of time for arbitrary text.
> >>>>>>
> >>>>>> Maruan
> >>>>>>
> >>>>>>> Best Regards,
> >>>>>>>
> >>>>>>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> You can download it from here:
> >>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
> >>>>>>>>
> >>>>>>>> Looking more closely, you correctly replaced the text, but that text was in there for searching within the PDF, as it used text rendering mode 3 (invisible). The 'text' you are still seeing is drawn using vector commands, so it's graphics content.
> >>>>>>>>
> >>>>>>>> BR
> >>>>>>>> Maruan
> >>>>>>>>
> >>>>>>>>> Best Regards,
> >>>>>>>>>
> >>>>>>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>
> >>>>>>>>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> In fact PDFBox calls the operation of transforming "7R %H $SSURYHG" to "To Be Approved" "encoding".
> >>>>>>>>>>> Anyway, whether it's encoding or decoding, I thought it's easier to transform "7R %H $SSURYHG" to "To Be Approved" and not the opposite (or at least I don't know). I spent quite a long time trying to find out how to find the character codes for the glyphs in the currently used font, then I found that it's not an easy task. By the way, if you know how to do that, I'd so much appreciate it, because I need that for replacing text with another text, and for that the new text must be encoded the same way as the original!
> >>>>>>>>>>>
> >>>>>>>>>>> Back to the text removal: I am able to find the text and also remove it by calling reset. As I mentioned in my first email, when I print the output content I don't find the text anymore, but I still see it when I open the file. My first assumption was that there must be some other way to remove the text than the way I am using, and that's what you've actually confirmed in your reply, so could you please tell me what is still missing?
> >>>>>>>>>>
> >>>>>>>>>> Could you upload the PDF with the reset text too?
> >>>>>>>>>>
> >>>>>>>>>> BR
> >>>>>>>>>> Maruan
> >>>>>>>>>>
> >>>>>>>>>>> Thanks and regards,
> >>>>>>>>>>> a7mad
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here's how I do it:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. I use the following method to encode the text:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> String encode(String text, PDFont font) throws Exception {
> >>>>>>>>>>>>>     StringBuilder builder = new StringBuilder();
> >>>>>>>>>>>>>     byte[] stringBytes = text.getBytes();
> >>>>>>>>>>>>>     int codeLength = 1;
> >>>>>>>>>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
> >>>>>>>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
> >>>>>>>>>>>>>             codeLength++;
> >>>>>>>>>>>>>             c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>>>>>         }
> >>>>>>>>>>>>>         builder.append(c);
> >>>>>>>>>>>>>     }
> >>>>>>>>>>>>>     return builder.toString();
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2. Iterating through the tokens, I find the text, whether it's a COSString ("Tj" operator) or a COSArray ("TJ" operator), then check if it's the text I'm looking to remove, as follows:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> if (op.getOperation().equals("Tj")) {
> >>>>>>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
> >>>>>>>>>>>>>     String string = previous.getString();
> >>>>>>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>>>>>
> >>>>>>>>>>>> that string is already encoded.
> >>>>>>>>>>>> So you'd need to encode "To Be Approved" and compare whether that matches the string you are reading from the PDF.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>>>>>>>         previous.reset();
> >>>>>>>>>>>>>     }
> >>>>>>>>>>>>> } else if (op.getOperation().equals("TJ")) {
> >>>>>>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
> >>>>>>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
> >>>>>>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
> >>>>>>>>>>>>>         Object arrElement = previous.getObject(k);
> >>>>>>>>>>>>>         if (arrElement instanceof COSString) {
> >>>>>>>>>>>>>             COSString cosString = (COSString) arrElement;
> >>>>>>>>>>>>>             stringBuilder.append(cosString.getString());
> >>>>>>>>>>>>>         }
> >>>>>>>>>>>>>     }
> >>>>>>>>>>>>>     String string = stringBuilder.toString();
> >>>>>>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>>>>>>>         previous.clear();
> >>>>>>>>>>>>>     }
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Note:
> >>>>>>>>>>>>> In case of COSArray, I first iterate through the whole array to get the whole string before encoding and comparison, and this works.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> your text is encoded, so within the show text operator Tj the string is
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 7R %H $SSURYHG
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You wrote that you encode your string to find it - what do you get?
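The relationship between the stored string "7R %H $SSURYHG" and the visible text "To Be Approved" can be checked character by character. What follows is only a minimal sketch under a loud assumption: in this one sample, every non-space character happens to differ by a constant 29 between the two strings. That is an observation about the strings quoted in this thread, not a PDFBox API and not a general rule; in a real document the mapping comes from the font's encoding or ToUnicode data. The class name, the method names and the pass-through of spaces (the archived mail shows a space, the actual byte in the PDF is unknown) are hypothetical.

```java
class GlyphCodeOffsetDemo {
    // Assumption: offset observed only in the sample string from this thread.
    static final int OBSERVED_OFFSET = 29;

    // Shifts every non-space character by delta; spaces are passed through
    // unchanged because the archived mail does not show the real byte.
    static String shift(String s, int delta) {
        StringBuilder out = new StringBuilder();
        for (char ch : s.toCharArray()) {
            out.append(ch == ' ' ? ch : (char) (ch + delta));
        }
        return out.toString();
    }

    // Character codes as stored in the content stream -> readable text.
    static String toReadable(String codes) {
        return shift(codes, OBSERVED_OFFSET);
    }

    // Readable text -> character codes, the direction needed for searching.
    static String toCodes(String text) {
        return shift(text, -OBSERVED_OFFSET);
    }

    public static void main(String[] args) {
        System.out.println(toReadable("7R %H $SSURYHG"));
        System.out.println(toCodes("To Be Approved"));
    }
}
```

The second call is the direction suggested in the reply: convert the search text into the stored form once, then compare it against the strings read from the PDF.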
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> BR
> >>>>>>>>>>>>>> Maruan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Maruan,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Here's a link from where you can download the PDF.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Kind Regards,
> >>>>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> you need to upload it to a public location as the mailing list doesn't support attachments.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> BR
> >>>>>>>>>>>>>>>> Maruan
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Dear Maruan,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thank you very much for the information. Please find attached the PDF to reproduce the problem.
> >>>>>>>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte encoding, so I first encode it in order to find it, then remove it.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>>>>>>>>>> Dear a7mad,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> removing text from a PDF is not an easy task as
> >>>>>>>>>>>>>>>>>> - text which might visually appear as a single item might consist of individual parts within the PDF itself, e.g. each character or group of characters is placed individually in different COSStrings
> >>>>>>>>>>>>>>>>>> - text might be drawn using graphics commands
> >>>>>>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text might be content of a form field AND the annotation representing the form field visually)
> >>>>>>>>>>>>>>>>>> - you need to look up the encoding information to get from the characters in the PDF "string" to the ones you are looking for
> >>>>>>>>>>>>>>>>>> …
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If you can post a specific PDF to a public location and describe in detail which string should have been replaced but hasn't been, I will be able to tell you why that might have happened.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Maruan
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
> >>>>>>>>>>>>>>>>>>> My program is able to find the text and "remove it" by calling the COSString.reset() method.
> >>>>>>>>>>>>>>>>>>> The problem is, when I open the output PDF file, I still see the text, but it is not selectable (I mean when I try to highlight it with the mouse to copy it, it's not selectable!). When I print the content (tokens) of the output file, I DO NOT find the text at all!!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I am currently stuck in the PDF specification 1.5 and really running out of time.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I'd so much appreciate any help or any idea on what's going on.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Notes:
> >>>>>>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs, only some PDFs cause this problem.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thank you very much.
> >>>>>>>>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>>>>>>>>>>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
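The distinction drawn in the thread, between text-showing operators (Tj, TJ) and path-drawing operators (m for MoveTo, l for LineTo), can be sketched without any PDF library. The following is a minimal illustration, not PDFBox code: the class and method names are hypothetical, the tokenizer is deliberately simplified (no escaped or nested parentheses, no hex strings or arrays), and the sample stream is assembled for illustration from fragments quoted in the thread (the invisible text in rendering mode 3 plus the two path commands).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OperatorSurvey {
    // Operators that show text, and operators that construct paths.
    static final Set<String> TEXT_SHOW =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));
    static final Set<String> PATH_OPS =
            new HashSet<>(Arrays.asList("m", "l", "c", "v", "y", "h", "re"));

    // Splits on whitespace but keeps a (literal string) together as one token.
    // Simplified on purpose: escapes and nested parentheses are not handled.
    static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char ch = content.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++;
                continue;
            }
            int start = i;
            if (ch == '(') {
                int close = content.indexOf(')', i);
                i = (close < 0) ? content.length() : close + 1;
            } else {
                while (i < content.length()
                        && !Character.isWhitespace(content.charAt(i))) {
                    i++;
                }
            }
            tokens.add(content.substring(start, i));
        }
        return tokens;
    }

    // Counts the tokens that belong to the given operator set.
    static int count(List<String> tokens, Set<String> ops) {
        int n = 0;
        for (String t : tokens) {
            if (ops.contains(t)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Illustrative stream: invisible searchable text, then vector drawing.
        String sample = "BT 3 Tr (7R %H $SSURYHG) Tj ET "
                + "738.7469 167.1278 m 733.8743 167.1278 l S";
        List<String> tokens = tokenize(sample);
        System.out.println("text-showing operators: " + count(tokens, TEXT_SHOW));
        System.out.println("path operators: " + count(tokens, PATH_OPS));
    }
}
```

Clearing the operand of the single Tj here would remove the searchable text while leaving the m and l commands, and therefore the visible outline, untouched, which matches the behaviour described in the thread.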