I mean, how do I find them in the PDF while iterating over the tokens? What is the operator?
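For reference, a hedged sketch of how operator names could be sorted into groups while walking the token list. The operator names themselves come from the PDF Reference (path construction: m, l, c, v, y, h, re; path painting: S, s, f, F, f*, B, B*, b, b*, n; text showing: Tj, TJ, ' and "). The class and method names (PdfOperatorKinds, classify) are made up for illustration and are not part of PDFBox; the input is assumed to be the string that op.getOperation() returns in the loop quoted further down.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a PDFBox class: groups content stream operator
// names so that drawing commands can be told apart from text commands.
final class PdfOperatorKinds {

    // Operators that build a path: moveto, lineto, curves, closepath, rectangle.
    private static final Set<String> PATH_CONSTRUCTION = new HashSet<String>(
            Arrays.asList("m", "l", "c", "v", "y", "h", "re"));

    // Operators that stroke, fill, or discard the current path.
    private static final Set<String> PATH_PAINTING = new HashSet<String>(
            Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"));

    // Operators that show text strings.
    private static final Set<String> TEXT_SHOWING = new HashSet<String>(
            Arrays.asList("Tj", "TJ", "'", "\""));

    private PdfOperatorKinds() {
    }

    // Returns a coarse label for the given operator name.
    static String classify(String operator) {
        if (PATH_CONSTRUCTION.contains(operator)) {
            return "path-construction";
        }
        if (PATH_PAINTING.contains(operator)) {
            return "path-painting";
        }
        if (TEXT_SHOWING.contains(operator)) {
            return "text-showing";
        }
        return "other";
    }

    public static void main(String[] args) {
        // "m" and "l" are the two operators from the sample lines in this thread.
        for (String op : new String[] {"m", "l", "S", "Tj", "Tr"}) {
            System.out.println(op + " -> " + classify(op));
        }
    }
}
```

Note that such a check only says that an operator builds or paints a path. It does not say which paths happen to form a letter, which is the hard part mentioned in the quoted reply.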
On Tue, Mar 24, 2015 at 10:40 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:

>
> > On 24.03.2015 at 10:36, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >
> > What are the drawing commands? I'd then investigate how to identify the
> > ones that draw the text.
>
> 738.7469 167.1278 m
> 733.8743 167.1278 l
>
> > On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >
> >>> On 24.03.2015 at 10:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>
> >>> That's true, I've even tried changing the text rendering mode to the
> >>> other values listed in Table 5.3 of the PDF 1.5 spec before removing
> >>> it; that didn't work either.
> >>> So how do I remove the graphics content then?
> >>
> >> The simple answer: remove the drawing commands.
> >>
> >> The longer answer: as you obviously don't want to remove all drawing
> >> commands, you'd need to find the ones that draw the text. As you
> >> would like to remove certain vectors that match a certain
> >> character/glyph, you first need to find out which ones draw e.g.
> >> the letter 'T'. I don't think that this is doable in a reasonable
> >> amount of time for arbitrary text.
> >>
> >> Maruan
> >>
> >>> Best Regards,
> >>>
> >>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>
> >>>>> You can download it from here:
> >>>>>
> >>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
> >>>>
> >>>> Looking more closely, you correctly replaced the text, but that text
> >>>> was in there for searching within the PDF, as it used text rendering
> >>>> mode 3 (invisible). The 'text' you are still seeing is drawn using
> >>>> vector commands, so it's graphics content.
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>
> >>>>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> In fact PDFBox calls the operation of transforming "7R %H $SSURYHG"
> >>>>>>> to "To Be Approved" "encoding". Anyway, whether it's encoding or
> >>>>>>> decoding, I thought it's easier to transform "7R %H $SSURYHG" to
> >>>>>>> "To Be Approved" and not the opposite (or at least I don't know
> >>>>>>> how). I spent quite a long time trying to find out how to find the
> >>>>>>> character codes for the glyphs in the currently used font, then I
> >>>>>>> found that it's not an easy task. By the way, if you know how to
> >>>>>>> do that, I'd very much appreciate it, because I need that for
> >>>>>>> replacing text with other text, and for that the new text must be
> >>>>>>> encoded the same way as the original!
> >>>>>>>
> >>>>>>> Back to the text removal: I am able to find the text and also
> >>>>>>> remove it by calling reset. As I mentioned in my first email, when
> >>>>>>> I print the output content I don't find the text anymore, but I
> >>>>>>> still see it when I open the file. My first assumption was that
> >>>>>>> there must be some other way to remove the text than the one I am
> >>>>>>> using, and that's what you've actually confirmed in your reply, so
> >>>>>>> could you please tell me what is still missing?
> >>>>>>
> >>>>>> Could you upload the PDF with the reset text too?
> >>>>>>
> >>>>>> BR
> >>>>>> Maruan
> >>>>>>
> >>>>>>> Thanks and regards,
> >>>>>>> a7mad
> >>>>>>>
> >>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Here's how I do it:
> >>>>>>>>>
> >>>>>>>>> 1. I use the following method to encode the text:
> >>>>>>>>>
> >>>>>>>>> String encode(String text, PDFont font) throws Exception {
> >>>>>>>>>     StringBuilder builder = new StringBuilder();
> >>>>>>>>>     byte[] stringBytes = text.getBytes();
> >>>>>>>>>     int codeLength = 1;
> >>>>>>>>>     for(int i = 0; i < stringBytes.length; i += codeLength){
> >>>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>         if(c == null && (i + 1 < stringBytes.length)){
> >>>>>>>>>             codeLength++;
> >>>>>>>>>             c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>         }
> >>>>>>>>>         builder.append(c);
> >>>>>>>>>     }
> >>>>>>>>>     return builder.toString();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> 2. Iterating through the tokens, I find the text, which is either
> >>>>>>>>> a COSString ("Tj" operator) or a COSArray ("TJ" operator), then
> >>>>>>>>> check whether it's the text I want to remove, as follows:
> >>>>>>>>>
> >>>>>>>>> if (op.getOperation().equals("Tj")) {
> >>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
> >>>>>>>>>     String string = previous.getString();
> >>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>
> >>>>>>>> That string is already encoded. So you'd need to encode "To Be
> >>>>>>>> Approved" and compare whether that matches the string you are
> >>>>>>>> reading from the PDF.
> >>>>>>>>
> >>>>>>>>>     if(encodedString.contains("To Be Approved")){
> >>>>>>>>>         previous.reset();
> >>>>>>>>>     }
> >>>>>>>>> } else if (op.getOperation().equals("TJ")) {
> >>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
> >>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
> >>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
> >>>>>>>>>         Object arrElement = previous.getObject(k);
> >>>>>>>>>         if (arrElement instanceof COSString) {
> >>>>>>>>>             COSString cosString = (COSString) arrElement;
> >>>>>>>>>             stringBuilder.append(cosString.getString());
> >>>>>>>>>         }
> >>>>>>>>>     }
> >>>>>>>>>     String string = stringBuilder.toString();
> >>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>>     if(encodedString.contains("To Be Approved")){
> >>>>>>>>>         previous.clear();
> >>>>>>>>>     }
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> Note:
> >>>>>>>>> In the case of a COSArray, I first iterate through the whole array
> >>>>>>>>> to get the whole string before encoding and comparison, and this
> >>>>>>>>> works.
> >>>>>>>>>
> >>>>>>>>> Best Regards,
> >>>>>>>>> a7mad
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Your text is encoded, so within the show text operator Tj the
> >>>>>>>>>> string is
> >>>>>>>>>>
> >>>>>>>>>> 7R %H $SSURYHG
> >>>>>>>>>>
> >>>>>>>>>> You wrote that you encode your string to find it - what do you
> >>>>>>>>>> get?
> >>>>>>>>>>
> >>>>>>>>>> BR
> >>>>>>>>>> Maruan
> >>>>>>>>>>
> >>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Maruan,
> >>>>>>>>>>>
> >>>>>>>>>>> Here's a link from where you can download the PDF.
> >>>>>>>>>>>
> >>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>>>>>>>>>
> >>>>>>>>>>> Kind Regards,
> >>>>>>>>>>> a7mad
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> You need to upload it to a public location, as the mailing
> >>>>>>>>>>>> list doesn't support attachments.
> >>>>>>>>>>>>
> >>>>>>>>>>>> BR
> >>>>>>>>>>>> Maruan
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Dear Maruan,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you very much for the information. Please find attached
> >>>>>>>>>>>>> the PDF to reproduce the problem.
> >>>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a
> >>>>>>>>>>>>> multi-byte encoding, so I first encode it in order to find
> >>>>>>>>>>>>> it, then remove it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>>>>>>>>>>>> Dear a7mad,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Removing text from a PDF is not an easy task, as
> >>>>>>>>>>>>>> - text which might visually appear as a single item might
> >>>>>>>>>>>>>>   consist of individual parts within the PDF itself, e.g.
> >>>>>>>>>>>>>>   each character or groups of characters are placed
> >>>>>>>>>>>>>>   individually in different COSStrings
> >>>>>>>>>>>>>> - text might be drawn using graphics commands
> >>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g.
> >>>>>>>>>>>>>>   the text might be content of a form field AND the
> >>>>>>>>>>>>>>   annotation representing the form field visually)
> >>>>>>>>>>>>>> - you need to look up the encoding information to get from
> >>>>>>>>>>>>>>   the characters in the PDF "string" to the ones you are
> >>>>>>>>>>>>>>   looking for
> >>>>>>>>>>>>>> ….
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If you can post a specific PDF to a public location and
> >>>>>>>>>>>>>> describe in detail which string should have been replaced
> >>>>>>>>>>>>>> but hasn't been, I will be able to tell you why that might
> >>>>>>>>>>>>>> have happened.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Maruan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shr...@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Currently I am facing a strange problem removing text from
> >>>>>>>>>>>>>>> some PDFs.
> >>>>>>>>>>>>>>> My program is able to find the text and "remove it" by
> >>>>>>>>>>>>>>> calling the COSString.reset() method.
> >>>>>>>>>>>>>>> The problem is, when I open the output PDF file, I still
> >>>>>>>>>>>>>>> see the text, but it is not selectable (I mean when I try
> >>>>>>>>>>>>>>> to highlight it with the mouse to copy it, it's not
> >>>>>>>>>>>>>>> selectable!). When I print the content (tokens) of the
> >>>>>>>>>>>>>>> output file, I DO NOT find the text at all!!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I am currently stuck in the PDF 1.5 specification and
> >>>>>>>>>>>>>>> really running out of time.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'd very much appreciate any help or any idea on what's
> >>>>>>>>>>>>>>> going on.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Notes:
> >>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some
> >>>>>>>>>>>>>>>    PDFs cause this problem.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you very much.
> >>>>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>>>>>>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
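
A side note on the two strings quoted in this thread: comparing "7R %H $SSURYHG" with "To Be Approved" character by character, every non-space character appears shifted by 29 code positions ('7' is 55 and 'T' is 84, 'R' is 82 and 'o' is 111, and so on). The sketch below only demonstrates that observation. The class and method names are invented for illustration, the offset 29 is read off this one sample and is an assumption for any other font or PDF, and leaving spaces unchanged is also an assumption based on how the string appears in the mail text. In general the mapping between character codes and characters comes from the font's encoding information, as mentioned in the first reply, not from a fixed offset.

```java
// Hypothetical demo, not PDFBox code: shows that the two strings from this
// thread differ by a constant offset of 29 for every non-space character.
final class GlyphCodeOffsetDemo {

    private GlyphCodeOffsetDemo() {
    }

    // Shifts every non-space character by the given offset.
    // Spaces are left unchanged (assumption based on the quoted sample).
    static String shift(String text, int offset) {
        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            result.append(ch == ' ' ? ch : (char) (ch + offset));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // String as read from the Tj operand, shifted up by 29.
        System.out.println(shift("7R %H $SSURYHG", 29));
        // Search text shifted down by 29, to compare against the Tj operand.
        System.out.println(shift("To Be Approved", -29));
    }
}
```

This matches the advice in the thread: bring the search text into the same form as the string stored in the PDF and compare those, instead of running the already encoded string through the encoder again.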