Hi,

> On 24.03.2015 at 09:55, a7med shre3y <a7med.shr...@gmail.com> wrote:
>
> You can download it from here:
> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
>

looking more closely, you correctly replaced the text, but that text was in
there for searching within the PDF, as it used text rendering mode 3
(invisible). The 'text' you are still seeing is drawn using vector commands,
so it's graphics content.

BR
Maruan

> Best Regards,
>
> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>
>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" to
>>> "To Be Approved" "encoding". Anyway, whether it's encoding or decoding, I
>>> thought it would be easier to transform "7R %H $SSURYHG" to "To Be
>>> Approved" than the opposite (or at least I don't know how to do the
>>> opposite). I spent quite a long time trying to find out how to get the
>>> character codes for the glyphs in the currently used font, and found that
>>> it's not an easy task. By the way, if you know how to do that, I'd very
>>> much appreciate it, because I need it for replacing text with other text,
>>> and for that the new text must be encoded the same way as the original!
>>>
>>> Back to the text removal: I am able to find the text and also remove it
>>> by calling reset. As I mentioned in my first email, when I print the
>>> output content I don't find the text anymore, but I still see it when I
>>> open the file. My first assumption was that there must be some other way
>>> to remove the text than the one I am using, and that's what you've
>>> actually confirmed in your reply, so could you please tell me what is
>>> still missing?
>>>
>>
>> Could you upload the PDF with the reset text too?
>>
>> BR
>> Maruan
>>
>>> Thanks and regards,
>>> a7mad
>>>
>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>>
>>>> Hi,
>>>>
>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Here's how I do it:
>>>>>
>>>>> 1. I use the following method to encode the text:
>>>>>
>>>>> String encode(String text, PDFont font) throws Exception {
>>>>>     StringBuilder builder = new StringBuilder();
>>>>>     byte[] stringBytes = text.getBytes();
>>>>>     int codeLength = 1;
>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
>>>>>         String c = font.encode(stringBytes, i, codeLength);
>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
>>>>>             codeLength++;
>>>>>             c = font.encode(stringBytes, i, codeLength);
>>>>>         }
>>>>>         builder.append(c);
>>>>>     }
>>>>>     return builder.toString();
>>>>> }
>>>>>
>>>>> 2. Iterating through the tokens, I find the text, whether it's a
>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then check
>>>>> whether it's the text I'm looking to remove, as follows:
>>>>>
>>>>> if (op.getOperation().equals("Tj")) {
>>>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>>>     String string = previous.getString();
>>>>>     String encodedString = encode(string, font);
>>>>
>>>> that string is already encoded. So you'd need to encode "To Be Approved"
>>>> and compare if that matches the string you are reading from the PDF.
>>>>
>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>         previous.reset();
>>>>>     }
>>>>> } else if (op.getOperation().equals("TJ")) {
>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>>>     StringBuilder stringBuilder = new StringBuilder();
>>>>>     for (int k = 0; k < previous.size(); k++) {
>>>>>         Object arrElement = previous.getObject(k);
>>>>>         if (arrElement instanceof COSString) {
>>>>>             COSString cosString = (COSString) arrElement;
>>>>>             stringBuilder.append(cosString.getString());
>>>>>         }
>>>>>     }
>>>>>     String string = stringBuilder.toString();
>>>>>     String encodedString = encode(string, font);
>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>         previous.clear();
>>>>>     }
>>>>> }
>>>>>
>>>>> Note:
>>>>> In the case of a COSArray, I first iterate through the whole array to
>>>>> get the whole string before encoding and comparing, and this works.
>>>>>
>>>>> Best Regards,
>>>>> a7mad
>>>>>
>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> your text is encoded, so within the show text operator Tj the string is
>>>>>>
>>>>>> 7R %H $SSURYHG
>>>>>>
>>>>>> You wrote that you encode your string to find it - what do you get?
>>>>>>
>>>>>> BR
>>>>>> Maruan
>>>>>>
>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Maruan,
>>>>>>>
>>>>>>> Here's a link from where you can download the PDF.
>>>>>>>
>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> a7mad
>>>>>>>
>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> you need to upload it to a public location, as the mailing list
>>>>>>>> doesn't support attachments.
>>>>>>>>
>>>>>>>> BR
>>>>>>>> Maruan
>>>>>>>>
>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Dear Maruan,
>>>>>>>>>
>>>>>>>>> Thank you very much for the information. Please find attached the
>>>>>>>>> PDF to reproduce the problem.
>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>>>>>> encoding, so I first encode it in order to find it, and then remove
>>>>>>>>> it.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> a7mad
>>>>>>>>>
>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>>>>>>>>> Dear a7mad,
>>>>>>>>>>
>>>>>>>>>> removing text from a PDF is not an easy task, as
>>>>>>>>>> - text which visually appears as a single item might consist of
>>>>>>>>>>   individual parts within the PDF itself, e.g. each character or
>>>>>>>>>>   group of characters is placed individually in different COSStrings
>>>>>>>>>> - text might be drawn using graphics commands
>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
>>>>>>>>>>   might be the content of a form field AND of the annotation
>>>>>>>>>>   representing the form field visually)
>>>>>>>>>> - you need to look up the encoding information to get from the
>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
>>>>>>>>>> ….
>>>>>>>>>>
>>>>>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>>>>>> detail which string should have been replaced but wasn't, I will be
>>>>>>>>>> able to tell you why that might have happened.
>>>>>>>>>>
>>>>>>>>>> Maruan
>>>>>>>>>>
>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
>>>>>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>>>>>> COSString.reset() method.
>>>>>>>>>>> The problem is, when I open the output PDF file, I still see the
>>>>>>>>>>> text, but it is not selectable (I mean when I try to highlight it
>>>>>>>>>>> with the mouse to copy it, it's not selectable!). When I print the
>>>>>>>>>>> content (tokens) of the output file, I DO NOT find the text at all!!
>>>>>>>>>>>
>>>>>>>>>>> I am currently stuck in the PDF specification 1.5 and really
>>>>>>>>>>> running out of time.
>>>>>>>>>>>
>>>>>>>>>>> I'd very much appreciate any help or any idea on what's going on.
>>>>>>>>>>>
>>>>>>>>>>> Notes:
>>>>>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
>>>>>>>>>>> this problem.
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much.
>>>>>>>>>>> a7mad
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
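
A minimal sketch of the check behind the reply at the top of this thread, assuming the content stream has already been split into plain string tokens. That is a simplification: PDFBox hands back typed objects (COSString, operator objects) rather than strings. The class name InvisibleTextScan and the method invisibleStrings are hypothetical and not part of PDFBox. The sketch tracks the Tr operator (text rendering mode) and reports every string shown with Tj while the mode is 3, i.e. text that is present for searching but never painted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class InvisibleTextScan {

    /**
     * Returns the operand of every Tj operator that is executed while the
     * text rendering mode is 3 (neither fill nor stroke, i.e. invisible).
     * Tokens are plain strings here (an assumption made for illustration).
     */
    static List<String> invisibleStrings(List<String> tokens) {
        List<String> result = new ArrayList<>();
        Deque<Integer> saved = new ArrayDeque<>(); // modes saved by "q"
        int renderMode = 0;                        // PDF default: fill
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("q")) {
                saved.push(renderMode);            // save graphics state
            } else if (t.equals("Q") && !saved.isEmpty()) {
                renderMode = saved.pop();          // restore graphics state
            } else if (t.equals("Tr") && i > 0) {
                renderMode = Integer.parseInt(tokens.get(i - 1));
            } else if (t.equals("Tj") && i > 0 && renderMode == 3) {
                result.add(tokens.get(i - 1));     // hidden, searchable text
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
            "BT", "/F1", "12", "Tf", "3", "Tr",
            "(7R %H $SSURYHG)", "Tj", "ET");
        System.out.println(invisibleStrings(tokens));
    }
}
```

If a string you reset turns up in this list, removing it only affects searching and copying; whatever is still visible on the page comes from other drawing operators. The TJ operator is left out for brevity and would need the same treatment.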