> On 24.03.2015 at 12:49, a7med shre3y <a7med.shr...@gmail.com> wrote:
>
> The question here is: how does the text still show up in the output file?

As I wrote earlier, the 'text' is a drawing, i.e. vector graphics, drawn the
same way as the ellipses.

> I assume the text must be cached somewhere else in the PDF. I don't know
> whether that assumption is correct; do you have an explanation for it?
>
> On Tue, Mar 24, 2015 at 10:46 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
> wrote:
>
>>> On 24.03.2015 at 10:43, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>
>>> I mean, how do I find them in the PDF while iterating over the tokens?
>>> What is the operator?
>>>
>>> On Tue, Mar 24, 2015 at 10:40 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>> wrote:
>>>
>>>>> On 24.03.2015 at 10:36, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>
>>>>> What are the drawing commands? I'd then investigate how to pick out
>>>>> the ones that draw the text.
>>>>
>>>> 738.7469 167.1278 m
>>
>> MoveTo
>>
>>>> 733.8743 167.1278 l
>>
>> LineTo
>>
>>>>> On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>> wrote:
>>>>>
>>>>>>> On 24.03.2015 at 10:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>
>>>>>>> That's true. I even tried changing the text rendering mode to the
>>>>>>> other values listed in the PDF 1.5 specification, table 5.3, before
>>>>>>> removing it; that didn't work either.
>>>>>>> So how do I remove the graphics content, then?
>>>>>>
>>>>>> The simple answer: remove the drawing commands.
>>>>>>
>>>>>> The longer answer: as you obviously don't want to remove all drawing
>>>>>> commands, you'd need to find the ones that draw the text. Since you
>>>>>> want to remove the vectors that match a certain character/glyph, you
>>>>>> first need to find out which ones draw, e.g., the letter 'T'. I don't
>>>>>> think this is doable in a reasonable amount of time for arbitrary
>>>>>> text.
>>>>>>
>>>>>> Maruan
>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> You can download it from here:
>>>>>>>>>
>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
>>>>>>>>
>>>>>>>> Looking more closely, you correctly replaced the text, but that text
>>>>>>>> was only in there for searching within the PDF: it uses text
>>>>>>>> rendering mode 3 (invisible). The 'text' you are still seeing is
>>>>>>>> drawn using vector commands, so it's graphics content.
>>>>>>>>
>>>>>>>> BR
>>>>>>>> Maruan
>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>>
>>>>>>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> In fact, PDFBox calls the operation of transforming
>>>>>>>>>>> "7R %H $SSURYHG" to "To Be Approved" "encoding". Anyway, whether
>>>>>>>>>>> it's encoding or decoding, I thought it was easier to transform
>>>>>>>>>>> "7R %H $SSURYHG" to "To Be Approved" than the opposite (or at
>>>>>>>>>>> least I don't know how to do the opposite). I spent quite a long
>>>>>>>>>>> time trying to find the character codes for the glyphs in the
>>>>>>>>>>> currently used font, and found that it's not an easy task. By
>>>>>>>>>>> the way, if you know how to do that, I'd very much appreciate
>>>>>>>>>>> it, because I need it for replacing text with other text, and
>>>>>>>>>>> for that the new text must be encoded the same way as the
>>>>>>>>>>> original!
>>>>>>>>>>>
>>>>>>>>>>> Back to the text removal: I am able to find the text and remove
>>>>>>>>>>> it by calling reset. As I mentioned in my first email, when I
>>>>>>>>>>> print the output content I no longer find the text, but I still
>>>>>>>>>>> see it when I open the file. My first assumption was that there
>>>>>>>>>>> must be some other way to remove the text than the one I am
>>>>>>>>>>> using, and that's what you actually confirmed in your reply, so
>>>>>>>>>>> could you please tell me what is still missing?
>>>>>>>>>>
>>>>>>>>>> Could you upload the PDF with the reset text too?
>>>>>>>>>>
>>>>>>>>>> BR
>>>>>>>>>> Maruan
>>>>>>>>>>
>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>> a7mad
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's how I do it:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. I use the following method to encode the text:
>>>>>>>>>>>>>
>>>>>>>>>>>>> String encode(String text, PDFont font) throws Exception {
>>>>>>>>>>>>>     StringBuilder builder = new StringBuilder();
>>>>>>>>>>>>>     byte[] stringBytes = text.getBytes();
>>>>>>>>>>>>>     int codeLength = 1;
>>>>>>>>>>>>>     for(int i = 0; i < stringBytes.length; i += codeLength){
>>>>>>>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>>>>>         if(c == null && (i + 1 < stringBytes.length)){
>>>>>>>>>>>>>             codeLength++;
>>>>>>>>>>>>>             c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>         builder.append(c);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     return builder.toString();
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Iterating through the tokens, I find the text, which is
>>>>>>>>>>>>> either a COSString ("Tj" operator) or a COSArray ("TJ"
>>>>>>>>>>>>> operator), then check whether it's the text I want to remove,
>>>>>>>>>>>>> as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> if (op.getOperation().equals("Tj")) {
>>>>>>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>>>>>>>>>>>     String string = previous.getString();
>>>>>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>>>>>>
>>>>>>>>>>>> That string is already encoded. So you'd need to encode
>>>>>>>>>>>> "To Be Approved" and check whether that matches the string you
>>>>>>>>>>>> are reading from the PDF.
>>>>>>>>>>>>
>>>>>>>>>>>>>     if(encodedString.contains("To Be Approved")){
>>>>>>>>>>>>>         previous.reset();
>>>>>>>>>>>>>     }
>>>>>>>>>>>>> } else if (op.getOperation().equals("TJ")) {
>>>>>>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>>>>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
>>>>>>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
>>>>>>>>>>>>>         Object arrElement = previous.getObject(k);
>>>>>>>>>>>>>         if (arrElement instanceof COSString) {
>>>>>>>>>>>>>             COSString cosString = (COSString) arrElement;
>>>>>>>>>>>>>             stringBuilder.append(cosString.getString());
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     String string = stringBuilder.toString();
>>>>>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>>>>>>>     if(encodedString.contains("To Be Approved")){
>>>>>>>>>>>>>         previous.clear();
>>>>>>>>>>>>>     }
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note:
>>>>>>>>>>>>> In the COSArray case, I first iterate through the whole array
>>>>>>>>>>>>> to build the whole string before encoding and comparing, and
>>>>>>>>>>>>> this works.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun
>>>>>>>>>>>>> <sahy...@fileaffairs.de> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> your text is encoded, so within the show text operator Tj the
>>>>>>>>>>>>>> string is
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 7R %H $SSURYHG
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You wrote that you encode your string to find it. What do you
>>>>>>>>>>>>>> get?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BR
>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Maruan,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here's a link from where you can download the PDF.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun
>>>>>>>>>>>>>>> <sahy...@fileaffairs.de> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> you need to upload it to a public location, as the mailing
>>>>>>>>>>>>>>>> list doesn't support attachments.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BR
>>>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Dear Maruan,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you very much for the information. Please find
>>>>>>>>>>>>>>>>> attached the PDF to reproduce the problem.
>>>>>>>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a
>>>>>>>>>>>>>>>>> multi-byte encoding, so I first encode it in order to find
>>>>>>>>>>>>>>>>> it and then remove it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun
>>>>>>>>>>>>>>>>>> <sahy...@fileaffairs.de> wrote:
>>>>>>>>>>>>>>>>>> Dear a7mad,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> removing text from a PDF is not an easy task, as
>>>>>>>>>>>>>>>>>> - text which visually appears as a single item might
>>>>>>>>>>>>>>>>>>   consist of individual parts within the PDF itself, e.g.
>>>>>>>>>>>>>>>>>>   each character or group of characters is placed
>>>>>>>>>>>>>>>>>>   individually in a different COSString
>>>>>>>>>>>>>>>>>> - text might be drawn using graphics commands
>>>>>>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g.
>>>>>>>>>>>>>>>>>>   the text might be the content of a form field AND the
>>>>>>>>>>>>>>>>>>   annotation representing the form field visually)
>>>>>>>>>>>>>>>>>> - you need to look up the encoding information to get
>>>>>>>>>>>>>>>>>>   from the characters in the PDF "string" to the ones you
>>>>>>>>>>>>>>>>>>   are looking for
>>>>>>>>>>>>>>>>>> ….
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you can post a specific PDF to a public location and
>>>>>>>>>>>>>>>>>> describe in detail which string should have been replaced
>>>>>>>>>>>>>>>>>> but wasn't, I will be able to tell you why that might
>>>>>>>>>>>>>>>>>> have happened.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Currently I am facing a strange problem removing text
>>>>>>>>>>>>>>>>>>> from some PDFs.
>>>>>>>>>>>>>>>>>>> My program is able to find the text and "remove it" by
>>>>>>>>>>>>>>>>>>> calling the COSString.reset() method.
>>>>>>>>>>>>>>>>>>> The problem is, when I open the output PDF file, I still
>>>>>>>>>>>>>>>>>>> see the text, but it is not selectable (I mean, when I
>>>>>>>>>>>>>>>>>>> try to highlight it with the mouse to copy it, it's not
>>>>>>>>>>>>>>>>>>> selectable!). When I print the content (tokens) of the
>>>>>>>>>>>>>>>>>>> output file, I DO NOT find the text at all!!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am currently stuck in the PDF specification 1.5 and
>>>>>>>>>>>>>>>>>>> really running out of time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'd very much appreciate any help or any idea about
>>>>>>>>>>>>>>>>>>> what's going on.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Notes:
>>>>>>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some
>>>>>>>>>>>>>>>>>>> PDFs cause it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you very much.
>>>>>>>>>>>>>>>>>>> a7mad

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
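
The advice given earlier in the thread (encode the search term into the font's code space, then compare it with the raw string read from the Tj/TJ operand) can be sketched roughly as follows. This is only an illustration, not PDFBox code: the class EncodedTextMatch and the methods buildTable and encode are hypothetical names, and the character-to-code table is derived from the single sample pair quoted in the thread ("7R %H $SSURYHG" / "To Be Approved"). In a real document the table would have to come from the font's own encoding information.

```java
import java.util.HashMap;
import java.util.Map;

class EncodedTextMatch {

    // Learns a character -> code table from one known pair of strings.
    // Assumption: one code per character, as in the sample from the thread.
    // A real table would come from the font's encoding data instead.
    static Map<Character, Character> buildTable(String encoded, String decoded) {
        if (encoded.length() != decoded.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < decoded.length(); i++) {
            table.put(decoded.charAt(i), encoded.charAt(i));
        }
        return table;
    }

    // Encodes readable text into the font's code space.
    // Returns null when a character has no known code.
    static String encode(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            Character code = table.get(c);
            if (code == null) {
                return null;
            }
            out.append(code);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw string as it appears in the content stream (from the thread).
        String rawFromPdf = "7R %H $SSURYHG";
        Map<Character, Character> table = buildTable(rawFromPdf, "To Be Approved");

        // Encode the search term, then compare in code space.
        String needle = encode("Approved", table);
        System.out.println(needle);                      // $SSURYHG
        System.out.println(rawFromPdf.contains(needle)); // true
    }
}
```

Comparing in code space avoids decoding every string in the content stream. It only helps with text shown through Tj/TJ, though; glyphs drawn as vector paths (m, l and similar operators) carry no string to compare against.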