> On 24.03.2015 at 10:36, a7med shre3y <a7med.shr...@gmail.com> wrote:
>
> What are the drawing commands? I'd then investigate how to identify the
> text ones.
>
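In a page content stream the drawing commands are the path operators: m (moveto) starts a subpath, l (lineto) adds a straight segment, c/v/y add curves, re adds a rectangle, and operators such as S, f or B paint the result. Below is a rough sketch in plain Java of how one could pick them out of an already decoded content stream. The class name PathOperatorScanner and the whitespace tokenizing are simplifications made up for illustration, not PDFBox API; a real stream also contains strings, arrays and inline images and needs a proper parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of PDFBox.
final class PathOperatorScanner {

    // Path construction, painting and clipping operators from the PDF specification.
    static final Set<String> PATH_OPERATORS = new HashSet<>(Arrays.asList(
            "m", "l", "c", "v", "y", "h", "re",
            "S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n",
            "W", "W*"));

    /** One operator together with the numeric operands that preceded it. */
    static final class Operation {
        final String operator;
        final List<String> operands;

        Operation(String operator, List<String> operands) {
            this.operator = operator;
            this.operands = operands;
        }

        boolean isPathOperation() {
            return PATH_OPERATORS.contains(operator);
        }
    }

    /**
     * Assumption: tokens are separated by whitespace only. Numbers are
     * collected as operands; any other token is treated as an operator.
     */
    static List<Operation> scan(String content) {
        List<Operation> operations = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input
            }
            if (token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                operands.add(token);
            } else {
                operations.add(new Operation(token, operands));
                operands = new ArrayList<>();
            }
        }
        return operations;
    }
}
```

For the input "738.7469 167.1278 m 733.8743 167.1278 l", scan() yields two operations, m and l, each with two operands, and both report isPathOperation() as true. Knowing which of the many path operations in a page belong to a given glyph is the hard part and is not addressed here.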
738.7469 167.1278 m 733.8743 167.1278 l

> On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
> wrote:
>
>>> On 24.03.2015 at 10:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>
>>> That's true. I even tried changing the text rendering mode to the other
>>> values listed in table 5.3 of the PDF specs 1.5 before removing it, and
>>> that didn't work either.
>>> So how do I remove the graphics content then?
>>
>> The simple answer: remove the drawing commands.
>>
>> The longer answer: as you obviously don't want to remove all drawing
>> commands, you'd need to find the ones that draw the text. Since you
>> would like to remove the vectors that make up a certain
>> character/glyph, you first need to find out which ones draw e.g.
>> the letter 'T'. I don't think that this is doable in a reasonable amount of
>> time for arbitrary text.
>>
>> Maruan
>>
>>> Best Regards,
>>>
>>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>
>>>>> You can download it from here:
>>>>>
>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
>>>>>
>>>>
>>>> Looking more closely, you correctly replaced the text, but that text was
>>>> only there for searching within the PDF, as it used text rendering mode 3
>>>> (invisible). The 'text' you are still seeing is drawn using vector commands,
>>>> so it's graphics content.
>>>>
>>>> BR
>>>> Maruan
>>>>
>>>>> Best Regards,
>>>>>
>>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>> wrote:
>>>>>
>>>>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" to
>>>>>>> "To Be Approved" "encoding".
>>>>>>> Anyway, whether it's encoding or decoding, I
>>>>>>> thought it would be easier to transform "7R %H $SSURYHG" to "To Be Approved"
>>>>>>> than the opposite (or at least I don't know how). I spent quite a long time
>>>>>>> trying to find out how to get the character codes for the glyphs in the
>>>>>>> currently used font, and found that it's not an easy task. By the way,
>>>>>>> if you know how to do that, I'd very much appreciate it, because I need it
>>>>>>> for replacing text with other text, and for that the new text must be
>>>>>>> encoded the same way as the original!
>>>>>>>
>>>>>>> Back to the text removal: I am able to find the text and also remove it by
>>>>>>> calling reset. As I mentioned in my first email, when I print the output
>>>>>>> content I don't find the text anymore, but I still see it when I open the
>>>>>>> file. My first assumption was that there must be some other way to remove
>>>>>>> the text than the way I am using, and that's what you've actually
>>>>>>> confirmed in your reply, so could you please tell me what is still missing?
>>>>>>>
>>>>>>
>>>>>> Could you upload the PDF with the reset text too?
>>>>>>
>>>>>> BR
>>>>>> Maruan
>>>>>>
>>>>>>> Thanks and regards,
>>>>>>> a7mad
>>>>>>>
>>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Here's how I do it:
>>>>>>>>>
>>>>>>>>> 1. I use the following method to encode the text:
>>>>>>>>>
>>>>>>>>> String encode(String text, PDFont font) throws Exception {
>>>>>>>>>     StringBuilder builder = new StringBuilder();
>>>>>>>>>     byte[] stringBytes = text.getBytes();
>>>>>>>>>     int codeLength = 1;
>>>>>>>>>     for(int i = 0; i < stringBytes.length; i += codeLength){
>>>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>         if(c == null && (i + 1 < stringBytes.length)){
>>>>>>>>>             codeLength++;
>>>>>>>>>             c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>         }
>>>>>>>>>         builder.append(c);
>>>>>>>>>     }
>>>>>>>>>     return builder.toString();
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> 2. Iterating through the tokens, I find the text, whether it's a COSString
>>>>>>>>> ("Tj" operator) or a COSArray ("TJ" operator), then check if it's the text
>>>>>>>>> I'm looking to remove, as follows:
>>>>>>>>>
>>>>>>>>> if (op.getOperation().equals("Tj")) {
>>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>>>>>>>     String string = previous.getString();
>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>>
>>>>>>>> That string is already encoded. So you'd need to encode "To Be Approved"
>>>>>>>> and check whether that matches the string you are reading from the PDF.
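To illustrate the encode-then-compare idea with the two strings quoted in this thread: every non-space character of "7R %H $SSURYHG" is the corresponding character of "To Be Approved" shifted down by 29 ('T' is 84, '7' is 55). Below is a minimal sketch under that assumption. The class ShiftedCodeMatcher and the fixed offset are hypothetical and only fit this one sample; for a real font the mapping has to come from the font's encoding or ToUnicode data, not from a constant.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
final class ShiftedCodeMatcher {

    // Assumption: offset read off the sample strings in this thread.
    static final int OBSERVED_OFFSET = 29;

    /**
     * Encodes plain text the way the sample string appears to be encoded.
     * Assumption: spaces are kept as they are, matching how the raw
     * string is shown in this thread.
     */
    static String encode(String text) {
        StringBuilder encoded = new StringBuilder(text.length());
        for (char ch : text.toCharArray()) {
            encoded.append(ch == ' ' ? ' ' : (char) (ch - OBSERVED_OFFSET));
        }
        return encoded.toString();
    }

    /** Encodes the search text once and looks for it in the raw Tj operand. */
    static boolean showsText(String rawOperand, String searchText) {
        return rawOperand.contains(encode(searchText));
    }
}
```

The point of the design is that the raw operand is never decoded: the search text is brought into the font's code space and the comparison happens there, so showsText("7R %H $SSURYHG", "To Be Approved") is true while comparing the raw operand against the plain text directly would not match.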
>>>>>>>>
>>>>>>>>>     if(encodedString.contains("To Be Approved")){
>>>>>>>>>         previous.reset();
>>>>>>>>>     }
>>>>>>>>> } else if (op.getOperation().equals("TJ")) {
>>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
>>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
>>>>>>>>>         Object arrElement = previous.getObject(k);
>>>>>>>>>         if (arrElement instanceof COSString) {
>>>>>>>>>             COSString cosString = (COSString) arrElement;
>>>>>>>>>             stringBuilder.append(cosString.getString());
>>>>>>>>>         }
>>>>>>>>>     }
>>>>>>>>>     String string = stringBuilder.toString();
>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>>>     if(encodedString.contains("To Be Approved")){
>>>>>>>>>         previous.clear();
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Note:
>>>>>>>>> In the COSArray case, I first iterate through the whole array to get the
>>>>>>>>> whole string before encoding and comparison, and this works.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> a7mad
>>>>>>>>>
>>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> your text is encoded, so within the show text operator Tj the string is
>>>>>>>>>>
>>>>>>>>>> 7R %H $SSURYHG
>>>>>>>>>>
>>>>>>>>>> You wrote that you encode your string to find it. What do you get?
>>>>>>>>>>
>>>>>>>>>> BR
>>>>>>>>>> Maruan
>>>>>>>>>>
>>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Maruan,
>>>>>>>>>>>
>>>>>>>>>>> Here's a link from where you can download the PDF.
>>>>>>>>>>>
>>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>> Kind Regards,
>>>>>>>>>>> a7mad
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> you need to upload it to a public location, as the mailing list doesn't
>>>>>>>>>>>> support attachments.
>>>>>>>>>>>>
>>>>>>>>>>>> BR
>>>>>>>>>>>> Maruan
>>>>>>>>>>>>
>>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Maruan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much for the information. Please find attached the PDF
>>>>>>>>>>>>> to reproduce the problem.
>>>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>>>>>>>>>> encoding, so I first encode it in order to find it, then remove it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>>>>>>>>>>>>> Dear a7mad,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> removing text from a PDF is not an easy task, as
>>>>>>>>>>>>>> - text which visually appears as a single item might consist of
>>>>>>>>>>>>>>   individual parts within the PDF itself, e.g. each character or group
>>>>>>>>>>>>>>   of characters is placed individually in a different COSString
>>>>>>>>>>>>>> - text might be drawn using graphics commands
>>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
>>>>>>>>>>>>>>   might be the content of a form field AND of the annotation
>>>>>>>>>>>>>>   representing the form field visually)
>>>>>>>>>>>>>> - you need to look up the encoding information to get from the
>>>>>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
>>>>>>>>>>>>>> ….
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>>>>>>>>>> detail which string should have been replaced but wasn't, I will be
>>>>>>>>>>>>>> able to tell you why that might have happened.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
>>>>>>>>>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>>>>>>>>>> COSString.reset() method.
>>>>>>>>>>>>>>> The problem is, when I open the output PDF file, I still see the text,
>>>>>>>>>>>>>>> but it is not selectable (I mean when I try to highlight it with the
>>>>>>>>>>>>>>> mouse to copy it, it's not selectable!). When I print the content
>>>>>>>>>>>>>>> (tokens) of the output file, I DO NOT find the text at all!!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am currently stuck in the PDF specifications 1.5 and really running
>>>>>>>>>>>>>>> out of time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd very much appreciate any help or any idea on what's going on.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Notes:
>>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you very much.
>>>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>>>>>>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org