Re: Text removal

Maruan Sahyoun Tue, 24 Mar 2015 05:34:18 -0700

> On 24.03.2015 at 12:49, a7med shre3y <a7med.shr...@gmail.com> wrote:
> 
> The question here is how does the text still show up in the output file???


As I wrote earlier, the 'text' is a drawing, i.e. vector graphics, drawn the same
way as the ellipses.
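
To make that concrete, here is a minimal sketch in plain Java (no PDFBox calls) that splits a content-stream fragment into operands and operators and picks out the path construction operators. The class name PathOpScan and its methods are hypothetical and only for illustration; the operator names (m, l, c, v, y, h, re) are the path construction operators from the PDF specification. It assumes the fragment contains only numbers and operators, so strings, arrays and inline images would need a real parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PathOpScan {
    // Path construction operators defined by the PDF specification.
    static final Set<String> PATH_OPS =
            new HashSet<>(Arrays.asList("m", "l", "c", "v", "y", "h", "re"));

    /** Returns "operator operand operand ..." for every path operator found. */
    static List<String> pathCommands(String content) {
        List<String> found = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (isNumber(token)) {
                operands.add(token);      // operands precede their operator
                continue;
            }
            if (PATH_OPS.contains(token)) {
                found.add(operands.isEmpty()
                        ? token
                        : token + " " + String.join(" ", operands));
            }
            operands.clear();             // every operator consumes its operands
        }
        return found;
    }

    static boolean isNumber(String token) {
        try {
            Double.parseDouble(token);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The two commands quoted further down in this thread, plus a stroke.
        String fragment = "738.7469 167.1278 m 733.8743 167.1278 l S";
        for (String command : pathCommands(fragment)) {
            System.out.println(command);
        }
    }
}
```

Knowing which operators build paths is only the first step; deciding which of those paths form a particular glyph is the hard part.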


> I assume the text should have been cached somewhere else in the PDF! I
> don't know if my assumption is correct, do you have any explanation for
> that?
> 
> On Tue, Mar 24, 2015 at 10:46 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
> wrote:
> 
>> 
>>> On 24.03.2015 at 10:43, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>> 
>>> I mean, how do I find them in the PDF while iterating over the tokens? What
>>> is the operator?
>>> 
>>> On Tue, Mar 24, 2015 at 10:40 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>> wrote:
>>> 
>>>> 
>>>>> On 24.03.2015 at 10:36, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>> 
>>>>> What are the drawing commands? I'd then investigate how to identify the
>>>>> ones that draw the text.
>>>>> 
>>>> 
>>>> 738.7469 167.1278 m
>> 
>> MoveTo
>> 
>>>> 733.8743 167.1278 l
>>>> 
>> 
>> LineTo
>> 
>> 
>>>> 
>>>> 
>>>>> On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>> wrote:
>>>>> 
>>>>>> 
>>>>>>> On 24.03.2015 at 10:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>> 
>>>>>>> That's true. I even tried changing the text rendering mode to other
>>>>>>> values, as listed in table 5.3 of the PDF 1.5 specification, before
>>>>>>> removing it, and that didn't work either.
>>>>>>> So how do I remove the graphics content then?
>>>>>> 
>>>>>> The simple answer: remove the drawing commands.
>>>>>> 
>>>>>> The longer answer: as you obviously don't want to remove all drawing
>>>>>> commands, you'd need to find the ones that draw the text. To remove the
>>>>>> vectors that make up a certain character/glyph, you first need to work
>>>>>> out which ones draw, e.g., the letter 'T'. I don't think that is doable
>>>>>> in a reasonable amount of time for arbitrary text.
>>>>>> 
>>>>>> Maruan
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> 
>>>>>>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> You can download it from here:
>>>>>>>>> 
>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Looking more closely, you correctly replaced the text, but that text was
>>>>>>>> only there to make the PDF searchable: it used text rendering mode 3
>>>>>>>> (invisible). The 'text' you are still seeing is drawn using vector
>>>>>>>> commands, so it's graphics content.
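
A minimal sketch of how such invisible text could be spotted, written in plain Java rather than against the PDFBox API. It assumes the content stream has already been split into a flat list of tokens in which operands precede their operator; the class name InvisibleTextScan is hypothetical. The sketch tracks the Tr operator and collects the operand of every Tj shown while the rendering mode is 3. It deliberately ignores q/Q graphics-state save and restore as well as the TJ operator, both of which a real implementation would have to handle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InvisibleTextScan {
    /**
     * Walks a flat token list (operands followed by their operator) and
     * returns the operand of every Tj that is shown while the text
     * rendering mode is 3 (neither filled nor stroked, i.e. invisible).
     */
    static List<String> invisibleStrings(List<String> tokens) {
        List<String> invisible = new ArrayList<>();
        int renderingMode = 0;                       // default mode: fill
        for (int i = 1; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals("Tr")) {
                renderingMode = Integer.parseInt(tokens.get(i - 1));
            } else if (token.equals("Tj") && renderingMode == 3) {
                invisible.add(tokens.get(i - 1));
            }
        }
        return invisible;
    }

    public static void main(String[] args) {
        // Illustrative token list, not taken from the actual file.
        List<String> tokens = Arrays.asList(
                "BT", "3", "Tr", "(7R %H $SSURYHG)", "Tj", "ET");
        System.out.println(invisibleStrings(tokens));
    }
}
```

If a string you removed turns up in such a list, removing it cannot change what is visible on the page.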
>>>>>>>> 
>>>>>>>> BR
>>>>>>>> Maruan
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Best Regards,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG"
>>>>>>>>>>> into "To Be Approved" "encoding". Anyway, whether it's encoding or
>>>>>>>>>>> decoding, I thought it was easier to transform "7R %H $SSURYHG" into
>>>>>>>>>>> "To Be Approved" than the opposite (or at least I don't know how to do
>>>>>>>>>>> the opposite). I spent quite a long time trying to find the character
>>>>>>>>>>> codes for the glyphs in the currently used font, and found that it's
>>>>>>>>>>> not an easy task. By the way, if you know how to do that, I'd very much
>>>>>>>>>>> appreciate it, because I need it for replacing text with other text,
>>>>>>>>>>> and for that the new text must be encoded the same way as the original!
>>>>>>>>>>> 
>>>>>>>>>>> Back to the text removal: I am able to find the text and also remove it
>>>>>>>>>>> by calling reset. As I mentioned in my first email, when I print the
>>>>>>>>>>> output content I don't find the text anymore, but I still see it when I
>>>>>>>>>>> open the file. My first assumption was that there must be some way to
>>>>>>>>>>> remove the text other than the one I am using, and that's what you
>>>>>>>>>>> actually confirmed in your reply, so could you please tell me what is
>>>>>>>>>>> still missing?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Could you upload the PDF with the reset text too?
>>>>>>>>>> 
>>>>>>>>>> BR
>>>>>>>>>> Maruan
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>> a7mad
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Here's how I do it:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. I use the following method to encode the text:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> String encode(String text, PDFont font) throws Exception {
>>>>>>>>>>>>>     StringBuilder builder = new StringBuilder();
>>>>>>>>>>>>>     byte[] stringBytes = text.getBytes();
>>>>>>>>>>>>>     int codeLength = 1;
>>>>>>>>>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
>>>>>>>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
>>>>>>>>>>>>>             codeLength++;
>>>>>>>>>>>>>             c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>         builder.append(c);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     return builder.toString();
>>>>>>>>>>>>> }
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 2. Iterating through the tokens, I find the text, which is either a
>>>>>>>>>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then check
>>>>>>>>>>>>> whether it's the text I want to remove, as follows:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> if (op.getOperation().equals("Tj")) {
>>>>>>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>>>>>>>>>>>     String string = previous.getString();
>>>>>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>>>>>> 
>>>>>>>>>>>> That string is already encoded. So you'd need to encode "To Be Approved"
>>>>>>>>>>>> and check whether that matches the string you are reading from the PDF.
>>>>>>>>>>>> 
>>>>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>>>>>>>>>         previous.reset();
>>>>>>>>>>>>>     }
>>>>>>>>>>>>> } else if (op.getOperation().equals("TJ")) {
>>>>>>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>>>>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
>>>>>>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
>>>>>>>>>>>>>         Object arrElement = previous.getObject(k);
>>>>>>>>>>>>>         if (arrElement instanceof COSString) {
>>>>>>>>>>>>>             COSString cosString = (COSString) arrElement;
>>>>>>>>>>>>>             stringBuilder.append(cosString.getString());
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     String string = stringBuilder.toString();
>>>>>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>>>>>>>>>         previous.clear();
>>>>>>>>>>>>>     }
>>>>>>>>>>>>> }
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Note:
>>>>>>>>>>>>> In the COSArray case, I first iterate through the whole array to collect
>>>>>>>>>>>>> the complete string before encoding and comparing, and this works.
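
The comparison in the quoted code looks for the readable text inside a string that is still in the font's own codes. A minimal sketch of the other direction, encoding the search text first and comparing in encoded form, could look like the following. It is plain Java and does not use PDFBox: the class EncodedSearch is hypothetical, and its lookup table is built only from the one pair of strings quoted in this thread ("To Be Approved" and "7R %H $SSURYHG"). A real implementation would have to derive the mapping from the font's encoding information instead.

```java
import java.util.HashMap;
import java.util.Map;

class EncodedSearch {
    // Character-to-code table learned from one known pair of strings.
    private final Map<Character, Character> codeFor = new HashMap<>();

    EncodedSearch(String plain, String encoded) {
        if (plain.length() != encoded.length()) {
            throw new IllegalArgumentException("sample strings must align");
        }
        for (int i = 0; i < plain.length(); i++) {
            codeFor.put(plain.charAt(i), encoded.charAt(i));
        }
    }

    /** Encodes the search text; fails on characters the table does not know. */
    String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (char ch : text.toCharArray()) {
            Character code = codeFor.get(ch);
            if (code == null) {
                throw new IllegalArgumentException("no code known for '" + ch + "'");
            }
            out.append(code);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        EncodedSearch search = new EncodedSearch("To Be Approved", "7R %H $SSURYHG");
        String rawFromPdf = "7R %H $SSURYHG";   // operand of Tj, as read from the PDF
        String needle = search.encode("To Be Approved");
        System.out.println(rawFromPdf.contains(needle));
    }
}
```

The point of the design is that the raw string from the PDF is never decoded; only the search text is transformed, once, before the loop over the tokens.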
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Your text is encoded, so within the show-text operator Tj the string is
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 7R %H $SSURYHG
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You wrote that you encode your string to find it. What do you get?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> BR
>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Maruan,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Here's a link from where you can download the PDF:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> You need to upload it to a public location, as the mailing list doesn't
>>>>>>>>>>>>>>>> support attachments.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> BR
>>>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Dear Maruan,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thank you very much for the information. Please find attached the PDF
>>>>>>>>>>>>>>>>> to reproduce the problem.
>>>>>>>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>>>>>>>>>>>>>> encoding, so I first encode it in order to find it, then remove it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahy...@fileaffairs.de>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> Dear a7mad,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Removing text from a PDF is not an easy task, as
>>>>>>>>>>>>>>>>>> - text which visually appears as a single item might consist of
>>>>>>>>>>>>>>>>>>   individual parts within the PDF itself, e.g. each character or group
>>>>>>>>>>>>>>>>>>   of characters is placed individually in a different COSString
>>>>>>>>>>>>>>>>>> - text might be drawn using graphics commands
>>>>>>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
>>>>>>>>>>>>>>>>>>   might be the content of a form field AND of the annotation
>>>>>>>>>>>>>>>>>>   representing the form field visually)
>>>>>>>>>>>>>>>>>> - you need to look up the encoding information to get from the
>>>>>>>>>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
>>>>>>>>>>>>>>>>>> ….
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>>>>>>>>>>>>>> detail which string should have been replaced but wasn't, I will be
>>>>>>>>>>>>>>>>>> able to tell you why that might have happened.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shr...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
>>>>>>>>>>>>>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>>>>>>>>>>>>>> COSString.reset() method.
>>>>>>>>>>>>>>>>>>> The problem is that when I open the output PDF file, I still see the
>>>>>>>>>>>>>>>>>>> text, but it is not selectable (when I try to highlight it with the
>>>>>>>>>>>>>>>>>>> mouse to copy it, nothing is selected). When I print the content
>>>>>>>>>>>>>>>>>>> (tokens) of the output file, I DO NOT find the text at all!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I am currently stuck in the PDF 1.5 specification and really running
>>>>>>>>>>>>>>>>>>> out of time.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I'd very much appreciate any help or any idea on what's going on.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Notes:
>>>>>>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1.
>>>>>>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
>>>>>>>>>>>>>>>>>>>    this problem.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thank you very much.
>>>>>>>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail:
>> users-unsubscr...@pdfbox.apache.org
>>>>>>>>>>>>>>>>>> For additional commands, e-mail:
>>>> users-h...@pdfbox.apache.org
