Re: Text removal

Maruan Sahyoun Tue, 24 Mar 2015 02:07:18 -0700

Hi,

> On 24.03.2015 at 09:55, a7med shre3y <[email protected]> wrote:
> 
> You can download it from here:
> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
>


Looking more closely: you removed the text correctly, but that text was only 
there to make the PDF searchable, as it used text rendering mode 3 
(invisible). The 'text' you are still seeing is drawn using vector graphics 
commands, so it is graphics content and not text.
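
To check for this situation in other files, here is a minimal sketch in plain 
Java (no PDFBox involved). The class and method names are made up for 
illustration, and it assumes the content stream is already available as a 
string whose tokens are separated by whitespace, which a real content stream 
does not guarantee. It tracks the rendering mode set by the Tr operator and 
collects the operands of every Tj that runs while mode 3 is active. It ignores 
q/Q graphics state save/restore, so treat it as a rough diagnostic only.

```java
import java.util.ArrayList;
import java.util.List;

public class InvisibleTextScan {

    /**
     * Returns the operand of every Tj operator that is executed while the
     * text rendering mode is 3 (neither fill nor stroke, i.e. invisible).
     * Assumption: tokens are whitespace-separated and strings contain no
     * spaces; a real content stream needs a proper parser.
     */
    public static List<String> invisibleShows(String contentStream) {
        List<String> hits = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        int renderingMode = 0; // PDF default: fill text
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Tr")) {
                // The operand before Tr is the new rendering mode.
                renderingMode = Integer.parseInt(tokens[i - 1]);
            } else if (tokens[i].equals("Tj") && renderingMode == 3) {
                // The operand before Tj is the (still encoded) string.
                hits.add(tokens[i - 1]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical stream: one invisible and one visible text-showing call.
        String stream = "BT /F1 12 Tf 3 Tr (7R%H$SSURYHG) Tj 0 Tr (Visible) Tj ET";
        System.out.println(invisibleShows(stream));
    }
}
```

If the string you removed shows up in such a list, it was only the invisible 
search layer, and what remains on the page has to be removed as graphics.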

BR
Maruan


> Best Regards,
> 
> 
> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <[email protected]>
> wrote:
> 
>> 
>> 
>>> On 24.03.2015 at 09:40, a7med shre3y <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> In fact PDFBox calls the operation of transforming "7R %H $SSURYHG" to "To
>>> Be Approved" "encoding". Anyway, whether it's encoding or decoding, I
>>> thought it was easier to transform "7R %H $SSURYHG" to "To Be Approved" than
>>> the opposite (or at least I don't know how to do the opposite). I spent
>>> quite a long time trying to find out how to get the character codes for the
>>> glyphs in the currently used font, and found that it's not an easy task. By
>>> the way, if you know how to do that, I'd very much appreciate it, because I
>>> need it for replacing text with other text, and for that the new text must
>>> be encoded the same way as the original!
>>> 
>>> Back to the text removal: I am able to find the text and also remove it by
>>> calling reset. As I mentioned in my first email, when I print the output
>>> content I don't find the text anymore, but I still see it when I open the
>>> file. My first assumption was that there must be some other way to remove
>>> the text than the one I am using, and that's what you've actually
>>> confirmed in your reply, so could you please tell me what is still missing?
>>> 
>> 
>> Could you upload the PDF with the reset text too?
>> 
>> BR
>> Maruan
>> 
>> 
>>> Thanks and regards,
>>> a7mad
>>> 
>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <[email protected]>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>>> On 24.03.2015 at 08:14, a7med shre3y <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Here's how I do it:
>>>>> 
>>>>> 1. I use the following method to encode the text:
>>>>> 
>>>>> String encode(String text, PDFont font) throws Exception {
>>>>>     StringBuilder builder = new StringBuilder();
>>>>>     byte[] stringBytes = text.getBytes();
>>>>>     int codeLength = 1;
>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
>>>>>         String c = font.encode(stringBytes, i, codeLength);
>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
>>>>>             codeLength++;
>>>>>             c = font.encode(stringBytes, i, codeLength);
>>>>>         }
>>>>>         builder.append(c);
>>>>>     }
>>>>>     return builder.toString();
>>>>> }
>>>>> 
>>>>> 2. Iterating through the tokens, I find the text, which is either a
>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then check
>>>>> whether it's the text I want to remove, as follows:
>>>>> 
>>>>> if (op.getOperation().equals("Tj")) {
>>>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>>>     String string = previous.getString();
>>>>>     String encodedString = encode(string, font);
>>>> 
>>>> that string is already encoded. So you'd need to encode "To Be Approved"
>>>> and compare if that matches the string you are reading from the PDF.
>>>> 
>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>         previous.reset();
>>>>>     }
>>>>> } else if (op.getOperation().equals("TJ")) {
>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>>>     StringBuilder stringBuilder = new StringBuilder();
>>>>>     for (int k = 0; k < previous.size(); k++) {
>>>>>         Object arrElement = previous.getObject(k);
>>>>>         if (arrElement instanceof COSString) {
>>>>>             COSString cosString = (COSString) arrElement;
>>>>>             stringBuilder.append(cosString.getString());
>>>>>         }
>>>>>     }
>>>>>     String string = stringBuilder.toString();
>>>>>     String encodedString = encode(string, font);
>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>         previous.clear();
>>>>>     }
>>>>> }
>>>>> 
>>>>> Note:
>>>>> In case of COSArray, I first iterate through the whole array to get the
>>>>> whole string before encoding and comparison and this works.
>>>>> 
>>>>> Best Regards,
>>>>> a7mad
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> your text is encoded so within the show text operator Tj the string is
>>>>>> 
>>>>>> 7R %H $SSURYHG
>>>>>> 
>>>>>> You wrote that you encode your string to find it - what do you get?
>>>>>> 
>>>>>> BR
>>>>>> Maruan
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <[email protected]> wrote:
>>>>>>> 
>>>>>>> Hi Maruan,
>>>>>>> 
>>>>>>> Here's a link from where you can download the PDF.
>>>>>>> 
>>>>>>> 
>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>>>> 
>>>>>>> Kind Regards,
>>>>>>> a7mad
>>>>>>> 
>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> you need to upload it to a public location as the mailing list doesn't
>>>>>>>> support attachments.
>>>>>>>> 
>>>>>>>> BR
>>>>>>>> Maruan
>>>>>>>> 
>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Dear Maruan,
>>>>>>>>> 
>>>>>>>>> Thank you very much for the information. Please find attached the PDF
>>>>>>>>> to reproduce the problem.
>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>>>>>> encoding, so I first call encode on it in order to find it, then
>>>>>>>>> remove it.
>>>>>>>>> 
>>>>>>>>> Best Regards,
>>>>>>>>> a7mad
>>>>>>>>> 
>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>> Dear a7mad,
>>>>>>>>>> 
>>>>>>>>>> removing text from a PDF is not an easy task as
>>>>>>>>>> - text which visually appears as a single item might consist of
>>>>>>>>>>   individual parts within the PDF itself, e.g. each character or
>>>>>>>>>>   group of characters is placed individually in a different COSString
>>>>>>>>>> - text might be drawn using graphics commands
>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
>>>>>>>>>>   might be the content of a form field AND of the annotation
>>>>>>>>>>   representing the form field visually)
>>>>>>>>>> - you need to look up the encoding information to get from the
>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
>>>>>>>>>> ….
>>>>>>>>>> 
>>>>>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>>>>>> detail which string should have been replaced but wasn't, I will be
>>>>>>>>>> able to tell you why that might have happened.
>>>>>>>>>> 
>>>>>>>>>> Maruan
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
>>>>>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>>>>>> COSString.reset() method.
>>>>>>>>>>> The problem is, when I open the output PDF file, I still see the text,
>>>>>>>>>>> but it is not selectable (when I try to highlight it with the mouse to
>>>>>>>>>>> copy it, nothing is selected). When I print the content (tokens) of
>>>>>>>>>>> the output file, I DO NOT find the text at all!!
>>>>>>>>>>> 
>>>>>>>>>>> I am currently stuck in the PDF specification 1.5 and am really
>>>>>>>>>>> running out of time.
>>>>>>>>>>> 
>>>>>>>>>>> I'd so much appreciate any help or any idea on what's going on.
>>>>>>>>>>> 
>>>>>>>>>>> Notes:
>>>>>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause it.
>>>>>>>>>>> 
>>>>>>>>>>> Thank you very much.
>>>>>>>>>>> a7mad
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
>> 
>> 


