Interpreting vector and pixel glyphs for characters

2015-03-24 Thread Peter Murray-Rust
On Tue, Mar 24, 2015 at 9:26 AM, Maruan Sahyoun 
wrote:

> ... As you would like to remove certain vectors which are matching a
> certain character/glyph you first need to find out which are the ones
> drawing e.g. the letter 'T'. I don't think that this is doable in a
> reasonable amount of time for arbitrary text.
>
> Maruan

This is true! And it's unfortunately a common problem with PDFs which use
* outline fonts/glyphs
* pixel glyphs
* scanned text

I think it is possible in limited subdomains and we are starting to try to
do this in science/maths. Our approach (
https://bitbucket.org/petermr/diagramanalyzer,
https://bitbucket.org/petermr/imageanalysis,
https://bitbucket.org/petermr/javaocr) is to create tools that recognize
text in common fonts. Unfortunately there is no obvious OCR library for
Java (we looked at all of them; Tesseract is non-native, so we have ended
up extending javaocr).

Scanned typescript can be a nightmare (missing pixels, bleeding across
glyph boundaries, etc.) but sometimes works.
In our approach we try to analyze born-digital glyphs by heuristics rather
than machine learning (which needs retraining for every new font and size).
The vector glyphs have a constant SVG signature for each character, and this
can sometimes be worked out (or mapped by the crowd). The pixel glyphs are
harder: we shrink them to a common skeleton and classify from that. Once
one character is done it's usually possible to recognize it in later
occurrences.
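
The "constant signature" idea can be sketched roughly as follows. This is only an illustration, not the code in the repositories linked above: the class name GlyphSignature, the choice of reducing SVG path data to its sequence of command letters, and the sample outlines are all assumptions made for the example. A signature this coarse will not separate every pair of characters, so a real tool would need more than this.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not taken from the linked repositories.
class GlyphSignature {
    // SVG path command letters. Digits, signs and the 'e' of an exponent are ignored.
    private static final String COMMANDS = "MLHVCSQTAZ";

    // Signatures that have already been identified, mapped to their character.
    private final Map<String, String> known = new HashMap<>();

    /** Reduces SVG path data to its command letters, e.g. "M0 0 L5 0 Z" gives "MLZ". */
    static String signature(String pathData) {
        StringBuilder sb = new StringBuilder();
        for (char c : pathData.toCharArray()) {
            char upper = Character.toUpperCase(c);
            if (COMMANDS.indexOf(upper) >= 0) {
                sb.append(upper);
            }
        }
        return sb.toString();
    }

    /** Records that an outline with this signature draws the given character. */
    void learn(String pathData, String character) {
        known.put(signature(pathData), character);
    }

    /** Returns the character learned for this signature, or "?" if it is unknown. */
    String recognize(String pathData) {
        return known.getOrDefault(signature(pathData), "?");
    }

    public static void main(String[] args) {
        GlyphSignature glyphs = new GlyphSignature();
        // A made-up block letter 'T', identified once by hand.
        glyphs.learn("M0 0 L10 0 L10 2 L6 2 L6 12 L4 12 L4 2 L0 2 Z", "T");
        // The same outline at another position has the same signature.
        System.out.println(glyphs.recognize(
                "M50 20 L60 20 L60 22 L56 22 L56 32 L54 32 L54 22 L50 22 Z"));
    }
}
```

Because the coordinates are dropped, the signature does not change when the glyph is moved or scaled, which is what allows a character identified once to be found again in later occurrences.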

It's early days, but if people are interested in collaborating or have
better solutions we'd be interested (we aren't able to help with casual
problems).

P.




-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069


Re: Text removal

2015-03-24 Thread Maruan Sahyoun

> On 24.03.2015 at 12:49, a7med shre3y wrote:
>
> The question here is how does the text still show up in the output file???

As I wrote earlier, the 'text' is a drawing, i.e. vector graphics, drawn the
same way as the ellipses.


> I assume the text should have been cached somewhere else in the PDF! I
> don't know if my assumption is correct, do you have any explanation for
> that?
> 

Re: Text removal

2015-03-24 Thread a7med shre3y
The question here is how does the text still show up in the output file???
I assume the text should have been cached somewhere else in the PDF! I
don't know if my assumption is correct, do you have any explanation for
that?


Re: Text removal

2015-03-24 Thread Maruan Sahyoun

> On 24.03.2015 at 10:43, a7med shre3y wrote:
>
> I mean, how do I find them in the PDF while iterating over the tokens?
> What is the operator?
>
> On Tue, Mar 24, 2015 at 10:40 AM, Maruan Sahyoun wrote:
>
>>> On 24.03.2015 at 10:36, a7med shre3y wrote:
>>>
>>> What are the drawing commands? I'd then investigate how to identify the
>>> ones that draw text.
>>>
>> 
>> 738.7469 167.1278 m

MoveTo

>> 733.8743 167.1278 l
>> 

LineTo
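
To make the operator names concrete, here is a minimal sketch that labels the path operators in a content stream fragment. It deliberately does not use PDFBox: it assumes the fragment has already been extracted as plain text and contains only numbers and operators (no strings or inline images). The class name PathOperators is made up for the example; the operator meanings are those of the PDF specification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; a real parser must also handle strings, names, etc.
class PathOperators {
    private static final Map<String, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put("m", "MoveTo");
        NAMES.put("l", "LineTo");
        NAMES.put("c", "CurveTo");
        NAMES.put("v", "CurveTo");
        NAMES.put("y", "CurveTo");
        NAMES.put("h", "ClosePath");
        NAMES.put("re", "Rectangle");
        NAMES.put("S", "Stroke");
        NAMES.put("s", "CloseAndStroke");
        NAMES.put("f", "Fill");
        NAMES.put("f*", "FillEvenOdd");
        NAMES.put("B", "FillAndStroke");
        NAMES.put("n", "EndPath");
    }

    private static boolean isNumber(String token) {
        return token.matches("[-+]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    /** Returns one entry per path operator, e.g. "MoveTo[738.7469, 167.1278]". */
    static List<String> describe(String content) {
        List<String> result = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (isNumber(token)) {
                operands.add(token);            // operands precede their operator
            } else {
                if (NAMES.containsKey(token)) {
                    result.add(NAMES.get(token) + operands);
                }
                operands.clear();               // other operators are skipped
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String fragment = "738.7469 167.1278 m\n733.8743 167.1278 l\nS";
        for (String line : describe(fragment)) {
            System.out.println(line);
        }
    }
}
```

In a PDF content stream the operands always come before the operator, which is why the sketch collects numbers first and attaches them once the operator token arrives.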



Re: Text removal

2015-03-24 Thread a7med shre3y
I mean, how do I find them in the PDF while iterating over the tokens? What
is the operator?


Re: Text removal

2015-03-24 Thread Maruan Sahyoun

> On 24.03.2015 at 10:36, a7med shre3y wrote:
>
> What are the drawing commands? I'd then investigate how to identify the
> ones that draw text.
> 

738.7469 167.1278 m
733.8743 167.1278 l




Re: Text removal

2015-03-24 Thread a7med shre3y
What are the drawing commands? I'd then investigate how to identify the
ones that draw text.


Re: Text removal

2015-03-24 Thread Maruan Sahyoun

> On 24.03.2015 at 10:14, a7med shre3y wrote:
>
> That's true. I had even tried changing the text rendering mode to other
> values, as described in the PDF 1.5 specification, Table 5.3, before
> removing it; that didn't work either.
> So how do I remove the graphics content then?

The simple answer: remove the drawing commands.

The longer answer: as you obviously don't want to remove all drawing
commands, you'd need to find which ones draw the text. Since you want to
remove only the vectors that make up a certain character/glyph, you first
need to find out which ones draw e.g. the letter 'T'. I don't think this is
doable in a reasonable amount of time for arbitrary text.

Maruan
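
One limited workaround, when the position of the unwanted 'text' on the page is known, is to drop every path that lies entirely inside that rectangle instead of trying to recognize glyphs. The following is only a sketch under strong assumptions: the content stream is available as plain text containing just numbers and operators, coordinates are used as written (any transformation set with cm is ignored), and the class name PathRegionFilter is made up. It is not a PDFBox API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; ignores the current transformation matrix.
class PathRegionFilter {
    private static final Set<String> CONSTRUCT =
            new HashSet<>(Arrays.asList("m", "l", "c", "v", "y", "h", "re"));
    private static final Set<String> PAINT =
            new HashSet<>(Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"));

    private static boolean isNumber(String token) {
        return token.matches("[-+]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    /** Removes every path whose points all lie inside the box (x0,y0)-(x1,y1). */
    static String removePathsInside(String content,
                                    double x0, double y0, double x1, double y1) {
        List<String> out = new ArrayList<>();
        List<String> operands = new ArrayList<>(); // numbers waiting for their operator
        List<String> path = new ArrayList<>();     // tokens of the path being built
        boolean allInside = true;
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (isNumber(token)) {
                operands.add(token);
                continue;
            }
            if (CONSTRUCT.contains(token)) {
                double[] v = new double[operands.size()];
                for (int i = 0; i < v.length; i++) {
                    v[i] = Double.parseDouble(operands.get(i));
                }
                if (token.equals("re") && v.length == 4) {
                    // x y width height: check the two opposite corners
                    v = new double[] {v[0], v[1], v[0] + v[2], v[1] + v[3]};
                }
                for (int i = 0; i + 1 < v.length; i += 2) {
                    if (v[i] < x0 || v[i] > x1 || v[i + 1] < y0 || v[i + 1] > y1) {
                        allInside = false;
                    }
                }
                path.addAll(operands);
                path.add(token);
            } else if (PAINT.contains(token) && !path.isEmpty()) {
                path.addAll(operands);
                path.add(token);
                if (!allInside) {
                    out.addAll(path);              // keep paths that leave the box
                }
                path.clear();
                allInside = true;
            } else {
                // Any other operator: keep it with the path if one is open, else pass through.
                List<String> target = path.isEmpty() ? out : path;
                target.addAll(operands);
                target.add(token);
            }
            operands.clear();
        }
        out.addAll(path);                          // unfinished path, kept as-is
        out.addAll(operands);
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String content = "0 0 m 100 100 l S 738.7469 167.1278 m 733.8743 167.1278 l S";
        System.out.println(removePathsInside(content, 730, 160, 740, 170));
    }
}
```

This removes everything drawn in the region, so it only helps when the text does not overlap graphics that should stay.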


>
> Best Regards,
>
> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun wrote:
>
>> Hi,
>>
>>> On 24.03.2015 at 09:55, a7med shre3y wrote:
>>>
>>> You can download it from here:
>>>
>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
>>
>> Looking more closely, you correctly replaced the text, but that text was
>> in there for searching within the PDF, as it used text rendering mode 3
>> (invisible). The 'text' you are still seeing is drawn using vector
>> commands, so it's graphics content.
>>
>> BR
>> Maruan
>>
>>> Best Regards,
>>>
>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun wrote:
>>>
>>>>> On 24.03.2015 at 09:40, a7med shre3y wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> In fact PDFBox calls the operation of transforming "7R %H $SSURYHG"
>>>>> to "To Be Approved" "encoding". Anyway, whether it's encoding or
>>>>> decoding, I thought it was easier to transform "7R %H $SSURYHG" to
>>>>> "To Be Approved" and not the opposite (or at least I don't know). I
>>>>> spent quite a long time trying to find out how to find the character
>>>>> codes for the glyphs in the currently used font, then I found that
>>>>> it's not an easy task. By the way, if you know how to do that, I'd
>>>>> very much appreciate it, because I need that for replacing text with
>>>>> other text, and for that the new text must be encoded the same way as
>>>>> the original!
>>>>>
>>>>> Back to the text removal: I am able to find the text and also remove
>>>>> it by calling reset, as I mentioned in my first email. When I print
>>>>> the output content I don't find the text anymore, but I still see it
>>>>> when I open the file. My first assumption was that there must be some
>>>>> other way to remove the text than the way I am using, and that's what
>>>>> you've actually confirmed in your reply, so could you please tell me
>>>>> what is still missing?
>>>>
>>>> Could you upload the PDF with the reset text too?
>>>>
>>>> BR
>>>> Maruan
>>>>
>>>>> Thanks and regards,
>>>>> a7mad
>>>>>
>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun
>>>>> <sahy...@fileaffairs.de> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> On 24.03.2015 at 08:14, a7med shre3y wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Here's how I do it:
>>>>>>>
>>>>>>> 1. I use the following method to encode the text:
>>>>>>>
>>>>>>> String encode(String text, PDFont font) throws Exception {
>>>>>>>     StringBuilder builder = new StringBuilder();
>>>>>>>     byte[] stringBytes = text.getBytes();
>>>>>>>     int codeLength = 1;
>>>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
>>>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
>>>>>>>             codeLength++;
>>>>>>>             c = font.encode(stringBytes, i, codeLength);
>>>>>>>         }
>>>>>>>         builder.append(c);
>>>>>>>     }
>>>>>>>     return builder.toString();
>>>>>>> }
>>>>>>>
>>>>>>> 2. Iterating through the tokens, I find the text, which is either a
>>>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then check
>>>>>>> whether it's the text I want to remove, as follows:
>>>>>>>
>>>>>>> if (op.getOperation().equals("Tj")) {
>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>>>>>     String string = previous.getString();
>>>>>>>     String encodedString = encode(string, font);
>>>>>>
>>>>>> That string is already encoded. So you'd need to encode "To Be
>>>>>> Approved" and compare whether that matches the string you are
>>>>>> reading from the PDF.
>>>>>>
>>>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>>>         previous.reset();
>>>>>>>     }
>>>>>>> } else if (op.getOperation().equals("TJ")) {
>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
>>>>>>>     for (int k = 0; k < previous.size(); k++) {
>>>>>>>

Re: Text removal

2015-03-24 Thread a7med shre3y
That's true. I had even tried changing the text rendering mode to other
values, as described in table 5.3 of the PDF 1.5 specification, before
removing the text, and that didn't work either.
So how do I remove the graphics content, then?

Best Regards,

On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun 
wrote:

> Hi,
>
> > On 24.03.2015 at 09:55, a7med shre3y wrote:
> >
> > You can download it from here:
> >
> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
> >
>
> looking more closely you correctly replaced the text, but that text was in
> there for searching within the PDF as it used text rendering mode 3
> (invisible). The 'text' you are still seeing is drawn using vector commands
> so it's graphics content.
>
> BR
> Maruan
>
>
> > Best Regards,
> >
> >
> > On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun 
> > wrote:
> >
> >>
> >>
> >>> On 24.03.2015 at 09:40, a7med shre3y wrote:
> >>>
> >>> Hi,
> >>>
> >>> In fact PDFBox call the operation of transforming "7R %H $SSURYHG" to
> "To
> >>> Be Approved" as "encoding". Anyway, either it's encoding or decoding, I
> >>> thought it's easier to transform "7R %H $SSURYHG" to "To Be Approved"
> and
> >>> not the opposite (or at least I don't know). I spent some quite long
> time
> >>> trying to find out how to find the character codes for the glyphs in
> the
> >>> currently used font, then I found that it's not an easy task. By the
> way,
> >>> if you know how to do that, I'd so much appreciate it because I need
> that
> >>> for replacing text with another text and for that the new text must be
> >>> encoded the same way as the original!
> >>>
> >>> Back to the text removal, I am able to find the text and also remove it
> >> by
> >>> calling reset, as I mentioned in my first email, when I print the
> output
> >>> content I don't find the text anymore but I still see it when I open
> the
> >>> file. My first assumption was that there must be some other way to
> remove
> >>> the text other than the way I am using, and that's what you've actually
> >>> confirmed in your reply, so could you please tell me what still
> missing?
> >>>
> >>
> >> Could you upload the PDF with the reset text too?
> >>
> >> BR
> >> Maruan
> >>
> >>
> >>> Thanks and regards,
> >>> a7mad
> >>>
> >>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <
> sahy...@fileaffairs.de>
> >>> wrote:
> >>>
>  Hi,
> 
> > On 24.03.2015 at 08:14, a7med shre3y wrote:
> >
> > Hi,
> >
> > Here's how I do it:
> >
> > 1. I use the following method to encode the text:
> >
> > String encode(String text, PDFont font) throws Exception {
> >  StringBuilder builder = new StringBuilder();
> >  byte[] stringBytes = text.getBytes();
> >  int codeLength = 1;
> >  for(int i = 0; i < stringBytes.length; i += codeLength){
> >  String c = font.encode(stringBytes, i, codeLength);
> >  if(c == null && (i + 1 < stringBytes.length)){
> >  codeLength++;
> >  c = font.encode(stringBytes, i, codeLength);
> >  }
> >  builder.append(c);
> >  }
> >  return builder.toString();
> >  }
> >
> > 2. Iterating through the tokens, I find the text either it's a
> >> COSString
> > ("Tj" operator) or a COSArray ("TJ" operator) then check if it's the
> >> text
> > I'm looking for to remove as following:
> >
> > if (op.getOperation().equals("Tj")) {
> >  COSString previous = (COSString)
> tokens.get(j
>  -
> > 1);
> >  String string = previous.getString();
> >  String encodedString = encode(string, font);
> 
>  that string is already encoded. So you'd need to encode "To Be
> Approved"
>  and compare if that matches the string you are reading from the PDF.
> 
> >  if(encodedString.contains("To Be
> Approved")){
> >  previous.reset();
> >  }
> >  } else if (op.getOperation().equals("TJ")) {
> >  COSArray previous = (COSArray) tokens.get(j
> -
> > 1);
> >  StringBuilder stringBuilder = new
> > StringBuilder();
> >  for (int k = 0; k < previous.size(); k++) {
> >  Object arrElement =
> >> previous.getObject(k);
> >  if (arrElement instanceof COSString) {
> >  COSString cosString = (COSString)
> > arrElement;
> >
> > stringBuilder.append(cosString.getString());
> >  }
> >  }
> >  String string = stringBuilder.toString();
> >  String encodedString = encode(string, font);
> >  

Re: Text removal

2015-03-24 Thread Maruan Sahyoun
Hi,

> On 24.03.2015 at 09:55, a7med shre3y wrote:
> 
> You can download it from here:
> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
> 

Looking more closely, you did replace the text correctly, but that text was
only there to make the PDF searchable: it uses text rendering mode 3
(invisible). The 'text' you still see is drawn with vector commands, so
it's graphics content.
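
One way to check for this situation is to walk the content stream and keep
track of the text rendering mode set by the Tr operator. The following is
only a minimal sketch under stated assumptions: it does not use PDFBox at
all, it works on a made-up content stream snippet, its tokenizer ignores
escapes, nested parentheses, hex strings and arrays, and the class and
method names (InvisibleTextScan, tokenize, invisibleStrings) are
hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class InvisibleTextScan {

    /** Splits a content stream into tokens, keeping "(...)" literal strings whole.
     *  Simplification: no nested parentheses, escapes, hex strings or arrays. */
    static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char ch = content.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++;
            } else if (ch == '(') {
                int end = content.indexOf(')', i);
                if (end < 0) {
                    end = content.length() - 1; // unterminated string: take the rest
                }
                tokens.add(content.substring(i, end + 1));
                i = end + 1;
            } else {
                int start = i;
                while (i < content.length()
                        && !Character.isWhitespace(content.charAt(i))
                        && content.charAt(i) != '(') {
                    i++;
                }
                tokens.add(content.substring(start, i));
            }
        }
        return tokens;
    }

    /** Returns the operand of every Tj that is shown while rendering mode 3 is active. */
    static List<String> invisibleStrings(String content) {
        List<String> tokens = tokenize(content);
        List<String> result = new ArrayList<>();
        Deque<Integer> saved = new ArrayDeque<>(); // modes saved by q, restored by Q
        int mode = 0;                              // default rendering mode: fill
        for (int j = 0; j < tokens.size(); j++) {
            String op = tokens.get(j);
            if (op.equals("q")) {
                saved.push(mode);
            } else if (op.equals("Q") && !saved.isEmpty()) {
                mode = saved.pop();
            } else if (op.equals("Tr") && j > 0) {
                mode = Integer.parseInt(tokens.get(j - 1));
            } else if (op.equals("Tj") && j > 0 && mode == 3) {
                result.add(tokens.get(j - 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical content stream: one invisible and one visible string.
        String content = "BT /F1 12 Tf 3 Tr (7R %H $SSURYHG) Tj ET "
                       + "BT /F1 12 Tf 0 Tr (visible) Tj ET";
        System.out.println(invisibleStrings(content)); // prints [(7R %H $SSURYHG)]
    }
}
```

If the string being removed shows up in such a list, what remains visible
on the page has to come from somewhere else, for example from path
drawing operators or an image.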

BR
Maruan


> Best Regards,
> 
> 
> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun 
> wrote:
> 
>> 
>> 
>>> On 24.03.2015 at 09:40, a7med shre3y wrote:
>>> 
>>> Hi,
>>> 
>>> In fact PDFBox call the operation of transforming "7R %H $SSURYHG" to "To
>>> Be Approved" as "encoding". Anyway, either it's encoding or decoding, I
>>> thought it's easier to transform "7R %H $SSURYHG" to "To Be Approved" and
>>> not the opposite (or at least I don't know). I spent some quite long time
>>> trying to find out how to find the character codes for the glyphs in the
>>> currently used font, then I found that it's not an easy task. By the way,
>>> if you know how to do that, I'd so much appreciate it because I need that
>>> for replacing text with another text and for that the new text must be
>>> encoded the same way as the original!
>>> 
>>> Back to the text removal, I am able to find the text and also remove it
>> by
>>> calling reset, as I mentioned in my first email, when I print the output
>>> content I don't find the text anymore but I still see it when I open the
>>> file. My first assumption was that there must be some other way to remove
>>> the text other than the way I am using, and that's what you've actually
>>> confirmed in your reply, so could you please tell me what still missing?
>>> 
>> 
>> Could you upload the PDF with the reset text too?
>> 
>> BR
>> Maruan
>> 
>> 
>>> Thanks and regards,
>>> a7mad
>>> 
>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun 
>>> wrote:
>>> 
 Hi,
 
> On 24.03.2015 at 08:14, a7med shre3y wrote:
> 
> Hi,
> 
> Here's how I do it:
> 
> 1. I use the following method to encode the text:
> 
> String encode(String text, PDFont font) throws Exception {
>  StringBuilder builder = new StringBuilder();
>  byte[] stringBytes = text.getBytes();
>  int codeLength = 1;
>  for(int i = 0; i < stringBytes.length; i += codeLength){
>  String c = font.encode(stringBytes, i, codeLength);
>  if(c == null && (i + 1 < stringBytes.length)){
>  codeLength++;
>  c = font.encode(stringBytes, i, codeLength);
>  }
>  builder.append(c);
>  }
>  return builder.toString();
>  }
> 
> 2. Iterating through the tokens, I find the text either it's a
>> COSString
> ("Tj" operator) or a COSArray ("TJ" operator) then check if it's the
>> text
> I'm looking for to remove as following:
> 
> if (op.getOperation().equals("Tj")) {
>  COSString previous = (COSString) tokens.get(j
 -
> 1);
>  String string = previous.getString();
>  String encodedString = encode(string, font);
 
 that string is already encoded. So you'd need to encode "To Be Approved"
 and compare if that matches the string you are reading from the PDF.
 
>  if(encodedString.contains("To Be Approved")){
>  previous.reset();
>  }
>  } else if (op.getOperation().equals("TJ")) {
>  COSArray previous = (COSArray) tokens.get(j -
> 1);
>  StringBuilder stringBuilder = new
> StringBuilder();
>  for (int k = 0; k < previous.size(); k++) {
>  Object arrElement =
>> previous.getObject(k);
>  if (arrElement instanceof COSString) {
>  COSString cosString = (COSString)
> arrElement;
> 
> stringBuilder.append(cosString.getString());
>  }
>  }
>  String string = stringBuilder.toString();
>  String encodedString = encode(string, font);
>  if(encodedString.contains("To Be Approved")){
>  previous.clear();
>  }
>  }
> 
> Note:
> In case of COSArray, I first iterate through the whole array to get the
> whole string before encoding and comparison and this works.
> 
> Best Regards,
> a7mad
> 
> 
> 
> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <
>> sahy...@fileaffairs.de
> 
> wrote:
> 
>> Hi,
>> 
>> your text is encoded s

Re: Text removal

2015-03-24 Thread a7med shre3y
You can download it from here:
https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing

Best Regards,


On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun 
wrote:

>
>
> > On 24.03.2015 at 09:40, a7med shre3y wrote:
> >
> > Hi,
> >
> > In fact PDFBox call the operation of transforming "7R %H $SSURYHG" to "To
> > Be Approved" as "encoding". Anyway, either it's encoding or decoding, I
> > thought it's easier to transform "7R %H $SSURYHG" to "To Be Approved" and
> > not the opposite (or at least I don't know). I spent some quite long time
> > trying to find out how to find the character codes for the glyphs in the
> > currently used font, then I found that it's not an easy task. By the way,
> > if you know how to do that, I'd so much appreciate it because I need that
> > for replacing text with another text and for that the new text must be
> > encoded the same way as the original!
> >
> > Back to the text removal, I am able to find the text and also remove it
> by
> > calling reset, as I mentioned in my first email, when I print the output
> > content I don't find the text anymore but I still see it when I open the
> > file. My first assumption was that there must be some other way to remove
> > the text other than the way I am using, and that's what you've actually
> > confirmed in your reply, so could you please tell me what still missing?
> >
>
> Could you upload the PDF with the reset text too?
>
> BR
> Maruan
>
>
> > Thanks and regards,
> > a7mad
> >
> > On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun 
> > wrote:
> >
> >> Hi,
> >>
> >>> On 24.03.2015 at 08:14, a7med shre3y wrote:
> >>>
> >>> Hi,
> >>>
> >>> Here's how I do it:
> >>>
> >>> 1. I use the following method to encode the text:
> >>>
> >>> String encode(String text, PDFont font) throws Exception {
> >>>   StringBuilder builder = new StringBuilder();
> >>>   byte[] stringBytes = text.getBytes();
> >>>   int codeLength = 1;
> >>>   for(int i = 0; i < stringBytes.length; i += codeLength){
> >>>   String c = font.encode(stringBytes, i, codeLength);
> >>>   if(c == null && (i + 1 < stringBytes.length)){
> >>>   codeLength++;
> >>>   c = font.encode(stringBytes, i, codeLength);
> >>>   }
> >>>   builder.append(c);
> >>>   }
> >>>   return builder.toString();
> >>>   }
> >>>
> >>> 2. Iterating through the tokens, I find the text either it's a
> COSString
> >>> ("Tj" operator) or a COSArray ("TJ" operator) then check if it's the
> text
> >>> I'm looking for to remove as following:
> >>>
> >>> if (op.getOperation().equals("Tj")) {
> >>>   COSString previous = (COSString) tokens.get(j
> >> -
> >>> 1);
> >>>   String string = previous.getString();
> >>>   String encodedString = encode(string, font);
> >>
> >> that string is already encoded. So you'd need to encode "To Be Approved"
> >> and compare if that matches the string you are reading from the PDF.
> >>
> >>>   if(encodedString.contains("To Be Approved")){
> >>>   previous.reset();
> >>>   }
> >>>   } else if (op.getOperation().equals("TJ")) {
> >>>   COSArray previous = (COSArray) tokens.get(j -
> >>> 1);
> >>>   StringBuilder stringBuilder = new
> >>> StringBuilder();
> >>>   for (int k = 0; k < previous.size(); k++) {
> >>>   Object arrElement =
> previous.getObject(k);
> >>>   if (arrElement instanceof COSString) {
> >>>   COSString cosString = (COSString)
> >>> arrElement;
> >>>
> >>> stringBuilder.append(cosString.getString());
> >>>   }
> >>>   }
> >>>   String string = stringBuilder.toString();
> >>>   String encodedString = encode(string, font);
> >>>   if(encodedString.contains("To Be Approved")){
> >>>   previous.clear();
> >>>   }
> >>>   }
> >>>
> >>> Note:
> >>> In case of COSArray, I first iterate through the whole array to get the
> >>> whole string before encoding and comparison and this works.
> >>>
> >>> Best Regards,
> >>> a7mad
> >>>
> >>>
> >>>
> >>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <
> sahy...@fileaffairs.de
> >>>
> >>> wrote:
> >>>
>  Hi,
> 
>  your text is encoded so within the show text operator Tj the string is
> 
>  7R %H $SSURYHG
> 
>  You wrote that you encode your string to find it - what do you get?
> 
>  BR
>  Maruan
> 
> 
> 
> > On 23.03.2015 at 22:01, a7med shre3y wrote:
> >
> > Hi Maruan,
> >
> > Here's a link from where you can do

Re: Text removal

2015-03-24 Thread Maruan Sahyoun


> On 24.03.2015 at 09:40, a7med shre3y wrote:
> 
> Hi,
> 
> In fact PDFBox call the operation of transforming "7R %H $SSURYHG" to "To
> Be Approved" as "encoding". Anyway, either it's encoding or decoding, I
> thought it's easier to transform "7R %H $SSURYHG" to "To Be Approved" and
> not the opposite (or at least I don't know). I spent some quite long time
> trying to find out how to find the character codes for the glyphs in the
> currently used font, then I found that it's not an easy task. By the way,
> if you know how to do that, I'd so much appreciate it because I need that
> for replacing text with another text and for that the new text must be
> encoded the same way as the original!
> 
> Back to the text removal, I am able to find the text and also remove it by
> calling reset, as I mentioned in my first email, when I print the output
> content I don't find the text anymore but I still see it when I open the
> file. My first assumption was that there must be some other way to remove
> the text other than the way I am using, and that's what you've actually
> confirmed in your reply, so could you please tell me what still missing?
> 

Could you upload the PDF with the reset text too?

BR
Maruan


> Thanks and regards,
> a7mad
> 
> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun 
> wrote:
> 
>> Hi,
>> 
>>> On 24.03.2015 at 08:14, a7med shre3y wrote:
>>> 
>>> Hi,
>>> 
>>> Here's how I do it:
>>> 
>>> 1. I use the following method to encode the text:
>>> 
>>> String encode(String text, PDFont font) throws Exception {
>>>   StringBuilder builder = new StringBuilder();
>>>   byte[] stringBytes = text.getBytes();
>>>   int codeLength = 1;
>>>   for(int i = 0; i < stringBytes.length; i += codeLength){
>>>   String c = font.encode(stringBytes, i, codeLength);
>>>   if(c == null && (i + 1 < stringBytes.length)){
>>>   codeLength++;
>>>   c = font.encode(stringBytes, i, codeLength);
>>>   }
>>>   builder.append(c);
>>>   }
>>>   return builder.toString();
>>>   }
>>> 
>>> 2. Iterating through the tokens, I find the text either it's a COSString
>>> ("Tj" operator) or a COSArray ("TJ" operator) then check if it's the text
>>> I'm looking for to remove as following:
>>> 
>>> if (op.getOperation().equals("Tj")) {
>>>   COSString previous = (COSString) tokens.get(j
>> -
>>> 1);
>>>   String string = previous.getString();
>>>   String encodedString = encode(string, font);
>> 
>> that string is already encoded. So you'd need to encode "To Be Approved"
>> and compare if that matches the string you are reading from the PDF.
>> 
>>>   if(encodedString.contains("To Be Approved")){
>>>   previous.reset();
>>>   }
>>>   } else if (op.getOperation().equals("TJ")) {
>>>   COSArray previous = (COSArray) tokens.get(j -
>>> 1);
>>>   StringBuilder stringBuilder = new
>>> StringBuilder();
>>>   for (int k = 0; k < previous.size(); k++) {
>>>   Object arrElement = previous.getObject(k);
>>>   if (arrElement instanceof COSString) {
>>>   COSString cosString = (COSString)
>>> arrElement;
>>> 
>>> stringBuilder.append(cosString.getString());
>>>   }
>>>   }
>>>   String string = stringBuilder.toString();
>>>   String encodedString = encode(string, font);
>>>   if(encodedString.contains("To Be Approved")){
>>>   previous.clear();
>>>   }
>>>   }
>>> 
>>> Note:
>>> In case of COSArray, I first iterate through the whole array to get the
>>> whole string before encoding and comparison and this works.
>>> 
>>> Best Regards,
>>> a7mad
>>> 
>>> 
>>> 
>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun >> 
>>> wrote:
>>> 
 Hi,
 
 your text is encoded so within the show text operator Tj the string is
 
 7R %H $SSURYHG
 
 You wrote that you encode your string to find it - what do you get?
 
 BR
 Maruan
 
 
 
> On 23.03.2015 at 22:01, a7med shre3y wrote:
> 
> Hi Maruan,
> 
> Here's a link from where you can download the PDF.
> 
> 
 
>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> 
> Kind Regards,
> a7mad
> 
> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <
>> sahy...@fileaffairs.de>
> wrote:
> 
>> Hi,
>> 
>> you need to upload it to a public location as the mailing list doesn't
>> support attachments.
>> 
>> BR
>> Maruan

Re: Text removal

2015-03-24 Thread Andreas Lehmkühler
Hi,

> a7med shre3y wrote on 23 March 2015 at 15:03:
> 
> 
> Hi all,
> 
> Currently I am facing a strange problem removing text from the some PDFs.
> My program is able to find the text and "remove it" by calling the
> COSString.reset() method.
> The problem is, when I open the output PDF file, I still see the text but
> not selectable (I mean when I try to highlight it with the mouse to copy
> it, it's not selectable!). When print the content (tokens) of the output
> file, I DO NOT find the text at all!!
> 
> I am currently stuck in the PDF specifications 1.5 and really running out
> of time.
> 
> I'd so much appreciate any help or any idea on what's going on.
> 
> Notes:
> 1. I use use PDFBox 1.7.1
1.7.1 is more than two years old (released in July 2012). I strongly
recommend using a more recent version, such as 1.8.8.

BR
Andreas Lehmkühler

> 2. This problem does not occur with all PDFs, only some PDFs cause this
> problem.
> 
> Thank you very much.
> a7mad

-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



Re: Text removal

2015-03-24 Thread a7med shre3y
Hi,

In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" into
"To Be Approved" "encoding". Anyway, whether it's called encoding or
decoding, I thought it would be easier to transform "7R %H $SSURYHG" into
"To Be Approved" than the other way round (or at least I don't know how to
do the opposite). I spent quite a long time trying to find the character
codes for the glyphs in the font currently in use, and found that it's not
an easy task. By the way, if you know how to do that, I'd very much
appreciate it, because I need it for replacing text with other text, and
for that the new text must be encoded the same way as the original!
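
To make the two directions concrete, here is a small sketch that treats
the mapping as a plain lookup table. It is not PDFBox code: the class
CodeTable and its methods are hypothetical, it assumes one printable
character per code exactly as the strings are quoted in this thread, and
it learns the table only from that one sample pair. In a real file the
mapping comes from the font's encoding or ToUnicode CMap, and the codes
may be multi-byte. In the quoted sample every visible character happens to
sit 29 code points below its readable counterpart ('T' is 84, '7' is 55).

```java
import java.util.HashMap;
import java.util.Map;

class CodeTable {
    private final Map<Character, Character> codeToText = new HashMap<>();
    private final Map<Character, Character> textToCode = new HashMap<>();

    /** Learns the mapping from one known pair of raw and readable strings. */
    CodeTable(String raw, String readable) {
        if (raw.length() != readable.length()) {
            throw new IllegalArgumentException("strings must have the same length");
        }
        for (int i = 0; i < raw.length(); i++) {
            codeToText.put(raw.charAt(i), readable.charAt(i));
            textToCode.put(readable.charAt(i), raw.charAt(i));
        }
    }

    /** Raw string from the content stream -> readable text ('?' for unknown codes). */
    String decode(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char code : raw.toCharArray()) {
            Character ch = codeToText.get(code);
            sb.append(ch != null ? ch : '?');
        }
        return sb.toString();
    }

    /** Readable text -> raw string, or null if some character has no known code. */
    String encode(String text) {
        StringBuilder sb = new StringBuilder();
        for (char ch : text.toCharArray()) {
            Character code = textToCode.get(ch);
            if (code == null) {
                return null; // the table (or a subset font) lacks this character
            }
            sb.append(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CodeTable table = new CodeTable("7R %H $SSURYHG", "To Be Approved");
        System.out.println(table.decode("7R %H $SSURYHG")); // To Be Approved
        System.out.println(table.encode("Approved"));       // $SSURYHG
        System.out.println(table.encode("Rejected"));       // null
    }
}
```

The null result in the last call is the interesting case for replacement:
if the font in the PDF only carries the glyphs that were actually used,
new text can contain characters for which no code exists at all.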

Back to the text removal: I am able to find the text and remove it by
calling reset. As I mentioned in my first email, when I print the content
of the output file I no longer find the text, but I still see it when I
open the file. My first assumption was that there must be some other way to
remove the text than the one I am using, and that's what you actually
confirmed in your reply, so could you please tell me what is still missing?

Thanks and regards,
a7mad

On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun 
wrote:

> Hi,
>
> > On 24.03.2015 at 08:14, a7med shre3y wrote:
> >
> > Hi,
> >
> > Here's how I do it:
> >
> > 1. I use the following method to encode the text:
> >
> > String encode(String text, PDFont font) throws Exception {
> >StringBuilder builder = new StringBuilder();
> >byte[] stringBytes = text.getBytes();
> >int codeLength = 1;
> >for(int i = 0; i < stringBytes.length; i += codeLength){
> >String c = font.encode(stringBytes, i, codeLength);
> >if(c == null && (i + 1 < stringBytes.length)){
> >codeLength++;
> >c = font.encode(stringBytes, i, codeLength);
> >}
> >builder.append(c);
> >}
> >return builder.toString();
> >}
> >
> > 2. Iterating through the tokens, I find the text either it's a COSString
> > ("Tj" operator) or a COSArray ("TJ" operator) then check if it's the text
> > I'm looking for to remove as following:
> >
> > if (op.getOperation().equals("Tj")) {
> >COSString previous = (COSString) tokens.get(j
> -
> > 1);
> >String string = previous.getString();
> >String encodedString = encode(string, font);
>
> that string is already encoded. So you'd need to encode "To Be Approved"
> and compare if that matches the string you are reading from the PDF.
>
> >if(encodedString.contains("To Be Approved")){
> >previous.reset();
> >}
> >} else if (op.getOperation().equals("TJ")) {
> >COSArray previous = (COSArray) tokens.get(j -
> > 1);
> >StringBuilder stringBuilder = new
> > StringBuilder();
> >for (int k = 0; k < previous.size(); k++) {
> >Object arrElement = previous.getObject(k);
> >if (arrElement instanceof COSString) {
> >COSString cosString = (COSString)
> > arrElement;
> >
> > stringBuilder.append(cosString.getString());
> >}
> >}
> >String string = stringBuilder.toString();
> >String encodedString = encode(string, font);
> >if(encodedString.contains("To Be Approved")){
> >previous.clear();
> >}
> >}
> >
> > Note:
> > In case of COSArray, I first iterate through the whole array to get the
> > whole string before encoding and comparison and this works.
> >
> > Best Regards,
> > a7mad
> >
> >
> >
> > On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun  >
> > wrote:
> >
> >> Hi,
> >>
> >> your text is encoded so within the show text operator Tj the string is
> >>
> >> 7R %H $SSURYHG
> >>
> >> You wrote that you encode your string to find it - what do you get?
> >>
> >> BR
> >> Maruan
> >>
> >>
> >>
> >>> On 23.03.2015 at 22:01, a7med shre3y wrote:
> >>>
> >>> Hi Maruan,
> >>>
> >>> Here's a link from where you can download the PDF.
> >>>
> >>>
> >>
> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>
> >>> Kind Regards,
> >>> a7mad
> >>>
> >>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <
> sahy...@fileaffairs.de>
> >>> wrote:
> >>>
>  Hi,
> 
>  you need to upload it to a public location as the mailing list doesn't
>  support attachments.
> 
>  BR
>  Maruan
> 
> > On 23.03.2015 at 19:18, a7med shre3y wrote:
> >
> > Dear Maruan,
> >
> > Thank you very much for the information. Please find herewith
> att

Re: Text removal

2015-03-24 Thread Maruan Sahyoun
Hi,

> On 24.03.2015 at 08:14, a7med shre3y wrote:
> 
> Hi,
> 
> Here's how I do it:
> 
> 1. I use the following method to encode the text:
> 
> String encode(String text, PDFont font) throws Exception {
>StringBuilder builder = new StringBuilder();
>byte[] stringBytes = text.getBytes();
>int codeLength = 1;
>for(int i = 0; i < stringBytes.length; i += codeLength){
>String c = font.encode(stringBytes, i, codeLength);
>if(c == null && (i + 1 < stringBytes.length)){
>codeLength++;
>c = font.encode(stringBytes, i, codeLength);
>}
>builder.append(c);
>}
>return builder.toString();
>}
> 
> 2. Iterating through the tokens, I find the text either it's a COSString
> ("Tj" operator) or a COSArray ("TJ" operator) then check if it's the text
> I'm looking for to remove as following:
> 
> if (op.getOperation().equals("Tj")) {
>COSString previous = (COSString) tokens.get(j -
> 1);
>String string = previous.getString();
>String encodedString = encode(string, font);

That string is already encoded. So you need to encode "To Be Approved" and
check whether the result matches the string you read from the PDF.

>if(encodedString.contains("To Be Approved")){
>previous.reset();
>}
>} else if (op.getOperation().equals("TJ")) {
>COSArray previous = (COSArray) tokens.get(j -
> 1);
>StringBuilder stringBuilder = new
> StringBuilder();
>for (int k = 0; k < previous.size(); k++) {
>Object arrElement = previous.getObject(k);
>if (arrElement instanceof COSString) {
>COSString cosString = (COSString)
> arrElement;
> 
> stringBuilder.append(cosString.getString());
>}
>}
>String string = stringBuilder.toString();
>String encodedString = encode(string, font);
>if(encodedString.contains("To Be Approved")){
>previous.clear();
>}
>}
> 
> Note:
> In case of COSArray, I first iterate through the whole array to get the
> whole string before encoding and comparison and this works.
> 
> Best Regards,
> a7mad
> 
> 
> 
> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun 
> wrote:
> 
>> Hi,
>> 
>> your text is encoded so within the show text operator Tj the string is
>> 
>> 7R %H $SSURYHG
>> 
>> You wrote that you encode your string to find it - what do you get?
>> 
>> BR
>> Maruan
>> 
>> 
>> 
>>> On 23.03.2015 at 22:01, a7med shre3y wrote:
>>> 
>>> Hi Maruan,
>>> 
>>> Here's a link from where you can download the PDF.
>>> 
>>> 
>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>> 
>>> Kind Regards,
>>> a7mad
>>> 
>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun 
>>> wrote:
>>> 
 Hi,
 
 you need to upload it to a public location as the mailing list doesn't
 support attachments.
 
 BR
 Maruan
 
> On 23.03.2015 at 19:18, a7med shre3y wrote:
> 
> Dear Maruan,
> 
> Thank you very much for the information. Please find herewith attached
 the PDF to reproduce the problem.
> The text to remove is: "To Be Approved". The text has a multi-byte
 encoding, so I call first to encode it in order to find it then remove
>> it.
> 
> Best Regards,
> a7mad
> 
>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <
>> sahy...@fileaffairs.de>
 wrote:
>> Dear a7mad,
>> 
>> removing text from a PDF is not an easy task, as
>> - text which visually appears as a single item might consist of
>>   individual parts within the PDF itself, e.g. each character or group
>>   of characters is placed individually in a different COSString
>> - text might be drawn using graphics commands
>> - text can appear in different parts of the PDF (e.g. the text might be
>>   the content of a form field AND of the annotation representing the
>>   form field visually)
>> - you need to look up the encoding information to get from the
>>   characters in the PDF "string" to the ones you are looking for
>> ….
>> 
>> If you can post a specific PDF to a public location and describe in
>> detail which string should have been replaced but wasn't, I will be able
>> to tell you why that might have happened.
>> 
>> Maruan
>> 
>> 
>>> On 23.03.2015 at 15:03, a7med shre3y wrote:
>>> 
>>> Hi all,
>>> 
>>> Currently I 

Re: Text removal

2015-03-24 Thread a7med shre3y
Hi,

Here's how I do it:

1. I use the following method to encode the text:

String encode(String text, PDFont font) throws Exception {
    StringBuilder builder = new StringBuilder();
    byte[] stringBytes = text.getBytes();
    int codeLength = 1;
    for (int i = 0; i < stringBytes.length; i += codeLength) {
        String c = font.encode(stringBytes, i, codeLength);
        if (c == null && (i + 1 < stringBytes.length)) {
            codeLength++;
            c = font.encode(stringBytes, i, codeLength);
        }
        builder.append(c);
    }
    return builder.toString();
}

2. Iterating through the tokens, I find the text, which is either a
COSString ("Tj" operator) or a COSArray ("TJ" operator), then check whether
it's the text I want to remove, as follows:

if (op.getOperation().equals("Tj")) {
    COSString previous = (COSString) tokens.get(j - 1);
    String string = previous.getString();
    String encodedString = encode(string, font);
    if (encodedString.contains("To Be Approved")) {
        previous.reset();
    }
} else if (op.getOperation().equals("TJ")) {
    COSArray previous = (COSArray) tokens.get(j - 1);
    StringBuilder stringBuilder = new StringBuilder();
    for (int k = 0; k < previous.size(); k++) {
        Object arrElement = previous.getObject(k);
        if (arrElement instanceof COSString) {
            COSString cosString = (COSString) arrElement;
            stringBuilder.append(cosString.getString());
        }
    }
    String string = stringBuilder.toString();
    String encodedString = encode(string, font);
    if (encodedString.contains("To Be Approved")) {
        previous.clear();
    }
}

Note:
In the case of a COSArray, I first iterate through the whole array to build
the complete string before encoding and comparing, and this works.
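
The array handling described in this note can be tried out without a PDF
at hand. The sketch below is an illustration only: it models the TJ
operand as a plain java.util.List holding strings and kerning numbers
instead of a PDFBox COSArray, the class and method names (TjArrayJoin,
joinStrings, clearIfContains) are made up, and the comparison is done
against an already-encoded target, as the raw strings in this thread
suggest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TjArrayJoin {

    /** Joins the string elements of a TJ operand array, skipping kerning numbers. */
    static String joinStrings(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    /** Empties the array if its joined text contains the already-encoded target. */
    static boolean clearIfContains(List<Object> tjArray, String encodedTarget) {
        if (joinStrings(tjArray).contains(encodedTarget)) {
            tjArray.clear();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical TJ operand: text split into pieces with kerning in between.
        List<Object> operand = new ArrayList<>(
                Arrays.<Object>asList("7R ", -20, "%H ", 15.5, "$SSURYHG"));
        System.out.println(joinStrings(operand));                       // 7R %H $SSURYHG
        System.out.println(clearIfContains(operand, "7R %H $SSURYHG")); // true
        System.out.println(operand.isEmpty());                          // true
    }
}
```

Clearing the whole array also drops any other text that shares the same TJ
operand, so this only fits cases where the target makes up the entire
array.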

Best Regards,
a7mad



On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun 
wrote:

> Hi,
>
> your text is encoded so within the show text operator Tj the string is
>
> 7R %H $SSURYHG
>
> You wrote that you encode your string to find it - what do you get?
>
> BR
> Maruan
>
>
>
> > On 23.03.2015 at 22:01, a7med shre3y wrote:
> >
> > Hi Maruan,
> >
> > Here's a link from where you can download the PDF.
> >
> >
> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >
> > Kind Regards,
> > a7mad
> >
> > On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun 
> > wrote:
> >
> >> Hi,
> >>
> >> you need to upload it to a public location as the mailing list doesn't
> >> support attachments.
> >>
> >> BR
> >> Maruan
> >>
> >>> On 23.03.2015 at 19:18, a7med shre3y wrote:
> >>>
> >>> Dear Maruan,
> >>>
> >>> Thank you very much for the information. Please find herewith attached
> >> the PDF to reproduce the problem.
> >>> The text to remove is: "To Be Approved". The text has a multi-byte
> >> encoding, so I call first to encode it in order to find it then remove
> it.
> >>>
> >>> Best Regards,
> >>> a7mad
> >>>
>  On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <
> sahy...@fileaffairs.de>
> >> wrote:
>  Dear a7mad,
> 
>  removing text from a PDF is not an easy task as
>  - text which might visually appear as a single item might consistent
> of
> >> individual parts within the PDF itself e.g. each character or groups of
> >> characters are place individually in different COSStrings
>  - text might be drawn using graphics commands
>  - text can appear within different parts of the PDF (e.g. the text
> >> might be content of a form field AND the annotation representing the
> form
> >> field visually)
>  - you need to look up the encoding information to get form the
> >> characters in the PDF "string" to the ones you are looking for
>  ….
> 
>  If you can post a specific PDF to a public location and describe in
> >> detail which string should have been replaced which hasn't I will be
> able
> >> to tell you why that might have happened.
> 
>  Maruan
> 
> 
> > On 23.03.2015 at 15:03, a7med shre3y wrote:
> >
> > Hi all,
> >
> > Currently I am facing a strange problem removing text from the some
> >> PDFs.
> > My program is able to find the text and "remove it" by calling the
> > COSString.reset() method.
> > The problem is, when I open the output PDF file, I still see the text
> >> but
> > not selectable (I mean when I try to highlight it with the mouse to
> >> cop