Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj

2022-04-12 Thread Alex



Hi Matthew,


  Thanks for your reply. Part of the content of my PDF file is shown below,
and the PDF file itself is attached. Please have a look.


%PDF-1.5
%档档
1 0 obj
<>>>
endobj
2 0 obj
<>
endobj
3 0 obj
<>/ExtGState<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.32 841.92] 
/Contents 4 0 
R/Group<>/Tabs/S/StructParents 0>>
endobj
4 0 obj
<>
stream
[compressed binary stream data omitted]
endstream
endobj
5 0 obj
<>
endobj
6 0 obj
[ 7 0 R] 
endobj
7 0 obj
<>
endobj
8 0 obj
<>
endobj
9 0 obj
<>
endobj
10 0 obj
<>
endobj








 Kind regards,

   Alex




At 2022-04-12 20:15:50, "Matthew Brincke via Podofo-users" wrote:

On Tuesday, 12.04.2022 at 18:34 +0800 Alex wrote:
Hi,


When I open a PDF file with podofobrowser.exe and a PDF object has a
stream, podofobrowser.exe shows the content of the stream as follows:




BT
/F2 10.56 Tf
1 0 0 1 136.46 758.28 Tm
0 g
0 G
[(pdf)] TJ
ET


BT
/F1 10.56 Tf
1 0 0 1 154.94 758.28 Tm
0 g
0 G
[<08CF372D>] TJ
ET




In the first BT object, I can easily tell that the text string is "pdf" from
[(pdf)] TJ, but [<08CF372D>] TJ in the second BT object is hard to
understand. Could someone tell me how to interpret [<08CF372D>] and what
encoding it uses?


Hi Alex,


the text is indeed given as a HexString, but its encoding is determined
by the font referenced by the name /F1 in your PDF snippet (operator
Tf). So if you'd like more information, I'd need the content of the PDF
object whose reference follows the name /F1 in the /Font entry of the
page's /Resources dictionary, and also the contents of the objects whose
references follow /Encoding and /FontDescriptor in that object (long
arrays can be abbreviated). Thanks in advance. I look forward to hearing
from you again.
 
 Thanks,

  Alex




Best regards, mabri

daochu.pdf
Description: Adobe PDF document
___
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users


Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj

2022-04-12 Thread Francesco Pretto
On Tue, 12 Apr 2022 at 18:09, Michal Sudolsky  wrote:
>
> Just note that the text position really does not depend on the "m" or "l"
> operators, as that code may misleadingly suggest (correct me if I am wrong):
>

You are 100% correct.




Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj

2022-04-12 Thread Michal Sudolsky
On Tue, Apr 12, 2022 at 5:24 PM Francesco Pretto  wrote:

> On Tue, 12 Apr 2022 at 14:50, zyx  wrote:
> > there exists a text extract tool [1], which is supposed to, well, extract
> > text from the PDF files.
> > [1]
> https://sourceforge.net/p/podofo/code/HEAD/tree/podofo/branches/PODOFO_0_9_7_BRANCH/tools/podofotxtextract/
> >
>
> Correct: although many text-related operators are not handled, that is
> the code to look at in PoDoFo.
>
>
Just note that the text position really does not depend on the "m" or "l"
operators, as that code may misleadingly suggest (correct me if I am wrong):

    // Excerpt from the podofotxtextract tool
    if( strcmp( pszToken, "l" ) == 0 ||
        strcmp( pszToken, "m" ) == 0 )
    {
        if( stack.size() == 2 )
        {
            dCurPosX = stack.top().GetReal();
            stack.pop();
            dCurPosY = stack.top().GetReal();
            stack.pop();
            // ... (excerpt ends here)
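
To illustrate the point, here is a minimal, self-contained sketch (not
PoDoFo code; the names TextPos and trackTextPosition are made up for this
example) in which only the text operators BT, Tm and Td/TD change the
tracked position, while the path operators "m" and "l" are ignored. It
assumes a whitespace-separated content stream and a text matrix of the
form [1 0 0 1 x y], as in the snippet from this thread:

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Origin of the current text line (hypothetical helper type).
struct TextPos {
    double x = 0.0;
    double y = 0.0;
};

// Walk a whitespace-separated content stream and track the text position.
// Only BT, Tm and Td/TD are interpreted. Every other operator, including
// the path operators "m" and "l", merely discards its operands.
// Simplification: Td/TD are applied as plain offsets, which is exact only
// while the text matrix is a pure translation ([1 0 0 1 x y]).
TextPos trackTextPosition(const std::string& content) {
    std::istringstream in(content);
    std::vector<double> operands;
    TextPos pos;
    std::string token;
    while (in >> token) {
        // Numeric tokens are operands: push them and read on.
        char* end = nullptr;
        const double value = std::strtod(token.c_str(), &end);
        if (end != token.c_str() && *end == '\0') {
            operands.push_back(value);
            continue;
        }
        // Names, arrays and strings are non-numeric operands: skip them.
        const char first = token[0];
        if (first == '/' || first == '[' || first == '(' || first == '<') {
            continue;
        }
        // Anything else is treated as an operator.
        if (token == "BT") {
            pos = TextPos{};  // a new text object starts at the origin
        } else if (token == "Tm" && operands.size() == 6) {
            pos.x = operands[4];  // translation part of the text matrix
            pos.y = operands[5];
        } else if ((token == "Td" || token == "TD") && operands.size() == 2) {
            pos.x += operands[0];  // offset relative to the current line
            pos.y += operands[1];
        }
        operands.clear();
    }
    return pos;
}
```

For the second BT block from the original question, this yields the
position (154.94, 758.28) taken from the Tm operands, regardless of any
"m" or "l" operators that precede it.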


> Cheers,
> Francesco
>
>


Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj

2022-04-12 Thread Francesco Pretto
On Tue, 12 Apr 2022 at 14:50, zyx  wrote:
> there exists a text extract tool [1], which is supposed to, well, extract
> text from the PDF files.
> [1] 
> https://sourceforge.net/p/podofo/code/HEAD/tree/podofo/branches/PODOFO_0_9_7_BRANCH/tools/podofotxtextract/
>

Correct: although many text-related operators are not handled, that is
the code to look at in PoDoFo.

Cheers,
Francesco




Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj

2022-04-12 Thread zyx
On Tue, 2022-04-12 at 14:16 +0200, Francesco Pretto wrote:
> It's a complex task and PoDoFo doesn't expose a high level API to
> perform such text extraction. Also the handling of the different
> predefined/custom encodings that the PDF standard allows to use or
> define is incomplete and sometimes buggy.

Hi,
while I cannot speak of the accuracy or completeness of the code, there
exists a text extract tool [1], which is supposed to, well, extract
text from the PDF files. It can give at least an idea of what to do
using the low level API of PoDoFo.
Bye,
zyx

[1] 
https://sourceforge.net/p/podofo/code/HEAD/tree/podofo/branches/PODOFO_0_9_7_BRANCH/tools/podofotxtextract/




Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj

2022-04-12 Thread Francesco Pretto
Hello,

<08CF372D> is a hexadecimal string, which is basically a hex-encoded
representation of a char/byte array. How this byte array is to be
interpreted is specified by the /Encoding key of the F1 font. The PDF
standard is optimized for drawing the glyphs that represent the text as
fast as possible. For this reason, the logical text often can't be
retrieved directly from the TJ/Tj operators and must be mapped to
Unicode code points using the font's /ToUnicode map. It's also possible
that the logical text can be reconstructed only from geometrical
considerations, such as finding chunks of the string that are close
together and lie geometrically on the same line. It's a complex task,
and PoDoFo doesn't expose a high-level API to perform such text
extraction. Also, the handling of the different predefined/custom
encodings that the PDF standard allows one to use or define is
incomplete and sometimes buggy. Work is under way to expose a new API
for text extraction, and it is working quite well. The API is expected
to be introduced first in pdfmm (a fork of PoDoFo), with a proposed plan
to merge it back into PoDoFo together with all the required enhancements
to the handling of PDF encodings.
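
As a rough sketch of the /ToUnicode step described above (plain C++, not
the PoDoFo or pdfmm API), the following maps two-byte character codes to
Unicode code points and emits UTF-8. The names ToUnicodeMap, appendUtf8
and codesToUtf8 are made up for this example. It assumes fixed two-byte
codes and one code point per code; a real /ToUnicode CMap can be more
general than that. The actual table for the F1 font in this thread is not
known, so any mapping passed in is a placeholder.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Append one Unicode code point to a UTF-8 encoded string.
void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical stand-in for a parsed /ToUnicode CMap:
// character code -> Unicode code point.
using ToUnicodeMap = std::map<std::uint16_t, char32_t>;

// Interpret the bytes as big-endian two-byte codes (an assumption) and
// look each one up. Codes missing from the map become U+FFFD.
std::string codesToUtf8(const std::vector<std::uint8_t>& bytes,
                        const ToUnicodeMap& toUnicode) {
    std::string text;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const std::uint16_t code =
            static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1]);
        const auto it = toUnicode.find(code);
        appendUtf8(text, it != toUnicode.end() ? it->second : char32_t{0xFFFD});
    }
    return text;
}
```

With the bytes 08 CF 37 2D from the question, the lookup would be done
for the codes 0x08CF and 0x372D, provided the font really uses two-byte
codes.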

Regards,
Francesco

On Tue, 12 Apr 2022 at 12:35, Alex  wrote:
>
> Hi,
>
> When I open a PDF file with podofobrowser.exe and a PDF object has a
> stream, podofobrowser.exe shows the content of the stream as follows:
>
> BT
> /F2 10.56 Tf
> 1 0 0 1 136.46 758.28 Tm
> 0 g
> 0 G
> [(pdf)] TJ
> ET
>
> BT
> /F1 10.56 Tf
> 1 0 0 1 154.94 758.28 Tm
> 0 g
> 0 G
> [<08CF372D>] TJ
> ET
>
> In the first BT object, I can easily tell that the text string is "pdf" from
> [(pdf)] TJ, but [<08CF372D>] TJ in the second BT object is hard to
> understand. Could someone tell me how to interpret [<08CF372D>] and what
> encoding it uses?
>
> Thanks,
> Alex


Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj

2022-04-12 Thread Matthew Brincke via Podofo-users
On Tuesday, 12.04.2022 at 18:34 +0800 Alex wrote:
> Hi,
> 
> When I open a PDF file with podofobrowser.exe and a PDF object has a
> stream, podofobrowser.exe shows the content of the stream as follows:
> 
> 
> BT
> /F2 10.56 Tf
> 1 0 0 1 136.46 758.28 Tm
> 0 g
> 0 G
> [(pdf)] TJ
> ET
> 
> BT
> /F1 10.56 Tf
> 1 0 0 1 154.94 758.28 Tm
> 0 g
> 0 G
> [<08CF372D>] TJ
> ET
> 
> 
> In the first BT object, I can easily tell that the text string is "pdf" from
> [(pdf)] TJ, but [<08CF372D>] TJ in the second BT object is hard to
> understand. Could someone tell me how to interpret [<08CF372D>] and what
> encoding it uses?

Hi Alex,

the text is indeed given as a HexString, but its encoding is determined
by the font referenced by the name /F1 in your PDF snippet (operator
Tf). So if you'd like more information, I'd need the content of the PDF
object whose reference follows the name /F1 in the /Font entry of the
page's /Resources dictionary, and also the contents of the objects whose
references follow /Encoding and /FontDescriptor in that object (long
arrays can be abbreviated). Thanks in advance. I look forward to hearing
from you again.
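
To make the dependence on the font concrete, here is a small sketch
(plain C++, not the PoDoFo API; decodeHexString and splitCodes are
made-up names) that decodes the hex string into bytes and then groups the
bytes into character codes. The code width is an assumption supplied by
the caller: one byte per code for a simple font, two bytes per code for a
composite font with a two-byte encoding such as Identity-H. Encodings
with variable-length codes are not covered.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode a PDF hexadecimal string such as "<08CF372D>" into bytes.
// Every character that is not a hex digit ('<', '>', white space) is
// skipped, which is more lenient than a strict parser would be. An odd
// final digit is padded with 0, as the PDF format prescribes.
std::vector<std::uint8_t> decodeHexString(const std::string& hex) {
    std::vector<std::uint8_t> bytes;
    int pending = -1;  // high nibble waiting for its partner, or -1
    for (unsigned char ch : hex) {
        if (!std::isxdigit(ch)) continue;
        const int nibble =
            std::isdigit(ch) ? ch - '0' : std::tolower(ch) - 'a' + 10;
        if (pending < 0) {
            pending = nibble;
        } else {
            bytes.push_back(static_cast<std::uint8_t>((pending << 4) | nibble));
            pending = -1;
        }
    }
    if (pending >= 0) {
        bytes.push_back(static_cast<std::uint8_t>(pending << 4));
    }
    return bytes;
}

// Group bytes into fixed-width, big-endian character codes.
// Trailing bytes that do not fill a whole code are dropped.
std::vector<std::uint32_t> splitCodes(const std::vector<std::uint8_t>& bytes,
                                      std::size_t bytesPerCode) {
    std::vector<std::uint32_t> codes;
    if (bytesPerCode == 0 || bytesPerCode > 4) return codes;
    for (std::size_t i = 0; i + bytesPerCode <= bytes.size(); i += bytesPerCode) {
        std::uint32_t code = 0;
        for (std::size_t j = 0; j < bytesPerCode; ++j) {
            code = (code << 8) | bytes[i + j];
        }
        codes.push_back(code);
    }
    return codes;
}
```

For <08CF372D> this gives the four codes 0x08, 0xCF, 0x37, 0x2D when read
with one byte per code, and the two codes 0x08CF, 0x372D when read with
two bytes per code. Which reading is right can only be decided from the
font object.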
                                                                     
> Thanks,
> Alex
> 

Best regards, mabri



[Podofo-users] Could someone tell me the encode type of HexString followed by Tj

2022-04-12 Thread Alex
Hi,


When I open a PDF file with podofobrowser.exe and a PDF object has a
stream, podofobrowser.exe shows the content of the stream as follows:

BT
/F2 10.56 Tf
1 0 0 1 136.46 758.28 Tm
0 g
0 G
[(pdf)] TJ
ET

BT
/F1 10.56 Tf
1 0 0 1 154.94 758.28 Tm
0 g
0 G
[<08CF372D>] TJ
ET

In the first BT object, I can easily tell that the text string is "pdf" from
[(pdf)] TJ, but [<08CF372D>] TJ in the second BT object is hard to
understand. Could someone tell me how to interpret [<08CF372D>] and what
encoding it uses?








   Thanks,


  Alex