Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj
Hi Matthew,

Thanks for your reply. The partial content of my PDF file is shown below, and the file itself is attached. Please have a look.

%PDF-1.5
%[binary marker bytes]
1 0 obj <>>> endobj
2 0 obj <> endobj
3 0 obj <>/ExtGState<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.32 841.92] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>> endobj
4 0 obj <>
stream
[compressed binary stream data, not representable as text]
endstream
endobj
5 0 obj <> endobj
6 0 obj [ 7 0 R] endobj
7 0 obj <> endobj
8 0 obj <> endobj
9 0 obj <> endobj
10 0 obj <> endobj

Kind regards,
Alex

At 2022-04-12 20:15:50, "Matthew Brincke via Podofo-users" wrote:

> On Tuesday, 12.04.2022 at 18:34 +0800 Alex wrote:
> > Hi,
> >
> > When I open a PDF file with podofobrowser.exe and a PDF object has a
> > stream, podofobrowser.exe shows the content of the stream as follows:
> >
> > BT
> > /F2 10.56 Tf
> > 1 0 0 1 136.46 758.28 Tm
> > 0 g
> > 0 G
> > [(pdf)] TJ
> > ET
> >
> > BT
> > /F1 10.56 Tf
> > 1 0 0 1 154.94 758.28 Tm
> > 0 g
> > 0 G
> > [<08CF372D>] TJ
> > ET
> >
> > In the first BT object I can easily see that the text string is "pdf"
> > from [(pdf)] TJ, but [<08CF372D>] TJ in the second BT object is
> > difficult to understand. Could someone tell me how to interpret
> > [<08CF372D>] and what encoding it uses?
>
> Hi Alex,
>
> the text is given as a HexString (correct), but its encoding is
> determined by the encoding of the font referenced by the name /F1 in
> your PDF snippet (operator Tf). So if you'd like more information, I'd
> need the content of the PDF object whose reference is given after the
> name /F1 in the /Font dictionary of your page's /Resources, and also
> the contents of the objects whose references are given after /Encoding
> and /FontDescriptor in that object (long arrays can be abbreviated).
> Thanks in advance.
>
> Pleased to read from you again.
>
> > Thanks,
> > Alex
>
> Best regards,
> mabri

Attachment: daochu.pdf (Adobe PDF document)

___
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj
On Tue, 12 Apr 2022 at 18:09, Michal Sudolsky wrote:
>
> Just note that text position really does not depend on "m" or "l" operators
> like that code may misleadingly suggest (correct me if I am wrong):
>

You are 100% correct.
Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj
On Tue, Apr 12, 2022 at 5:24 PM Francesco Pretto wrote:

> On Tue, 12 Apr 2022 at 14:50, zyx wrote:
> > there exists a text extract tool [1], which is supposed to, well, extract
> > text from the PDF files.
> > [1] https://sourceforge.net/p/podofo/code/HEAD/tree/podofo/branches/PODOFO_0_9_7_BRANCH/tools/podofotxtextract/
>
> Correct: albeit many text related operators are not handled, that is
> the code to look in PoDoFo.

Just note that text position really does not depend on the "m" or "l" operators, as that code may misleadingly suggest (correct me if I am wrong):

    if( strcmp( pszToken, "l" ) == 0 ||
        strcmp( pszToken, "m" ) == 0 )
    {
        if( stack.size() == 2 )
        {
            dCurPosX = stack.top().GetReal();
            stack.pop();
            dCurPosY = stack.top().GetReal();
            stack.pop();

> Cheers,
> Francesco
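To illustrate the point about "m" and "l": in the PDF content stream model these two operators only build a path, while the text position comes from the text matrix, which is set by text operators such as Tm and Td. The following is a minimal sketch, not PoDoFo code. The names TextPos and TrackTextPosition are made up for this example, and the tokenizer simply splits on white space, so it assumes a stream without spaces inside strings or arrays.

```cpp
#include <array>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type: the origin of the current text matrix.
struct TextPos
{
    double x;
    double y;
};

// Walk a white-space separated content stream and track the text matrix.
// Only BT, Tm and Td are interpreted; "m", "l" and every other operator
// leave the text state untouched.
TextPos TrackTextPosition(const std::string& content)
{
    const std::array<double, 6> identity{{1, 0, 0, 1, 0, 0}};
    std::array<double, 6> tm = identity;  // text matrix [a b c d e f]
    std::array<double, 6> tlm = identity; // text line matrix
    std::vector<double> operands;

    std::istringstream in(content);
    std::string token;
    while (in >> token)
    {
        char* end = nullptr;
        const double value = std::strtod(token.c_str(), &end);
        if (end != token.c_str() && *end == '\0')
        {
            operands.push_back(value); // numeric operand
            continue;
        }

        if (token == "BT")
        {
            tm = identity; // BT resets both matrices
            tlm = identity;
        }
        else if (token == "Tm" && operands.size() == 6)
        {
            for (std::size_t i = 0; i < 6; ++i)
                tm[i] = operands[i];
            tlm = tm;
        }
        else if (token == "Td" && operands.size() == 2)
        {
            // Tlm = [1 0 0 1 tx ty] x Tlm, then Tm = Tlm
            const double tx = operands[0];
            const double ty = operands[1];
            tlm[4] = tx * tlm[0] + ty * tlm[2] + tlm[4];
            tlm[5] = tx * tlm[1] + ty * tlm[3] + tlm[5];
            tm = tlm;
        }
        // Any non-numeric token (operator, name, string) ends the operand list.
        operands.clear();
    }
    return TextPos{tm[4], tm[5]};
}
```

Feeding it a stream that draws a line with "m" and "l" before the BT block from this thread should leave the reported position at the values given to Tm, because the path operators are ignored for text state.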
Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj
On Tue, 12 Apr 2022 at 14:50, zyx wrote:
> there exists a text extract tool [1], which is supposed to, well, extract
> text from the PDF files.
> [1] https://sourceforge.net/p/podofo/code/HEAD/tree/podofo/branches/PODOFO_0_9_7_BRANCH/tools/podofotxtextract/

Correct: although many text-related operators are not handled, that is the code to look at in PoDoFo.

Cheers,
Francesco
Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj
On Tue, 2022-04-12 at 14:16 +0200, Francesco Pretto wrote:
> It's a complex task and PoDoFo doesn't expose a high level API to
> perform such text extraction. Also the handling of the different
> predefined/custom encodings that the PDF standard allows to use or
> define is incomplete and sometimes buggy.

Hi,

while I cannot speak to the accuracy or completeness of the code, there exists a text extract tool [1], which is supposed to, well, extract text from the PDF files. It can give at least an idea of what to do using the low level API of PoDoFo.

Bye,
zyx

[1] https://sourceforge.net/p/podofo/code/HEAD/tree/podofo/branches/PODOFO_0_9_7_BRANCH/tools/podofotxtextract/
Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj
Hello,

<08CF372D> is a hexadecimal string, which is basically a hex-encoded representation of a char/byte array. The exact encoding of this byte array is specified by the /Encoding key of the F1 font.

The PDF standard has optimizations to draw the glyphs representing the text as fast as possible. For this reason, the logical text often can't be retrieved directly from the TJ/Tj operators and must be mapped to Unicode code points by using the /ToUnicode map of the font. It's also possible that the logical text can be reconstructed only by geometrical considerations, such as finding chunks of the string that are close to each other and geometrically on the same line.

It's a complex task and PoDoFo doesn't expose a high level API to perform such text extraction. Also the handling of the different predefined/custom encodings that the PDF standard allows to use or define is incomplete and sometimes buggy.

Work is under way to expose a new API for text extraction, and it is working quite well. The API is expected to be introduced first in pdfmm (a fork of PoDoFo), with a proposed plan to merge it back to PoDoFo together with all the required enhancements to the handling of PDF encodings.

Regards,
Francesco

On Tue, 12 Apr 2022 at 12:35, Alex wrote:
>
> Hi,
>
> When I open a PDF file with podofobrowser.exe and a PDF object has a
> stream, podofobrowser.exe shows the content of the stream as follows:
>
> BT
> /F2 10.56 Tf
> 1 0 0 1 136.46 758.28 Tm
> 0 g
> 0 G
> [(pdf)] TJ
> ET
>
> BT
> /F1 10.56 Tf
> 1 0 0 1 154.94 758.28 Tm
> 0 g
> 0 G
> [<08CF372D>] TJ
> ET
>
> In the first BT object I can easily see that the text string is "pdf"
> from [(pdf)] TJ, but [<08CF372D>] TJ in the second BT object is
> difficult to understand. Could someone tell me how to interpret
> [<08CF372D>] and what encoding it uses?
>
> Thanks,
> Alex
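As a sketch of the two steps described in the message above (hex string to bytes, then character codes to Unicode through the font's map), the following standalone C++ may help. It does not use the PoDoFo API. All names in it (DecodeHexString, ToTwoByteCodes, ToUnicodeTable, LookupUnicode) are invented for this example, and the 2-byte grouping is only an assumption: the real code width depends on the /Encoding of font F1, which is not shown in this thread.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Decode the body of a PDF hex string (the text between '<' and '>') into bytes.
std::vector<std::uint8_t> DecodeHexString(const std::string& hex)
{
    std::vector<std::uint8_t> bytes;
    int high = -1; // pending high nibble, or -1 if there is none
    for (unsigned char ch : hex)
    {
        if (std::isspace(ch))
            continue; // white space inside a hex string is ignored
        if (!std::isxdigit(ch))
            throw std::invalid_argument("not a hexadecimal digit");
        const int value = std::isdigit(ch) ? ch - '0' : std::tolower(ch) - 'a' + 10;
        if (high < 0)
        {
            high = value;
        }
        else
        {
            bytes.push_back(static_cast<std::uint8_t>(high * 16 + value));
            high = -1;
        }
    }
    if (high >= 0) // odd number of digits: the missing final digit counts as 0
        bytes.push_back(static_cast<std::uint8_t>(high * 16));
    return bytes;
}

// ASSUMPTION: the font uses 2-byte, big-endian character codes.
// A trailing unpaired byte is dropped.
std::vector<std::uint16_t> ToTwoByteCodes(const std::vector<std::uint8_t>& bytes)
{
    std::vector<std::uint16_t> codes;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
        codes.push_back(static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1]));
    return codes;
}

// Hypothetical stand-in for a parsed /ToUnicode map: character code -> code point.
typedef std::map<std::uint16_t, char32_t> ToUnicodeTable;

// Return the mapped code point, or U+FFFD (replacement character) if unmapped.
char32_t LookupUnicode(const ToUnicodeTable& table, std::uint16_t code)
{
    const ToUnicodeTable::const_iterator it = table.find(code);
    return it != table.end() ? it->second : static_cast<char32_t>(0xFFFD);
}
```

For <08CF372D> the first step yields the bytes 08 CF 37 2D, and under the 2-byte assumption the codes 0x08CF and 0x372D. Which characters those codes stand for depends entirely on the font's actual /ToUnicode data, so the table has to be filled from the PDF itself.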
Re: [Podofo-users] Could someone tell me the encode type of HexString followed by Tj
On Tuesday, 12.04.2022 at 18:34 +0800 Alex wrote:
> Hi,
>
> When I open a PDF file with podofobrowser.exe and a PDF object has a
> stream, podofobrowser.exe shows the content of the stream as follows:
>
> BT
> /F2 10.56 Tf
> 1 0 0 1 136.46 758.28 Tm
> 0 g
> 0 G
> [(pdf)] TJ
> ET
>
> BT
> /F1 10.56 Tf
> 1 0 0 1 154.94 758.28 Tm
> 0 g
> 0 G
> [<08CF372D>] TJ
> ET
>
> In the first BT object I can easily see that the text string is "pdf"
> from [(pdf)] TJ, but [<08CF372D>] TJ in the second BT object is
> difficult to understand. Could someone tell me how to interpret
> [<08CF372D>] and what encoding it uses?

Hi Alex,

the text is given as a HexString (correct), but its encoding is determined by the encoding of the font referenced by the name /F1 in your PDF snippet (operator Tf). So if you'd like more information, I'd need the content of the PDF object whose reference is given after the name /F1 in the /Font dictionary of your page's /Resources, and also the contents of the objects whose references are given after /Encoding and /FontDescriptor in that object (long arrays can be abbreviated). Thanks in advance.

Pleased to read from you again.

> Thanks,
>
> Alex

Best regards,
mabri
[Podofo-users] Could someone tell me the encode type of HexString followed by Tj
Hi,

When I open a PDF file with podofobrowser.exe and a PDF object has a stream, podofobrowser.exe shows the content of the stream as follows:

BT
/F2 10.56 Tf
1 0 0 1 136.46 758.28 Tm
0 g
0 G
[(pdf)] TJ
ET

BT
/F1 10.56 Tf
1 0 0 1 154.94 758.28 Tm
0 g
0 G
[<08CF372D>] TJ
ET

In the first BT object I can easily see that the text string is "pdf" from [(pdf)] TJ, but [<08CF372D>] TJ in the second BT object is difficult to understand. Could someone tell me how to interpret [<08CF372D>] and what encoding it uses?

Thanks,
Alex