Re: Issues with extraction content of PDF files

John Hewson Thu, 31 Dec 2015 18:44:02 -0800

> On 29 Dec 2015, at 00:34, Zheng Lin Edwin Yeo <[email protected]> wrote:
> 
> Thanks for your reply Tilman.
> 
> I would like to find out: is the content extraction issue with this file
> caused by the Identity-H encoding?


Most likely. Identity-H is basically just "no encoding": the 2-byte codes map 
straight to CIDs (typically glyph indices), not to characters, so there needs 
to be a ToUnicode map in order to extract the text (which there isn't).
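
To illustrate the point, here is a minimal sketch in plain Python (no PDF 
library involved). The byte string and the ToUnicode table are made-up sample 
data, and the helper names identity_h_codes and extract_text are hypothetical, 
not part of PDFBox or any other API. It only shows why the codes alone are not 
enough to recover text:

```python
from typing import Dict, List, Optional

REPLACEMENT = "\ufffd"  # stands in for "no Unicode value known"


def identity_h_codes(data: bytes) -> List[int]:
    """Split a string shown with an Identity-H font into 2-byte big-endian codes."""
    if len(data) % 2:
        raise ValueError("Identity-H strings use two bytes per code")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def extract_text(data: bytes, to_unicode: Optional[Dict[int, str]]) -> str:
    """Map codes to text via a ToUnicode-style table, if one is available."""
    codes = identity_h_codes(data)
    if to_unicode is None:
        # Without a ToUnicode map the codes are only glyph identifiers,
        # so nothing meaningful can be extracted from them.
        return REPLACEMENT * len(codes)
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)


# Hypothetical sample data: five 2-byte codes and a matching ToUnicode table.
shown = bytes.fromhex("002b0048004f004f0052")
to_unicode = {0x002B: "H", 0x0048: "e", 0x004F: "l", 0x0052: "o"}

print(identity_h_codes(shown))          # the raw codes
print(extract_text(shown, to_unicode))  # readable text, thanks to the table
print(extract_text(shown, None))        # no table: only replacement characters
```

A real extractor may fall back on other hints (for example glyph names in the 
embedded font), but those are heuristics and not guaranteed to work.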

-- John

> Regards,
> Edwin
> 
> 
>> On 21 December 2015 at 16:12, Tilman Hausherr <[email protected]> wrote:
>>> Am 21.12.2015 um 04:08 schrieb Zheng Lin Edwin Yeo:
>>> Thanks for your reply.
>>> 
>>> I tried Adobe Acrobat Pro DC and it is able to open the file, but when
>>> the file is opened in Adobe Reader, not all of the text can be extracted
>>> properly.
>>> 
>>> Is there any way we can check what type of encoding is used for the
>>> PDF files?
>> 
>> Yes, in the font dictionaries, as you can see from this screenshot:
>> 
>> [screenshot not preserved in the archive]
>> 
>> However this won't get you the text, obviously.
>> 
>> Tilman
>> 
>>> Regards,
>>> Edwin
>>> 
>>> 
>>> 
>>> 
>>> On 19 December 2015 at 03:07, Tilman Hausherr <[email protected]> wrote:
>>> 
>>>>> Am 18.12.2015 um 18:57 schrieb Zheng Lin Edwin Yeo:
>>>>> 
>>>>> I've shared one of the files with the issue on Dropbox, which you can
>>>>> access via the link here:
>>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>>>>> 
>>>> Adobe Reader is also unable to extract text.
>>>> 
>> 
>
