Re: Issues when indexing PDF files

Walter Underwood Thu, 17 Dec 2015 07:59:12 -0800

PDF isn’t really text. For example, it often doesn’t store spaces; it just 
moves the next letter over farther. Letters might not be in reading order: 
two-column text could be emitted as horizontal scans across both columns. 
Custom fonts might not use an encoding that maps to Unicode, which effectively 
makes the text encrypted (badly). And so on.
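
To make the point about spaces and reading order concrete, here is a minimal 
sketch of the guessing an extractor has to do when all it gets is positioned 
glyphs. The names (Glyph, glyphs_to_text, space_gap, line_tol) and the 
thresholds are hypothetical, chosen for illustration; this is not how Tika or 
any particular PDF library implements it.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # character this glyph is assumed to stand for
    x: float      # left edge on the page, in points
    y: float      # baseline height, in points (larger means higher up)
    width: float  # horizontal advance, in points


def glyphs_to_text(glyphs, space_gap=1.5, line_tol=2.0):
    """Guess lines and spaces from glyph positions alone (hypothetical helper)."""
    lines = []
    # Group glyphs into lines: walk from the top of the page down and start
    # a new line whenever the baseline differs by more than line_tol.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right within a line
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A gap wider than space_gap is treated as a word break.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


# Glyphs deliberately listed out of reading order, with no space characters.
page = [
    Glyph("P", 12, 100, 6), Glyph("k", 5, 88, 5), Glyph("H", 0, 100, 6),
    Glyph("F", 24, 100, 6), Glyph("i", 6, 100, 3), Glyph("o", 0, 88, 5),
    Glyph("D", 18, 100, 6),
]
print(glyphs_to_text(page))
```

Note that this naive grouping by baseline reads straight across the page, so 
two-column text would come out interleaved, and it assumes each glyph already 
maps to the right character, which a custom font encoding does not guarantee.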


As one of my coworkers said, trying to turn a PDF into structured text is like 
trying to turn hamburger back into a cow.

PDF is where text goes to die.
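
For the symptom reported in this thread (content indexed as a run of "??????" 
or as nothing at all), one option is to flag such documents before or after 
indexing so they can be handled separately. A rough sketch, assuming the 
extracted text is already available as a string; the function name and the 
30% threshold are hypothetical and would need tuning on real data.

```python
def looks_unextractable(text, max_bad_ratio=0.3):
    """Return True if extracted text is empty or mostly placeholder characters."""
    # Ignore all whitespace so that blank or whitespace-only output counts as empty.
    stripped = "".join(text.split())
    if not stripped:
        return True
    # Count '?' and U+FFFD (the Unicode replacement character), which commonly
    # stand in for characters that could not be mapped.
    bad = sum(1 for ch in stripped if ch in "?\ufffd")
    return bad / len(stripped) > max_bad_ratio


print(looks_unextractable("??????"))
print(looks_unextractable("Is this readable? Yes."))
```

A check like this only detects the failure; it cannot recover the text.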

Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 17, 2015, at 2:48 AM, Charlie Hull <char...@flax.co.uk> wrote:
> 
> On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
>> Hi Alexandre,
>> 
>> Thanks for your reply.
>> 
>> So the only way to solve this issue is to explore with PDF specific tools
>> and change the encoding of the file?
>> Is there any way to configure it in Solr?
> 
> Solr uses Tika to extract plain text from PDFs. If the PDFs have been created 
> in a way that Tika cannot easily extract the text, there's nothing you can do 
> in Solr that will help.
> 
> Unfortunately PDF isn't a content format but a presentation format - so 
> extracting plain text is fraught with difficulty. You may see a character on 
> a PDF page, but exactly how that character is generated (using a specific 
> encoding, font, or even by drawing a picture) is outside your control. There 
> are various businesses built on this premise - they charge for creating clean 
> extracted text from PDFs - and even they have trouble with some PDFs.
> 
> HTH
> 
> Charlie
> 
>> 
>> Regards,
>> Edwin
>> 
>> 
>> On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
>> wrote:
>> 
>>> They could be using custom fonts and non-Unicode characters. That's
>>> probably something to explore with PDF specific tools.
>>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
>>> wrote:
>>> 
>>>> I've checked all the files that have problems with their content in the
>>>> Solr index using the Tika app. All of them show the same issues as what I
>>>> see in the Solr index.
>>>> 
>>>> So does the issue lie with the encoding of the file? Are we able to check
>>>> the encoding of the file?
>>>> 
>>>> 
>>>> Regards,
>>>> Edwin
>>>> 
>>>> 
>>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi Erik,
>>>>> 
>>>>> I've shared the file on Dropbox, which you can access via the link here:
>>>>> 
>>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>>>>> 
>>>>> This is what I get from the Tika app after dropping the file in.
>>>>> 
>>>>> Content-Length: 75092
>>>>> Content-Type: application/pdf
>>>>> Type: COSName{Info}
>>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
>>>>> X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
>>>>> access_permission:assemble_document: true
>>>>> access_permission:can_modify: true
>>>>> access_permission:can_print: true
>>>>> access_permission:can_print_degraded: true
>>>>> access_permission:extract_content: true
>>>>> access_permission:extract_for_accessibility: true
>>>>> access_permission:fill_in_form: true
>>>>> access_permission:modify_annotations: true
>>>>> dc:format: application/pdf; version=1.3
>>>>> pdf:PDFVersion: 1.3
>>>>> pdf:encrypted: false
>>>>> producer: null
>>>>> resourceName: Desmophen+670+BAe.pdf
>>>>> xmpTPg:NPages: 3
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Edwin
>>>>> 
>>>>> 
>>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Edwin - Can you share one of those PDF files?
>>>>>> 
>>>>>> Also, drop the file into the Tika app and see what it sees directly:
>>>>>> get the tika-app JAR and run that desktop application.
>>>>>> 
>>>>>> Could be an encoding issue?
>>>>>> 
>>>>>>         Erik
>>>>>> 
>>>>>> —
>>>>>> Erik Hatcher, Senior Solutions Architect
>>>>>> http://www.lucidworks.com/
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo
>>>>>>> <edwinye...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I'm using Solr 5.3.0
>>>>>>> 
>>>>>>> I'm indexing some PDF documents. However, certain PDF files contain
>>>>>>> Chinese text, and after indexing, what is indexed in the content is
>>>>>>> either a series of "??????" or empty.
>>>>>>> 
>>>>>>> I'm using the post.jar that comes together with Solr.
>>>>>>> 
>>>>>>> What could be the reason that causes this?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Edwin
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 
> -- 
> Charlie Hull
> Flax - Open Source Enterprise Search
> 
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
