Hi,

Tweaking the parameters didn’t help.
Here is part of the PDF in question:
https://dl.dropboxusercontent.com/u/2456015/problem.pdf

Best regards,
Augusto
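
For readers following the thread: the two settings suggested in the quoted
reply below, setSpacingTolerance() and setAverageCharTolerance(), both
influence how large a horizontal gap must be before the text stripper emits a
space. The sketch below is a simplified model of that idea in plain Java, not
PDFBox's actual implementation. The class name, the insertsSpace helper, and
the widths used in main are all hypothetical and chosen only for illustration.

```java
// Simplified model of a gap-based space heuristic. This is an assumption
// made for illustration, not code taken from PDFBox.
class SpacingToleranceSketch {

    /**
     * Returns true when the gap between two glyph runs is treated as a word
     * break. In this model a space is assumed when the gap exceeds the
     * smaller of two thresholds:
     *   spaceWidth       * spacingTolerance
     *   averageCharWidth * averageCharTolerance
     */
    public static boolean insertsSpace(float gap,
                                       float spaceWidth,
                                       float averageCharWidth,
                                       float spacingTolerance,
                                       float averageCharTolerance) {
        float bySpaceWidth = spaceWidth * spacingTolerance;
        float byCharWidth = averageCharWidth * averageCharTolerance;
        return gap > Math.min(bySpaceWidth, byCharWidth);
    }

    public static void main(String[] args) {
        // Hypothetical measurements in text-space units.
        float spaceWidth = 2.5f;
        float averageCharWidth = 5.0f;
        float gap = 1.4f;

        // Try values around the 0.5f default mentioned in the thread,
        // keeping the average-char tolerance at its 0.3f default.
        for (float tolerance : new float[] {0.4f, 0.5f, 0.6f}) {
            boolean space = insertsSpace(gap, spaceWidth, averageCharWidth,
                                         tolerance, 0.3f);
            System.out.println("spacingTolerance=" + tolerance
                    + " insertsSpace=" + space);
        }
    }
}
```

In this model, raising a tolerance makes a space less likely to be inferred
from a gap. A threshold of this kind only affects spaces derived from glyph
positions; it does not touch space characters that are already part of the
string operands.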

> On 31 May 2016, at 22:44, Tilman Hausherr <thaush...@t-online.de> wrote:
> 
> Looks like a different problem. Assuming you're using the latest version, you 
> might want to try setting
> 
> PDFTextStripper.setSpacingTolerance()
> 
> the default is 0.5f
> 
> So try some values slightly above or below, e.g. 0.4f or 0.6f.
> 
> another one is
> 
> setAverageCharTolerance()
> 
> the default is 0.3f.
> 
> Tilman
> 
> On 31.05.2016 at 22:36, Augusto Ribeiro Silva wrote:
>> Hi,
>> 
>> PDFDebugger shows the following.
>>   (The ) Tj
>>   22.7679 0 Td
>>   (es t) Tj
>>   12.2023 0 Td
>>   (ab lis) Tj
>>   20.7981 0 Td
>>   (h m) Tj
>>   14.0054 0 Td
>>   (ent ) Tj
>>   19.1013 0 Td
>>   (of ) Tj
>>   14.83369 0 Td
>>   (an ) Tj
>>   16.0359 0 Td
>>   (in te gr) Tj
>>   25.72701 0 Td
>>   (ate) Tj
>>   12.80299 0 Td
>>   (d ) Tj
>> 
>> I am not sure if it is the same problem. I will try to get permission to 
>> upload the document somewhere tomorrow.
>> 
>> Best regards,
>> Augusto
>> 
>>> On 31 May 2016, at 18:23, Tilman Hausherr <thaush...@t-online.de> wrote:
>>> 
>>> Please upload the file somewhere. If you've used PDFDebugger before, have a 
>>> look here:
>>> https://issues.apache.org/jira/browse/PDFBOX-3248
>>> and then check whether your content stream shows the same problem.
>>> 
>>> Tilman
>>> 
>>> On 31.05.2016 at 15:22, Augusto Ribeiro Silva wrote:
>>>> Hi all,
>>>> 
>>>> I am using the PDFBox Java library to read the content of some PDFs, and it 
>>>> seems to insert some weird (hyphenation-like) spacing. I get the same 
>>>> result using the PDFBox-App command line util.
>>>> 
>>>> The es tab lish ment of an in te grated Part ner Re la tion ship Man age 
>>>> ment (PRM) sys tem can po ten tially ad dress sev eral as pets
>>>> 
>>>> I tried to extract text from the same PDF using the pdftotext command line 
>>>> utility, and it extracts the text correctly:
>>>> The establishment of an integrated Partner Relationship Management (PRM) 
>>>> system can potentially address several aspects
>>>> 
>>>> Does anybody have an idea why PDFBox behaves this way, or any tips for 
>>>> fixing it? I am using TIKA, but as I understand it, TIKA uses PDFBox for PDF 
>>>> processing underneath.
>>>> 
>>>> Best regards,
>>>> Augusto
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>>> 