Hi

Thanks for the help. I tried the fix in a snapshot build of my own and it seems 
to solve the problem.

I am afraid I can’t help you with the naming :) Both seem fine, but I guess you 
need to know what the option does beforehand, because it is hard to tell from the name.

Best regards,
Augusto

> On 01 Jun 2016, at 17:52, Tilman Hausherr <thaush...@t-online.de> wrote:
> 
> Ignore what I wrote yesterday evening. Your content stream excerpt shows that 
> the spaces are already there. Using Adobe Reader shows the same problem. Your 
> file is similar to
> https://issues.apache.org/jira/browse/PDFBOX-3248
> and I just tested the solution I mentioned there, and here's the result:
> 
> ===
> losses equitably and the outcome of the collaboration must be quantifiably
> beneficial to everyone. The objective is to maximise benefits while mini-
> mising costs.
> ===
> 
> What I could do is this: add the logic mentioned in that issue as an option 
> that is disabled by default. But I won't do it today, because a release is 
> planned. You could use a snapshot, or build it yourself.
> 
> Another problem is that I can't come up with a name:
> 
> setIgnoreHardSpaces ?
> 
> setFullSpacesHeuristics ?
> 
> Tilman
> 
> 
> 
> 
> On 01.06.2016 at 13:59, Augusto Ribeiro Silva wrote:
>> Hi,
>> 
>> Tweaking the parameters didn’t help.
>> Here is a part of the pdf in question - 
>> https://dl.dropboxusercontent.com/u/2456015/problem.pdf
>> 
>> Best regards,
>> Augusto
>> 
>>> On 31 May 2016, at 22:44, Tilman Hausherr <thaush...@t-online.de> wrote:
>>> 
>>> Looks like a different problem. Assuming you're using the latest version, 
>>> you might want to try setting
>>> 
>>> PDFTextStripper.setSpacingTolerance()
>>> 
>>> the default is 0.5f
>>> 
>>> So try some values slightly above or below, e.g. 0.4f, 0.6f, etc.
>>> 
>>> another one is
>>> 
>>> setAverageCharTolerance()
>>> 
>>> the default is 0.3f.
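Both setters tune thresholds that the stripper uses to estimate where spaces should be added, and larger values reduce the number of spaces added. The following is a deliberately simplified model in plain Java, not PDFBox's actual code: the class `GapHeuristic`, the method `isWordBreak`, the sample widths, and the exact rule (a gap wider than the smaller of `spaceWidth * spacingTolerance` and `averageCharWidth * averageCharTolerance` counts as a word break) are all assumptions made for illustration.

```java
// Simplified, hypothetical model of a gap-based word-break heuristic.
// It only illustrates how two tolerance factors can act as thresholds.
final class GapHeuristic {

    // Returns true when the horizontal gap between two text fragments is
    // wider than the smaller of the two tolerance-scaled widths.
    static boolean isWordBreak(float gap, float spaceWidth, float averageCharWidth,
                               float spacingTolerance, float averageCharTolerance) {
        float bySpace = spaceWidth * spacingTolerance;
        float byChar = averageCharWidth * averageCharTolerance;
        return gap > Math.min(bySpace, byChar);
    }

    public static void main(String[] args) {
        // Made-up measurements in text-space units.
        float spaceWidth = 2.5f;
        float averageCharWidth = 5.0f;
        float gap = 1.4f;

        // Try spacing tolerances around the 0.5f default mentioned above.
        for (float tolerance : new float[] {0.4f, 0.5f, 0.6f}) {
            boolean wordBreak = isWordBreak(gap, spaceWidth, averageCharWidth, tolerance, 0.3f);
            System.out.println("spacingTolerance=" + tolerance + " wordBreak=" + wordBreak);
        }
    }
}
```

In this model a larger tolerance raises the threshold, so fewer gaps are treated as word breaks. It only governs spaces that are guessed from positions, not space characters that are already present in the text itself.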
>>> 
>>> Tilman
>>> 
>>> On 31.05.2016 at 22:36, Augusto Ribeiro Silva wrote:
>>>> Hi,
>>>> 
>>>> PDFDebugger shows the following.
>>>>  (The ) Tj
>>>>   22.7679 0 Td
>>>>   (es t) Tj
>>>>   12.2023 0 Td
>>>>   (ab lis) Tj
>>>>   20.7981 0 Td
>>>>   (h m) Tj
>>>>   14.0054 0 Td
>>>>   (ent ) Tj
>>>>   19.1013 0 Td
>>>>   (of ) Tj
>>>>   14.83369 0 Td
>>>>   (an ) Tj
>>>>   16.0359 0 Td
>>>>   (in te gr) Tj
>>>>   25.72701 0 Td
>>>>   (ate) Tj
>>>>   12.80299 0 Td
>>>>   (d ) Tj
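In the excerpt above, the space characters sit inside the string operands of the `Tj` operators themselves. As a minimal sketch in plain Java (no PDFBox; the class `TjOperands` and its `join` method are made up for illustration, and the regex ignores escapes and nested parentheses), joining those operands in order gives the same broken spacing as the extracted text reported in the first message:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a tiny scanner for "(...) Tj" operators in a content
// stream excerpt. It ignores escapes, nested parentheses, fonts and the
// Td positioning operators.
final class TjOperands {

    private static final Pattern SHOW_TEXT = Pattern.compile("\\(([^()]*)\\)\\s*Tj");

    // Joins the literal string operands of all Tj operators, in stream order.
    static String join(String contentStream) {
        StringBuilder text = new StringBuilder();
        Matcher matcher = SHOW_TEXT.matcher(contentStream);
        while (matcher.find()) {
            text.append(matcher.group(1));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The excerpt shown by PDFDebugger, copied from the message above.
        String excerpt = String.join("\n",
                "(The ) Tj", "22.7679 0 Td",
                "(es t) Tj", "12.2023 0 Td",
                "(ab lis) Tj", "20.7981 0 Td",
                "(h m) Tj", "14.0054 0 Td",
                "(ent ) Tj", "19.1013 0 Td",
                "(of ) Tj", "14.83369 0 Td",
                "(an ) Tj", "16.0359 0 Td",
                "(in te gr) Tj", "25.72701 0 Td",
                "(ate) Tj", "12.80299 0 Td",
                "(d ) Tj");
        // Brackets make the trailing space visible.
        System.out.println("[" + join(excerpt) + "]");
    }
}
```

This prints `[The es tab lish ment of an in te grated ]`, so the unwanted spaces come straight from the file and not from a spacing heuristic.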
>>>> 
>>>> I am not sure if it is the same problem. I will try to get permission to 
>>>> upload the document somewhere tomorrow.
>>>> 
>>>> Best regards,
>>>> Augusto
>>>> 
>>>>> On 31 May 2016, at 18:23, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>> 
>>>>> Please upload the file somewhere. If you've used PDFDebugger before, have 
>>>>> a look here:
>>>>> https://issues.apache.org/jira/browse/PDFBOX-3248
>>>>> and then check your content stream to see whether it is the same problem.
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> On 31.05.2016 at 15:22, Augusto Ribeiro Silva wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> I am using the PDFBox Java library to read the content of some PDFs and it 
>>>>>> seems to insert some weird (hyphen-like) spacing. I get the same 
>>>>>> result using the PDFBox-App command line util.
>>>>>> 
>>>>>> The es tab lish ment of an in te grated Part ner Re la tion ship Man age 
>>>>>> ment (PRM) sys tem can po ten tially ad dress sev eral as pets
>>>>>> 
>>>>>> When I extract text from the same PDF using the pdftotext command 
>>>>>> line utility, it extracts the text correctly:
>>>>>> The establishment of an integrated Partner Relationship Management (PRM) 
>>>>>> system can potentially address several aspects
>>>>>> 
>>>>>> Does anybody have any idea why PDFBox behaves this way, and any tips 
>>>>>> for fixing it? I am using TIKA, but as I understand it, TIKA uses PDFBox for 
>>>>>> PDF processing underneath.
>>>>>> 
>>>>>> Best regards,
>>>>>> Augusto
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org