Hi,

Thanks for the help. I tried that fix in a snapshot build of my own and it seems to solve the problem.
I am afraid I can’t help you with the naming :) Both seem fine, but I guess you need to know what the option does anyway, because it is hard to work that out from the name alone.

Best regards,
Augusto

> On 01 Jun 2016, at 17:52, Tilman Hausherr <thaush...@t-online.de> wrote:
>
> Ignore what I wrote yesterday evening. Your content stream excerpt shows that
> the spaces are already there. Using Adobe Reader shows the same problem. Your
> file is similar to
> https://issues.apache.org/jira/browse/PDFBOX-3248
> and I just tested the solution I mentioned there, and here's the result:
>
> ===
> losses equitably and the outcome of the collaboration must be quantifiably
> beneficial to everyone. The objective is to maximise benefits while mini-
> mising costs.
> ===
>
> What I could do is this: add the logic mentioned in that issue as an option,
> that is disabled by default. But I won't do it today, because a release is
> planned. You could use a snapshot, or build yourself.
>
> Another problem is that I can't come up with a name
>
> setIgnoreHardSpaces ?
>
> setFullSpacesHeuristics ?
>
> Tilman
>
> On 01.06.2016 at 13:59, Augusto Ribeiro Silva wrote:
>> Hi,
>>
>> Tweaking the parameters didn’t help.
>> Here is a part of the pdf in question -
>> https://dl.dropboxusercontent.com/u/2456015/problem.pdf
>>
>> Best regards,
>> Augusto
>>
>>> On 31 May 2016, at 22:44, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>
>>> Looks like a different problem. Assuming you're using the latest version,
>>> you might want to try setting
>>>
>>> PDFTextStripper.setSpacingTolerance()
>>>
>>> the default is 0.5f
>>>
>>> So try some values slightly above or below, i.e. 0.4f, 0.6f, etc.
>>>
>>> another one is
>>>
>>> setAverageCharTolerance()
>>>
>>> the default is 0.3f.
>>>
>>> Tilman
>>>
>>> On 31.05.2016 at 22:36, Augusto Ribeiro Silva wrote:
>>>> Hi,
>>>>
>>>> PDFDebugger shows the following.
>>>> (The ) Tj
>>>> 22.7679 0 Td
>>>> (es t) Tj
>>>> 12.2023 0 Td
>>>> (ab lis) Tj
>>>> 20.7981 0 Td
>>>> (h m) Tj
>>>> 14.0054 0 Td
>>>> (ent ) Tj
>>>> 19.1013 0 Td
>>>> (of ) Tj
>>>> 14.83369 0 Td
>>>> (an ) Tj
>>>> 16.0359 0 Td
>>>> (in te gr) Tj
>>>> 25.72701 0 Td
>>>> (ate) Tj
>>>> 12.80299 0 Td
>>>> (d ) Tj
>>>>
>>>> I am not sure if it is the same problem. I will try to get permission to
>>>> upload the document somewhere tomorrow.
>>>>
>>>> Best regards,
>>>> Augusto
>>>>
>>>>> On 31 May 2016, at 18:23, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>>
>>>>> Please upload the file somewhere. If you've used PDFDebugger before, have
>>>>> a look here:
>>>>> https://issues.apache.org/jira/browse/PDFBOX-3248
>>>>> and then look at your content stream whether it is the same problem.
>>>>>
>>>>> Tilman
>>>>>
>>>>> On 31.05.2016 at 15:22, Augusto Ribeiro Silva wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I am using PDFBox java library to read the content of some PDFs and it
>>>>>> seems like it inserts some weird (hyphen-like) spacing. I get the same
>>>>>> result using the PDFBox-App command line util.
>>>>>>
>>>>>> The es tab lish ment of an in te grated Part ner Re la tion ship Man age
>>>>>> ment (PRM) sys tem can po ten tially ad dress sev eral as pets
>>>>>>
>>>>>> I tried to extract text from the same PDF using the pdftotext command
>>>>>> line utility it extracts the text correctly:
>>>>>> The establishment of an integrated Partner Relationship Management (PRM)
>>>>>> system can potentially address several aspects
>>>>>>
>>>>>> Does somebody have any idea why PDFBox behaves in this way and any tips
>>>>>> to fixing it? I am using TIKA but as I understood TIKA uses PDFBox for
>>>>>> PDF processing underneath.
>>>>>>
>>>>>> Best regards,
>>>>>> Augusto
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
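
As a footnote to the thread above: Tilman's remark that the spaces are already in the content stream can be checked directly from the PDFDebugger excerpt. The sketch below is plain Java with no PDFBox dependency. The class name `ContentStreamText` and its methods are hypothetical names made up for illustration, and the regular expression is a simplification that ignores escaped or nested parentheses and the `TJ` array operator. It concatenates the string operands of each `Tj` operator and skips the `Td` moves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStreamText {
    // A literal string operand followed by the Tj (show text) operator.
    // Simplification: escaped or nested parentheses are not handled.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(([^()]*)\\)\\s*Tj");

    // Returns the raw string operand of every Tj operator, in order.
    public static List<String> shownStrings(String contentStream) {
        List<String> result = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    // Joins the operands with nothing in between, so every space in the
    // result comes from the PDF itself, not from any spacing heuristic.
    public static String joinShownStrings(String contentStream) {
        return String.join("", shownStrings(contentStream));
    }

    public static void main(String[] args) {
        // The excerpt posted in the thread, one operator per line.
        String excerpt = String.join("\n",
                "(The ) Tj", "22.7679 0 Td",
                "(es t) Tj", "12.2023 0 Td",
                "(ab lis) Tj", "20.7981 0 Td",
                "(h m) Tj", "14.0054 0 Td",
                "(ent ) Tj", "19.1013 0 Td",
                "(of ) Tj", "14.83369 0 Td",
                "(an ) Tj", "16.0359 0 Td",
                "(in te gr) Tj", "25.72701 0 Td",
                "(ate) Tj", "12.80299 0 Td",
                "(d ) Tj");
        // prints "[The es tab lish ment of an in te grated ]"
        System.out.println("[" + joinShownStrings(excerpt) + "]");
    }
}
```

Since the spaces sit inside the strings themselves, the tolerance settings, which affect spaces inferred from gaps between text positions, would not be expected to remove them. That is consistent with the report that tweaking the parameters did not help.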