Hi,

Tweaking the parameters didn't help. Here is part of the PDF in question:
https://dl.dropboxusercontent.com/u/2456015/problem.pdf
Best regards,
Augusto

> On 31 May 2016, at 22:44, Tilman Hausherr <thaush...@t-online.de> wrote:
>
> Looks like a different problem. Assuming you're using the latest version, you
> might want to try setting
>
> PDFTextStripper.setSpacingTolerance()
>
> the default is 0.5f
>
> So try some values slightly above or below, e.g. 0.4f, 0.6f, etc.
>
> Another one is
>
> setAverageCharTolerance()
>
> the default is 0.3f.
>
> Tilman
>
> On 31.05.2016 at 22:36, Augusto Ribeiro Silva wrote:
>> Hi,
>>
>> PDFDebugger shows the following:
>> (The ) Tj
>> 22.7679 0 Td
>> (es t) Tj
>> 12.2023 0 Td
>> (ab lis) Tj
>> 20.7981 0 Td
>> (h m) Tj
>> 14.0054 0 Td
>> (ent ) Tj
>> 19.1013 0 Td
>> (of ) Tj
>> 14.83369 0 Td
>> (an ) Tj
>> 16.0359 0 Td
>> (in te gr) Tj
>> 25.72701 0 Td
>> (ate) Tj
>> 12.80299 0 Td
>> (d ) Tj
>>
>> I am not sure if it is the same problem. I will try to get permission to
>> upload the document somewhere tomorrow.
>>
>> Best regards,
>> Augusto
>>
>>> On 31 May 2016, at 18:23, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>
>>> Please upload the file somewhere. If you've used PDFDebugger before, have a
>>> look here:
>>> https://issues.apache.org/jira/browse/PDFBOX-3248
>>> and then check whether your content stream shows the same problem.
>>>
>>> Tilman
>>>
>>> On 31.05.2016 at 15:22, Augusto Ribeiro Silva wrote:
>>>> Hi all,
>>>>
>>>> I am using the PDFBox Java library to read the content of some PDFs and it
>>>> seems to insert some weird (hyphen-like) spacing. I get the same
>>>> result using the PDFBox-App command line utility.
>>>>
>>>> The es tab lish ment of an in te grated Part ner Re la tion ship Man age
>>>> ment (PRM) sys tem can po ten tially ad dress sev eral as pets
>>>>
>>>> When I extract text from the same PDF using the pdftotext command line
>>>> utility, it extracts the text correctly:
>>>> The establishment of an integrated Partner Relationship Management (PRM)
>>>> system can potentially address several aspects
>>>>
>>>> Does anybody have any idea why PDFBox behaves this way, and any tips for
>>>> fixing it? I am using TIKA, but as I understand it TIKA uses PDFBox for PDF
>>>> processing underneath.
>>>>
>>>> Best regards,
>>>> Augusto

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
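
Note on the two tolerance settings mentioned in the thread: the sketch below is a simplified, hypothetical model of how gap-based tolerances of this kind are typically used to decide whether a word separator is synthesized between two text runs. It is not PDFBox's code. The class name GapModel, its method names, and the exact formula are assumptions made for illustration; only the two default values (0.5f and 0.3f) come from the thread.

```java
// Simplified, hypothetical model of a gap-based word-separator decision.
// NOT PDFBox's implementation: the names and the formula are assumptions.
final class GapModel {
    // Defaults as quoted in the thread for setSpacingTolerance() and
    // setAverageCharTolerance().
    static final float DEFAULT_SPACING_TOLERANCE = 0.5f;
    static final float DEFAULT_AVERAGE_CHAR_TOLERANCE = 0.3f;

    private GapModel() {
    }

    /**
     * Smallest horizontal gap that this model treats as a word break.
     * Two candidate thresholds are computed, one from the width of a space
     * and one from the average character width; the smaller one is used.
     */
    static float threshold(float spaceWidth, float averageCharWidth,
                           float spacingTolerance, float averageCharTolerance) {
        float bySpace = spaceWidth * spacingTolerance;
        float byChar = averageCharWidth * averageCharTolerance;
        return Math.min(bySpace, byChar);
    }

    /** Returns true if the model would synthesize a separator for this gap. */
    static boolean insertsSeparator(float gap, float spaceWidth, float averageCharWidth,
                                    float spacingTolerance, float averageCharTolerance) {
        return gap > threshold(spaceWidth, averageCharWidth,
                               spacingTolerance, averageCharTolerance);
    }
}
```

In this model, a gap of 1.8 units with a space width of 4 and an average character width of 5 counts as a word break at the default tolerances, but no longer does once the average-character tolerance is raised to 0.6f. Tolerances of this kind only affect separators derived from positioning gaps. They would not remove a space that is already part of the decoded string, as the (es t) Tj entries in the PDFDebugger output suggest, which may be why tweaking them made no difference here.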