I guessed it was something like that... Do you think it's because it was generated with iText?
On Mon, Aug 10, 2015 at 6:35 PM, Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
>
> On 10.08.2015 at 13:22, Gilad Denneboom wrote:
>
>> Hi Andreas,
>>
>> Of course the output itself is different, but I would expect that the
>> underlying text each tool processes would be the same, and it's not. Have a
>> look at the first line in the PrintTextLocations output file:
>> String[472.89,54.0 fs=10.0 xscale=10.0 height=7.21 space=2.5 width=2.7799988]:
>> It is repeated, with exactly the same information, 12 times throughout the
>> output, lines 1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901 and 991.
>>
>> Why would the same information be processed 12 times in a single run?
>
> The pdf contains a lot of redundant information, e.g. the header is
> repeated several times (I didn't count them but I guess it's 12 times).
> PDFTextStripper eliminates overlapping text/characters and
> PrintTextLocations doesn't.
>
> BR
> Andreas
>
>> Gilad
>>
>> On Mon, Aug 10, 2015 at 12:18 PM, Andreas Lehmkühler <[email protected]> wrote:
>>
>>> Hi Gilad,
>>>
>>> sorry for the late answer ....
>>>
>>> I'm not sure what you're expecting. You are using 2 totally different
>>> approaches to process a pdf. PrintTextLocations provides a lot of
>>> additional information for every piece of text, which may vary from one
>>> character up to whole words or lines of text. Consequently the output has
>>> to be totally different and of course much bigger than the output of a
>>> simple text extraction.
>>>
>>> BR
>>> Andreas
>>>
>>>> Gilad Denneboom <[email protected]> wrote on 10 August 2015 at 10:05:
>>>>
>>>> No one has any ideas?
>>>>
>>>> On Thu, Aug 6, 2015 at 5:49 PM, Gilad Denneboom <[email protected]> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm looking for advice on a problem I'm encountering where the output of
>>>>> PDFTextStripper and PrintTextLocations is dramatically different when
>>>>> processing the same file.
>>>>> For some reason, the output of PrintTextLocations is 12 times longer than
>>>>> that of PDFTextStripper, i.e. the entire text is printed out 12 times,
>>>>> instead of just once.
>>>>>
>>>>> I'm attaching the file in question, as well as the output produced using
>>>>> both methods via Google Drive... Hopefully it will come through.
>>>>>
>>>>> I'd appreciate any ideas as to what might be causing this issue (I'm
>>>>> guessing there's something wrong with the structure of the file), and of
>>>>> course any possible solutions.
>>>>>
>>>>> Thanks in advance, Gilad.
>>>>>
>>>>> PS. I'm using 1.8.10.
>>>>>
>>>>> output problem.zip
>>>>> <https://drive.google.com/file/d/0B_eBFHMNjkhseTVaQ0FxSkdmZUE/view?usp=drive_web>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
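
For anyone reading this thread later: the difference Andreas describes (PDFTextStripper drops overlapping text, PrintTextLocations does not) can be illustrated with a small standalone sketch. This is not PDFBox code. The `Glyph` class, the `dropOverlaps` helper, the 0.5 tolerance and the sample character "P" are all assumptions made up for illustration, and the real library logic may well differ. As far as I know, the stripper behaviour itself is controlled by `PDFTextStripper.setSuppressDuplicateOverlappingText(boolean)`, but please check that against the version you are using.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapFilter {

    /** Hypothetical stand-in for one positioned piece of text (not a PDFBox class). */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Keeps a glyph only if no already-kept glyph has the same text
     * and lies within the given tolerance in both x and y.
     */
    static List<Glyph> dropOverlaps(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        // The same character drawn 12 times at the same position,
        // mimicking the repeated header reported in this thread.
        for (int i = 0; i < 12; i++) {
            glyphs.add(new Glyph("P", 472.89f, 54.0f));
        }
        glyphs.add(new Glyph("A", 100.0f, 54.0f));

        // An unfiltered listing would show 13 entries; the filtered one shows 2.
        System.out.println(glyphs.size() + " -> " + dropOverlaps(glyphs, 0.5f).size());
    }
}
```

Running `main` prints `13 -> 2`. A tool that lists every position without such a filter reports all 13 entries, which would match the 12-fold repetition seen in the PrintTextLocations output.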

