Hi. OK, I understand. Never mind :) Thanks.

On Mon, Nov 12, 2018 at 11:16 PM, Tilman Hausherr <[email protected]> wrote:
> On 12.11.2018 at 19:56, jorgeeflorez wrote:
> > Hi all,
> >
> > first, I want to thank Tilman for his effort in getting the text from a
> > page regardless of its rotation
> > (https://issues.apache.org/jira/browse/PDFBOX-4371).
> >
> > Second, I want to share with you a small application I created using C#.
> > It uses the ITextSharp library and a custom text extraction strategy to
> > get the text.
> >
> > Application: here
> > <https://drive.google.com/file/d/1CmKvkib_ONTytwaoIrrmMdVyICXO1IPd/view?usp=sharing>
> > Class that processes the text: here
> > <https://drive.google.com/file/d/1u3VykdQR8Eh9ooRiqxc4q2_20w3lw8gw/view?usp=sharing>
> > Sample PDF files: here
> > <https://drive.google.com/file/d/1KdpQEIEbIl5ZETq33C2X8JVM5qfMXlDg/view?usp=sharing>
> >
> > I was trying to port the code to Java and make it work using PDFBox
> > objects, but so far it has not been possible for me.
> >
> > Basically, the magic occurs in the method RenderText (based on other code
> > I found on a web page I don't remember :( ). It uses vectors (the origin
> > is the lower left corner of the page) to determine things like whether
> > there is a line break, or whether a whitespace must be put between glyphs.
> >
> > I just hope this code gives you some ideas to adjust or improve (if you
> > consider it necessary) text extraction.
>
> Hi,
>
> thanks, but sorry, there are several reasons why I can't use it:
> 1) I don't know itext,
> 2) I can't use code "found in a web page I don't remember" (license!),
> 3) I don't run exe files.
>
> I think our TextStripper code is similar in that it uses some algorithms to
> decide where to insert blanks, and whether glyphs are on a line or not.
>
> Tilman
>
> > That's it.
> >
> > Thank you.
> > Best Regards.
> >
> > Jorge Eduardo Flórez
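
For readers following the thread, the vector-based heuristic described in the quoted message (decide between "no separator", "space", and "line break" from glyph positions, independent of page rotation) can be sketched roughly as below. This is only an illustrative sketch: it is not the code from the linked C# class and not PDFBox's implementation. The `Glyph` class, the `separatorBetween` and `join` methods, and both thresholds (one space width across the baseline, half a space width along it) are assumptions made up for this example.

```java
import java.util.List;

public class GlyphSpacingSketch {

    /** Hypothetical glyph: baseline start/end in page space (origin at lower left). */
    static final class Glyph {
        final String text;
        final double startX, startY, endX, endY;
        final double spaceWidth; // assumed width of a space in this glyph's font

        Glyph(String text, double startX, double startY,
              double endX, double endY, double spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.startY = startY;
            this.endX = endX;
            this.endY = endY;
            this.spaceWidth = spaceWidth;
        }
    }

    enum Separator { NONE, SPACE, LINE_BREAK }

    /** Decides what to emit between two consecutive glyphs. */
    static Separator separatorBetween(Glyph prev, Glyph cur) {
        // Unit vector along the previous glyph's baseline (the writing direction).
        double dx = prev.endX - prev.startX;
        double dy = prev.endY - prev.startY;
        double len = Math.hypot(dx, dy);
        if (len == 0) {
            return Separator.NONE; // degenerate glyph, no direction to compare against
        }
        dx /= len;
        dy /= len;

        // Vector from the end of the previous glyph to the start of the current one.
        double gx = cur.startX - prev.endX;
        double gy = cur.startY - prev.endY;

        double along = gx * dx + gy * dy;            // dot product: gap along the baseline
        double across = Math.abs(gx * dy - gy * dx); // cross product: distance off the baseline

        // Assumed thresholds, chosen only for illustration.
        if (across > prev.spaceWidth) {
            return Separator.LINE_BREAK;
        }
        if (along > prev.spaceWidth / 2) {
            return Separator.SPACE;
        }
        return Separator.NONE;
    }

    /** Joins glyphs in the given order, inserting spaces and line breaks. */
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                Separator s = separatorBetween(prev, g);
                if (s == Separator.SPACE) {
                    sb.append(' ');
                } else if (s == Separator.LINE_BREAK) {
                    sb.append('\n');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

Because the gap is measured relative to the previous glyph's own baseline direction, the same test applies to text running left to right and to text on a rotated page. The sketch assumes the glyphs arrive in reading order and does not handle cases such as overlapping glyphs or text that restarts on the same baseline.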

