Hi Manuel, I'm sorry for my mistake and many thanks for your help and attention.
The best tool I know for extracting text from a PDF while preserving the correct layout (I didn't test Monarch) is part of a CAAT software package: Caseware IDEA. However, this software is very expensive and does a lot of other things. All the other tools that I tested (and I tested several) get the positioning analysis wrong. It would be good to develop a tool that produces results similar to those obtained with IDEA. The work that you developed can help others achieve that result.

Paulo

-----Original Message-----
From: Manuel Aristarán [mailto:[email protected]] On behalf of Manuel Aristarán
Sent: Wednesday, December 28, 2016 20:37
To: [email protected]
Subject: Re: Identify not visible characters - Overlapped characters

Hi Paulo,

> On Dec 28, 2016, at 9:52 AM, [email protected] wrote:
>
> Unfortunately, Tabula uses a totally different approach (image
> analysis) [...]

Sorry for going (sort of) off-topic, but that's not correct. In fact, Tabula does not support images. Thanks to PDFBox, it "mines" text and graphical elements, and uses a set of heuristics that attempt to reconstruct a tabular structure.

> Tabula also do incoherent analysis when a table is larger than one
> page, for that reason Tabula is far from being a good tool for text
> extraction with correct positioning.

We always welcome bug reports (and patches!) :) [1]

Thanks!

[1] https://github.com/tabulapdf/tabula-java/issues

--
Manuel Aristarán <[email protected]>
http://jazzido.com

