Hi Manuel,

I'm sorry for my mistake and many thanks for your help and attention.

The best tool I know of for extracting text from a PDF while maintaining the
correct layout (I haven't tested Monarch) is part of a CAAT (computer-assisted
audit techniques) package: Caseware IDEA. However, this software is very
expensive and does a lot of other things.

All the other tools that I tested (and I tested several) get the positioning
analysis wrong.

It would be good to develop a tool that produces results similar to those
obtained with IDEA.

The work that you developed can help others to achieve that result.

Paulo
-----Original Message-----
From: Manuel Aristarán [mailto:[email protected]] On behalf of Manuel Aristarán
Sent: Wednesday, December 28, 2016 20:37
To: [email protected]
Subject: Re: Identify not visible characters - Overlapped characters

Hi Paulo,

> On Dec 28, 2016, at 9:52 AM, [email protected] wrote:
> 
> Unfortunately, Tabula uses a totally different approach (image 
> analysis) [...]

Sorry for going (sort of) off-topic, but that's not correct. In fact, Tabula 
does not support images. Thanks to PDFBox, it "mines" text and graphical 
elements, and uses a set of heuristics that attempt to reconstruct a tabular 
structure.

> Tabula also does incoherent analysis when a table is larger than one 
> page; for that reason Tabula is far from being a good tool for text 
> extraction with correct positioning.

We always welcome bug reports (and patches!) :) [1]

Thanks!

[1] https://github.com/tabulapdf/tabula-java/issues


—
Manuel Aristarán <[email protected]>
http://jazzido.com
