Albrecht,

thanks for this perfect description:
> PDF is just a "visually appealing" GUI.
We laughed heartily.

The name "Portable Document Format" is also inaccurate in another respect: 
PDF is really a "Portable *Pages* Format", because the page is its basic unit.
This follows from the original purpose of PDF as a "data format" for 
prepress, and prepress is all about pages.
For our extensive "PDF to Solr" project we are now taking a different approach.
We preprocess the PDFs of our "data suppliers" with a very good commercial 
Windows software package so that we get a separate PDF file 
and a good text file for each page. "Good text file" means that the OCR 
checks the page formatting in the PDF file (paragraphs, boxes) only 
minimally and makes the text really usable with the help of dictionaries and 
perhaps some magic.
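
A minimal sketch of how such per-page text files could then be turned into Solr documents, one per page. The file naming scheme (`<document>_p<page>.txt`), the field names and both function names are assumptions made for illustration, not the setup described above.

```python
import json
import re
from pathlib import Path

# Assumed naming scheme: <document>_p<page number>.txt, e.g. report_p0003.txt
PAGE_FILE = re.compile(r"^(?P<doc>.+)_p(?P<page>\d+)$")


def build_page_docs(text_dir):
    """Return one dict per page text file found in text_dir."""
    docs = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        match = PAGE_FILE.match(path.stem)
        if match is None:
            continue  # skip files that do not follow the assumed naming scheme
        page = int(match.group("page"))
        docs.append({
            "id": f"{match.group('doc')}#{page}",   # hypothetical id layout
            "source_pdf": f"{path.stem}.pdf",       # the matching single-page PDF
            "page": page,
            "text": path.read_text(encoding="utf-8").strip(),
        })
    return docs


def to_update_payload(docs):
    """Serialize the documents as a JSON array of objects."""
    return json.dumps(docs, ensure_ascii=False, indent=2)
```

The resulting JSON array could be sent to a Solr core's `/update` handler with content type `application/json`, provided the schema defines matching fields.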

Best
Walter Claassen
[email protected]

PS: Your first and last names sound German. Mine too.




"Albretch Mueller" <[email protected]> wrote on 18.07.2022 11:04:07:

> From: "Albretch Mueller" <[email protected]>
> To: [email protected], [email protected]
> Date: 18.07.2022 11:05
> Subject: from pdf to some sort of XMLish ODT kind of file ...
> 
>  it is in its name: https://en.wikipedia.org/wiki/PDF
>  but, as a corpora researcher, I have always wondered what exactly are
> the "portable", "document" and "format" aspects of it.  PDF is just a
> "visually appealing" GUI.
> 
>  The processes of conversion of the different kinds of PDFs to text is
> not exactly straightforward, it is way too entropic (too much of the
> necessary "information" to do the conversion is lost). Some pdf files
> are image-based (no text at all), some are image-based, but include
> (some of) the text, some of the image-based pdf files also contain
> images, ...
> 
>  Do you know of any kind of prior art studying and/or explaining
> possible solutions to these kinds of pdf to xmlish text conversion
> problems? Any suggestion about how you would approach a solution to
> them?
> 
>  Thank you,
>  lbrtchx
