Re: Extract text and images into html version of pdf

Tilman Hausherr Fri, 16 May 2025 04:06:40 -0700

Hi,

PDFBox doesn't offer much for this. Apache Tika (which uses PDFBox) has better support for tables.


Tilman

On 16.05.2025 12:47, Mathias Hultman wrote:

Hi!

I am trying to get PDFBox to convert a number of PDFs into an HTML version. It
should, as far as possible, look like the PDF, with the structure of
images, tables, and text intact. But I'm running into problems when trying to
accomplish this, and I find the documentation somewhat lacking.
I've managed to extract all the text from the PDF, and I've managed to extract all
the images by extending PDFStreamEngine. Now I want to 'merge' these two into the
same application, taking into account where the pictures are placed
relative to the text. Can anyone please help me out?

Regards,
Mathias Hultman
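
One possible way to approach the 'merge' step described above, as a minimal
sketch rather than anything PDFBox provides: collect every text run and every
image together with its page position, sort them top to bottom, and emit
absolutely positioned HTML. The class PageHtmlMerger, its Item holder, and the
helper topFromPdfY are hypothetical names made up for this illustration, and
the code deliberately uses no PDFBox classes. It assumes the positions were
already gathered elsewhere (for example text positions from a PDFTextStripper
subclass and image positions from the current transformation matrix in a
PDFStreamEngine subclass) and converted to a top-left origin in points.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: merges positioned text and images into HTML.
class PageHtmlMerger {

    /** One positioned item: either a text run or an image reference (top-left origin, points). */
    static final class Item {
        final double x, y, width, height;
        final String text;     // non-null for a text run
        final String imageSrc; // non-null for an image

        private Item(double x, double y, double width, double height, String text, String imageSrc) {
            this.x = x; this.y = y; this.width = width; this.height = height;
            this.text = text; this.imageSrc = imageSrc;
        }

        static Item text(double x, double y, String text) {
            return new Item(x, y, 0, 0, text, null);
        }

        static Item image(double x, double y, double width, double height, String src) {
            return new Item(x, y, width, height, null, src);
        }
    }

    /** PDF user space has its origin at the bottom left; HTML at the top left. */
    static double topFromPdfY(double pageHeight, double pdfY, double itemHeight) {
        return pageHeight - pdfY - itemHeight;
    }

    /** Emits one absolutely positioned element per item, ordered top to bottom, then left to right. */
    static String toHtml(List<Item> items, double pageWidth, double pageHeight) {
        List<Item> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingDouble((Item i) -> i.y).thenComparingDouble(i -> i.x));

        StringBuilder sb = new StringBuilder();
        sb.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt\">%n", pageWidth, pageHeight));
        for (Item i : sorted) {
            String style = String.format(Locale.ROOT,
                    "position:absolute;left:%.1fpt;top:%.1fpt", i.x, i.y);
            if (i.imageSrc != null) {
                sb.append(String.format(Locale.ROOT,
                        "  <img src=\"%s\" style=\"%s;width:%.1fpt;height:%.1fpt\">%n",
                        escape(i.imageSrc), style, i.width, i.height));
            } else {
                sb.append(String.format("  <span style=\"%s\">%s</span>%n", style, escape(i.text)));
            }
        }
        sb.append("</div>\n");
        return sb.toString();
    }

    /** Minimal HTML escaping for text content and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Sorting by the top coordinate keeps the HTML source in rough reading order, and
the absolute positioning preserves the page layout. Tables are not handled by
this sketch at all.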



