[tesseract-ocr] Bold, Italic and Tesseract 3

Iain Downs Sat, 16 Nov 2024 03:57:50 -0800

I'm writing a program to convert TIFF images of books to ePubs.  I have a 
bunch (4000) of book images which I converted in the 2000s with FineReader.  
I want to improve the results and am too cheap to buy an updated program.  
Plus, it's fun.

Tesseract looks like it gives equal or better results than my original
system. However, the current incarnation does not support bold or italic,
which is important, though arguably not essential.

The most recent discussion I could find on this was from 2022
<https://github.com/sirfz/tesserocr/issues/292>. A bit more informative is
this issue <https://github.com/tesseract-ocr/tesseract/issues/1074>.

Basically, the latter says (in a comment from theraysmith) that the
information for bold and italic (at least) is available at some level in
the code hierarchy, but would need some work to expose - or at least this
is how I interpreted it. There was some indication that this would be
desirable, but I'm not sure it's on your roadmap.

If it is, do you know when? If not, could it be added? If no to that, is
it possible to run both Version 3 and Version 5 recognition?

My concern with the latter is that it appears that version 3 paths are
explicitly commented out in V5 through a #define. This #define seems to be
generated early in the compilation process by some Linuxy tools that are
well beyond my (limited) Linux experience. How could I generate a library
/ set of DLLs which would allow me to run both recognisers (one after the
other, probably, and then pick the 'best' result)?
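
To make the "run both and pick the best" idea concrete, here is a minimal
sketch in Python that drives the tesseract command line rather than the
library. It rests on several assumptions: that a tesseract binary built with
the legacy engine enabled is on the PATH, that the installed traineddata
includes a legacy model, and that mean word confidence is a fair way to
compare the two engines (their confidences may not be directly comparable).
The helper names (build_command, mean_confidence, run_engine, pick_best) are
hypothetical, made up for this illustration.

```python
import csv
import io
import subprocess

# Values of the tesseract CLI --oem option: 0 = legacy only, 1 = LSTM only.
LEGACY, LSTM = 0, 1


def build_command(image_path, oem, lang="eng"):
    """Build a tesseract command line that writes TSV output to stdout."""
    return ["tesseract", image_path, "stdout",
            "--oem", str(oem), "-l", lang, "tsv"]


def mean_confidence(tsv_text):
    """Average the 'conf' column over word rows (level 5) of TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confs = [float(row["conf"]) for row in reader
             if row["level"] == "5" and (row.get("text") or "").strip()]
    return sum(confs) / len(confs) if confs else 0.0


def run_engine(image_path, oem, lang="eng"):
    """Run one engine on one image and return its TSV output as text.

    Assumption: the binary supports the requested engine; otherwise
    tesseract exits with an error and check=True raises.
    """
    result = subprocess.run(build_command(image_path, oem, lang),
                            capture_output=True, text=True, check=True)
    return result.stdout


def pick_best(tsv_by_engine):
    """Return the key whose TSV output has the highest mean confidence."""
    return max(tsv_by_engine,
               key=lambda name: mean_confidence(tsv_by_engine[name]))
```

Usage would be along the lines of
pick_best({"legacy": run_engine("page.tif", LEGACY), "lstm": run_engine("page.tif", LSTM)}),
where page.tif is a placeholder file name. Choosing per word or per line
rather than per page would need the bounding boxes from the same TSV rows.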

Hope this makes sense, and thanks in advance

Iain
