I'm writing a program to convert TIFF images of books to ePubs.  I have a 
bunch (4000) of book images which I converted in the 2000s with FineReader.  
I want to improve the results and am too cheap to buy an updated program.  
Plus, it's fun.

Tesseract looks like it gives results equal to or better than my original 
system.  However, the current incarnation does not report bold or italic, 
which is important, though arguably not essential.

The most recent discussion I could find on this was from 2022: 
<https://github.com/sirfz/tesserocr/issues/292>.  A bit more informative is 
<https://github.com/tesseract-ocr/tesseract/issues/1074>.

Basically, the latter says that the information for bold and italic (at 
least) is available at some level in the code hierarchy, but would need 
some work to expose (according to theraysmith), or at least that is how I 
interpreted it.  There was some indication that this would be desirable, 
but I'm not sure it's on your roadmap.

If it is, do you know when?  If not, could it be added?  If the answer to 
that is no, is it possible to run both Version 3 and Version 5 recognition?

My concern with the latter is that the version 3 code paths appear to be 
explicitly compiled out in V5 through a #define.  This #define seems to be 
generated early in the compilation process by some Linuxy tools that are 
well beyond my (limited) Linux experience.  How could I generate a library 
or set of DLLs which would allow me to run both recognisers (probably one 
after the other, and then pick the 'best' result)?
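
To make the "run both" idea concrete, here is a rough sketch in Python of 
the merge step I have in mind.  It assumes I can get word-level bounding 
boxes from both recognisers and bold/italic flags from the version 3 one.  
The Word class, merge_styles, overlap_ratio and the 0.5 overlap threshold 
are all my own placeholders for illustration, not anything that exists in 
Tesseract.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class Word:
    """One recognised word; a hypothetical container, not a Tesseract type."""
    text: str
    box: Box
    confidence: float = 0.0
    bold: bool = False
    italic: bool = False


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if right <= left or bottom <= top:
        return 0.0
    inter = (right - left) * (bottom - top)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def merge_styles(new_words: List[Word], old_words: List[Word],
                 min_overlap: float = 0.5) -> List[Word]:
    """Keep the newer engine's text, but copy bold/italic from the
    best-overlapping older-engine word when the boxes agree well enough."""
    merged = []
    for word in new_words:
        best = max(old_words,
                   key=lambda old: overlap_ratio(word.box, old.box),
                   default=None)
        if best is not None and overlap_ratio(word.box, best.box) >= min_overlap:
            merged.append(replace(word, bold=best.bold, italic=best.italic))
        else:
            # No trustworthy match: leave the word unstyled.
            merged.append(word)
    return merged
```

The text would always come from the newer recogniser, and the older one 
would only contribute style flags, so a misread word in the older output 
would not matter as long as its box lines up.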

Hope this makes sense, and thanks in advance

Iain
