I have a problem with the current Tesseract. My documents have sections with varying background and text colors. I've read that Tesseract v3 was white/black invariant, so it didn't matter if I had white text on a red background. Now it matters. The problem is that other parts of the same image are black text on a white background. Tesseract 4 fails to identify the white text on the red background at all.
I have tried inverting the image colors so that red (0xFF0000) becomes cyan (0x00FFFF) and the white text (0xFFFFFF) becomes black (0x000000). I then take the highest-confidence text for the region. This improves some scenarios, but it does not work for the red/white scenario.

Questions:
1. How can I convert the image so that the text is black and the background is white before running Tesseract?
2. Is there a way to configure Tesseract to "just work"?

I've been trying to figure this out for some time and haven't made any progress.
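A minimal sketch of the invert-and-compare attempt described above, in case it helps to see the steps spelled out. It is not the actual code used: the `Image` type is a plain list of RGB tuples standing in for a real image, and `ocr` is a hypothetical callable that returns `(text, confidence)` for a region, so any real Tesseract wrapper would have to be adapted to that shape.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]      # (R, G, B), each 0..255
Image = List[List[Pixel]]         # rows of pixels; stand-in for a real image type
OcrResult = Tuple[str, float]     # (recognized text, confidence)


def invert_pixel(p: Pixel) -> Pixel:
    """Invert one RGB pixel, e.g. red 0xFF0000 -> cyan 0x00FFFF."""
    r, g, b = p
    return (255 - r, 255 - g, 255 - b)


def invert_image(img: Image) -> Image:
    """Return a color-inverted copy of the image."""
    return [[invert_pixel(p) for p in row] for row in img]


def best_of_both_polarities(
    region: Image,
    ocr: Callable[[Image], OcrResult],  # hypothetical OCR wrapper
) -> OcrResult:
    """Run OCR on the region and on its inverse, keep the higher confidence."""
    candidates = [ocr(region), ocr(invert_image(region))]
    return max(candidates, key=lambda result: result[1])
```

The function is called once per region, so each region can settle on a different polarity. As noted above, this still does not recover the white-on-red text.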

