[tesseract-ocr] How to extract text for processing by tesseract v4?

Jason Tue, 07 May 2019 18:08:35 -0700

I have a problem with Tesseract 4. My documents have sections with varying 
background and text colors. I've read that Tesseract v3 was white/black 
invariant, so it didn't matter if I had white text on a red background. But 
in v4 it matters. The problem is that other parts of the same image are 
black text on a white background. Tesseract 4 fails to identify the white 
text on the red background at all.


I have tried inverting the image colors so that red (0xFF0000) becomes cyan 
(0x00FFFF) and the white text (0xFFFFFF) becomes black (0x000000). I then 
take the highest-confidence text for the region. This improves some 
scenarios, but it does not work for the red/white scenario.
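
To make the attempt above concrete, here is a minimal sketch of the two steps: the per-pixel inversion and the "keep the higher-confidence result" selection. The function names and the (text, confidence) tuples are assumptions made for illustration; they are not part of the Tesseract API, and the confidence values would come from whatever OCR call is used on each version of the region.

```python
def invert_rgb(color: int) -> int:
    """Invert a 24-bit 0xRRGGBB color by flipping all bits of every channel."""
    return color ^ 0xFFFFFF


def pick_best(candidates):
    """Return the (text, confidence) pair with the highest confidence.

    `candidates` is a hypothetical list of OCR results for one region,
    for example one from the original image and one from the inverted image.
    """
    return max(candidates, key=lambda result: result[1])


# Red becomes cyan and white becomes black, as described above.
red_inverted = invert_rgb(0xFF0000)
white_inverted = invert_rgb(0xFFFFFF)

# Made-up results for one region, only to show the selection step.
best = pick_best([("T0tal", 41.0), ("Total", 87.5)])
```

Note that the inversion only swaps polarity; it leaves a colored background colored (red becomes cyan), so it does not by itself produce a clean white background.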

Questions:
1. How can I convert the text to black and the background to white before 
passing the image to tesseract? 
2. Is there a way to configure tesseract to "just work"?
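
One way to picture what question 1 is asking for is a per-region polarity normalization: convert the region to grayscale, threshold it, and flip the result when the background turns out to be the darker class. The sketch below is pure Python on a list of (r, g, b) rows and rests on several assumptions: the region has already been cropped so that it contains a single color scheme, the text covers fewer pixels than the background, and a simple midpoint threshold is good enough. None of this is a Tesseract feature, and the function names are made up for illustration.

```python
def luminance(rgb):
    """Approximate brightness of an (r, g, b) pixel using BT.601 luma weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize_dark_on_light(pixels):
    """Turn one region into black text (0) on a white background (255).

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    Assumption: the text occupies fewer pixels than the background.
    """
    gray = [[luminance(p) for p in row] for row in pixels]
    flat = [v for row in gray for v in row]
    # Midpoint between the darkest and brightest value; a simple stand-in
    # for a proper threshold selection method.
    threshold = (min(flat) + max(flat)) / 2
    dark_count = sum(1 for v in flat if v < threshold)
    # If dark pixels are the minority, the text is dark; otherwise it is light.
    text_is_dark = dark_count <= len(flat) - dark_count
    return [
        [0 if (v < threshold) == text_is_dark else 255 for v in row]
        for row in gray
    ]


RED, WHITE, BLACK = (255, 0, 0), (255, 255, 255), (0, 0, 0)

# A 3x3 red region with one white "text" pixel in the middle.
white_on_red = [[RED, RED, RED], [RED, WHITE, RED], [RED, RED, RED]]
# A 3x3 white region with one black "text" pixel in the middle.
black_on_white = [[WHITE, WHITE, WHITE], [WHITE, BLACK, WHITE], [WHITE, WHITE, WHITE]]

normalized_a = binarize_dark_on_light(white_on_red)
normalized_b = binarize_dark_on_light(black_on_white)
```

Both sample regions come out with the same polarity, which is the property the question is after. For an image that mixes both schemes, the function would have to be applied to each region separately, and how to find those regions is left open here.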

I've been trying to figure out how to do this for some time, and I haven't 
made any progress.



