Before processing a document, we want to reject any that are CJK, so I am
using Tesseract to detect them. It does a pretty good job, but when the
document quality is low, most of the dots on a "Table of Contents" page
are recognized as CJK characters. I am planning to create my own
training data, but wanted to get advice from the experts first.
*Config:*
- Tesseract 4.0
- instance.setLanguage("chi_simB+chi_traB+korB+jpnB+engB");
- instance.setOcrEngineMode(1);
The image is zoomed to 600% in Adobe PDF reader.
Please let me know if you have any suggestions.
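
As a possible stopgap before retraining, the rejection decision could be based on the share of CJK letters across the whole page rather than on the presence of any CJK characters, so that a mostly Latin page with some misread dots stays below the cutoff. This is only a minimal sketch, assuming the OCR output is already available as a String. The class name, method names and the threshold are hypothetical and are not part of Tesseract or Tess4J; only standard JDK calls are used.

```java
// Hypothetical helper: not part of Tesseract or Tess4J.
final class CjkRatio {

    // True when the code point belongs to a script used for Chinese, Japanese or Korean text.
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA
                || script == Character.UnicodeScript.HANGUL;
    }

    // Fraction of letters in the text that are CJK; 0.0 when the text contains no letters.
    static double cjkRatio(String text) {
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0;
        }
        long cjk = text.codePoints()
                .filter(Character::isLetter)
                .filter(CjkRatio::isCjk)
                .count();
        return (double) cjk / letters;
    }

    // Assumed decision rule: treat the page as CJK when the ratio reaches the threshold.
    static boolean looksCjk(String text, double threshold) {
        return cjkRatio(text) >= threshold;
    }
}
```

For example, `CjkRatio.looksCjk(ocrText, 0.5)` would reject a page only when at least half of its recognized letters are CJK. The threshold of 0.5 is an arbitrary starting point and would need tuning on real documents.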