Hi Alexander,

Tweaking Tesseract parameters won't help you at all. Preprocessing will:
you need to remove as much of the graphics as possible, leaving only the
text. The major steps required for this are:
1. Threshold the image so that all shades of gray become black
2. Label connected components (CCs)
3. Erase CCs that are too big in the X or Y direction, or both (bigger
than an average character). This leaves only the text
4. Crop the regions containing dense text
5. Process these regions one by one with Tesseract to produce the final results

This can be done with, e.g., ImageMagick and shell/batch scripting. I can
show you how if you're interested. For some clues, see my post in this
thread:
https://groups.google.com/forum/#!msg/tesseract-ocr/STHaLGYsiCo/pCT2kxMgwI8J
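
To make steps 1 to 3 above concrete, here is a minimal sketch in plain Python
on a tiny hand-made grayscale grid. It is not the ImageMagick approach
mentioned above, only an illustration of the idea. The function names, the
threshold cutoff of 200 and the 3x3 size limit are assumptions chosen for the
toy image; on a real image the size limit would be derived from the typical
character size.

```python
from collections import deque

def threshold(gray, cutoff=200):
    """Step 1: every pixel darker than `cutoff` (an assumed value) becomes ink."""
    return [[px < cutoff for px in row] for row in gray]

def label_components(ink):
    """Step 2: collect 8-connected components as lists of (x, y) pixels."""
    h = len(ink)
    w = len(ink[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if not ink[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            pixels = []
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and ink[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            comps.append(pixels)
    return comps

def erase_large(ink, comps, max_w, max_h):
    """Step 3: erase components whose bounding box exceeds max_w or max_h."""
    out = [row[:] for row in ink]
    for pixels in comps:
        xs = [p[0] for p in pixels]
        ys = [p[1] for p in pixels]
        too_wide = max(xs) - min(xs) + 1 > max_w
        too_tall = max(ys) - min(ys) + 1 > max_h
        if too_wide or too_tall:
            for x, y in pixels:
                out[y][x] = False
    return out

# Toy image: a 10-pixel line (a "graphic") and a 2x2 gray/black blob (a "character").
W = 255
gray = [
    [0] * 10,
    [W] * 10,
    [W, 128, 128, W, W, W, W, W, W, W],
    [W, 128, 0, W, W, W, W, W, W, W],
]
ink = threshold(gray)
comps = label_components(ink)
# Assumed limit: a character fits in a 3x3 box in this toy image.
text_only = erase_large(ink, comps, max_w=3, max_h=3)
```

Here the long line is erased because it is wider than the assumed character
size, while the small blob survives. Steps 4 and 5 (cropping the dense text
regions and running Tesseract on each) would operate on the result.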

Best regards,
Dmitri Silaev
www.CustomOCR.com





On Mon, Apr 27, 2015 at 9:34 PM, Alexander Pico <[email protected]>
wrote:

> I am trying to identify the molecules from pathway images. This should be
> relatively simple from clear, high-res images like the one attached, but my
> attempts with Tesseract so far are pretty dismal...
>
> It found 9 of 25 molecules. I even have the luxury of knowing in advance
> all the words I'd like to extract and tried supplying these as eng.user-words,
> but there was no improvement.
>
> I suspect I need to find the magic combination of parameter settings or
> perhaps image pre-processing.  Any suggestions?
>
> Thanks!
>  - Alex
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ff5a2873-8392-4771-b314-3f2f146b0027%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/ff5a2873-8392-4771-b314-3f2f146b0027%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
