Hi Alexander,

Tweaking Tesseract parameters won't help you at all. Preprocessing, yes: you'd need to remove as much of the graphics as possible, leaving text only. Major steps required for this:

1. Threshold the image so that all shades of gray become black
2. Label connected components (CCs)
3. Erase CCs that are too big in the X direction, the Y direction, or both (bigger than an average character). This will leave only text
4. Crop regions containing dense text
5. Process these regions one by one with Tesseract to produce the final results
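As a rough illustration of steps 1 to 3, here is a minimal sketch in plain Python that works on a grayscale image given as a list of rows of 0-255 values. It is not the ImageMagick approach mentioned in this message. The function names (`threshold`, `label_components`, `erase_large`), the `cutoff` default, and the choice of 8-connectivity are assumptions made for the example. The size limits `max_w` and `max_h` stand in for "bigger than an average character" and would need tuning per image.

```python
from collections import deque


def threshold(gray, cutoff=255):
    """Step 1: every pixel darker than `cutoff` becomes ink (1), the rest background (0)."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def label_components(binary):
    """Step 2: 8-connected flood fill; returns one list of (row, col) pixels per component."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not binary[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit all eight neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def erase_large(binary, max_w, max_h):
    """Step 3: blank every component whose bounding box is wider than max_w or taller than max_h."""
    out = [row[:] for row in binary]
    for pixels in label_components(binary):
        ys = [y for y, _ in pixels]
        xs = [x for _, x in pixels]
        box_w = max(xs) - min(xs) + 1
        box_h = max(ys) - min(ys) + 1
        if box_w > max_w or box_h > max_h:
            for y, x in pixels:
                out[y][x] = 0
    return out
```

Calling `erase_large(threshold(gray), max_w, max_h)` returns a binary image in which only character-sized components remain. Cropping the dense text regions and passing them to Tesseract (steps 4 and 5) are not covered here.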
This can be done e.g. with ImageMagick and shell/batch scripting. I can show how if you're interested. For some clues on that, see my post in this thread:
https://groups.google.com/forum/#!msg/tesseract-ocr/STHaLGYsiCo/pCT2kxMgwI8J

Best regards,
Dmitri Silaev
www.CustomOCR.com

On Mon, Apr 27, 2015 at 9:34 PM, Alexander Pico <[email protected]> wrote:
> I am trying to identify the molecules from pathway images. This should be
> relatively simple from clear, high-res images like the one attached, but my
> attempts with Tesseract so far are pretty dismal...
>
> It found 9 of 25 molecules. I even have the luxury of knowing in advance
> all the words I'd like to extract and tried supplying these as eng.user-words,
> but there was no improvement.
>
> I suspect I need to find the magic combination of parameter settings or
> perhaps image pre-processing. Any suggestions?
>
> Thanks!
> - Alex
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ff5a2873-8392-4771-b314-3f2f146b0027%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/ff5a2873-8392-4771-b314-3f2f146b0027%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.

