Hi All,

I'm using Tesseract 3.02.02 on a Windows 7 computer, via the gImageReader GUI front-end (so I don't have to go into the black stuff, MS-DOS). It works well, except for the same problem as everyone else: the character sequences fi and fl are replaced by the Unicode characters U+FB01 and U+FB02, the Latin small ligatures fi and fl.

The solution in a few other threads is to put a blacklist in the config file, but I've tried and not succeeded. How do you actually do that on Windows?

Firstly: there is no config file, as such. Tesseract is not "installed", but has its files copied across to the directory:

C:\Users\rob\AppData\Local\Tesseract-OCR

Deeper down there are 3 more directories:

1. C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata
   which has the files:
   eng.traineddata
   eng.cube.fold
   eng.cube.lm_
   eng.cube.word-freq
   eng.cube.size
   eng.cube.nn
   eng.cube.params
   eng.cube.bigrams
   eng.cube.lm
   eng.tesseract_cube.nn
   osd.traineddata
   plus 2 directories:

2. C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\configs
   which has the files:
   ambigs.train
   api_config
   bigram
   box.train
   box.train.stderr
   digits
   hocr
   inter
   kannada
   linebox
   logfile
   makebox
   quiet
   rebox
   strokewidth
   unlv

3. C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\tessconfigs
   which has the files:
   batch
   batch.nochop
   matdemo
   msdemo
   nobatch
   segdemo

Is one of these the "configuration" file I need to edit?

Note also that the standard Windows editor is Notepad, which gives you the option to save text as ANSI, UTF-8, Unicode or Unicode big-endian. Which is the correct one to use? ANSI is the default, but it won't allow you to save the ligatures, so it must be one of the others. I've tried them all, both editing existing files and adding new files. It always failed.

More info: I know nothing about programming and have no compiler on my computer. I downloaded working executables from sourceforge or github or googlecode or somewhere, and managed to get them going without too much fuss by following the instructions. I never did any training of Tesseract; it came already trained, presumably. But I can't find any simple configuration instructions to follow to get rid of the Latin fi and fl ligatures by editing Windows files. And I want to get rid of them: convert each to two standard English letters so I can save the files as English text.

Any help appreciated,
Regards,
Rob
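
One way to reach the stated goal without touching any Tesseract config file is to fix the text after OCR has run. The following is only a minimal sketch of that idea, not a Tesseract or gImageReader feature. It assumes Python 3 is available and that the OCR output was saved as a UTF-8 text file; the function names and the file names in the usage comment are hypothetical.

```python
# Map each ligature code point to the two plain letters it stands for.
# U+FB01 is LATIN SMALL LIGATURE FI, U+FB02 is LATIN SMALL LIGATURE FL.
LIGATURES = {
    "\ufb01": "fi",
    "\ufb02": "fl",
}


def expand_ligatures(text):
    """Return text with the fi and fl ligatures replaced by plain letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def expand_file(src_path, dst_path):
    """Read a UTF-8 text file, expand the ligatures, write the result as UTF-8."""
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(expand_ligatures(text))


# Hypothetical usage, assuming the OCR result was saved as ocr_output.txt:
# expand_file("ocr_output.txt", "ocr_output_plain.txt")
```

Only the two ligatures named in this thread are replaced; every other character is left exactly as Tesseract produced it.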

