Hi All,
I'm using Tesseract 3.02.02 on a Windows 7 computer, via the gImageReader GUI 
front-end (so I don't have to go into the black stuff, MS-DOS).
It works well, except for the same problem as everyone else: the character 
sequences fi and fl are replaced by the Unicode characters U+FB01 and U+FB02, 
the Latin small ligatures fi and fl.

The solution given in a few other threads is to put a blacklist in the config 
file, but I've tried that and not succeeded. How do you actually do that on 
Windows?

Firstly: there is no config file as such. Tesseract is not "installed"; 
its files are simply copied across to the directory:
C:\Users\rob\AppData\Local\Tesseract-OCR

Deeper down there are 3 more directories:

1.    C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata
which has the files:
eng.traineddata
eng.cube.fold
eng.cube.lm_
eng.cube.word-freq
eng.cube.size
eng.cube.nn
eng.cube.params
eng.cube.bigrams
eng.cube.lm
eng.tesseract_cube.nn
osd.traineddata

plus 2 directories:

2.     C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\configs
which has the files:
ambigs.train
api_config
bigram
box.train
box.train.stderr
digits
hocr
inter
kannada
linebox
logfile
makebox
quiet
rebox
strokewidth
unlv

3.    C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\tessconfigs
which has the files:
batch
batch.nochop
matdemo
msdemo
nobatch
segdemo


Is one of these the "configuration" file I need to edit?

Note also that the standard Windows editor is Notepad, which gives you the 
option to save text as ANSI, UTF-8, Unicode or Unicode big-endian. Which is 
the correct one to use? ANSI is the default, but it won't allow you to save 
the ligatures, so it must be one of the others. I've tried them all, both 
editing existing files and adding new files. It always failed.
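
For anyone comparing notes on the encoding question: a Tesseract config file 
is a plain-text file of "variable value" lines, and Windows 7 Notepad's UTF-8 
option puts a byte-order mark at the start of the file, which may be what 
stops the first line from being recognised. Below is a minimal Python 3 
sketch that writes such a file as UTF-8 without that mark. It assumes the 
blacklist mentioned in the other threads is the tessedit_char_blacklist 
variable; the file name "noligatures" and the function name are made up for 
illustration, and whether Tesseract 3.02.02 or gImageReader then honours the 
file is not something this sketch can confirm.

```python
from pathlib import Path

# Hypothetical config file name, chosen for this example only.
CONFIG_NAME = "noligatures"


def write_blacklist_config(directory, name=CONFIG_NAME, chars="\ufb01\ufb02"):
    """Write a one-line config file that blacklists the given characters.

    Assumption: tessedit_char_blacklist is the variable the other
    threads mean by "blacklist".
    """
    path = Path(directory) / name
    line = "tessedit_char_blacklist " + chars + "\n"
    # encoding="utf-8" writes no byte-order mark (unlike "utf-8-sig");
    # newline="\n" keeps a plain LF line ending, even on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(line)
    return path


# Example call, using the configs directory listed earlier in this post:
# write_blacklist_config(r"C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\configs")
```

On the command line, a config file in the configs directory is named as the 
last argument, e.g. "tesseract page.tif page -l eng noligatures" (page.tif 
and page are placeholder names). How a GUI front-end passes it on is a 
separate question.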


More info: I know nothing about programming and have no compiler on my 
computer. I downloaded working executables from SourceForge or GitHub or 
Google Code or somewhere, and managed to get them going without too much fuss 
by following the instructions.
I never did any training of Tesseract. It came already trained, presumably.

But I can't find any simple configuration instructions to follow to get rid 
of the Latin fi and fl ligatures by editing Windows files. And I want to 
get rid of them: convert each to two standard English letters so the files 
can be saved as plain English text.
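
Failing a configuration fix, the conversion described above could also be 
done after OCR, on the saved text. A minimal Python 3 sketch follows, assuming 
Python is available and the OCR output was saved as UTF-8; the function names 
are illustrative only and have nothing to do with Tesseract or gImageReader.

```python
# Map each ligature code point to the two plain letters it stands for.
LIGATURES = str.maketrans({"\ufb01": "fi", "\ufb02": "fl"})


def expand_ligatures(text):
    """Return text with U+FB01 and U+FB02 replaced by 'fi' and 'fl'."""
    return text.translate(LIGATURES)


def clean_file(src, dst):
    """Read OCR output from src (assumed UTF-8) and write cleaned text to dst."""
    with open(src, encoding="utf-8") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(expand_ligatures(text))
```

For example, clean_file("page.txt", "page_clean.txt") would write a copy of 
page.txt with the two ligatures spelled out (both file names are 
placeholders).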

Any help appreciated,
Regards,
Rob


