Below is a bug report that I'm considering filing. However, I'm not 
entirely sure that it's a bug, and I'd like someone who knows more about 
this to check it first so that I'm not wasting anyone's time.

The following is the bug report that I'll post if you think it's right.


### Environment

* **Tesseract Version**:
tesseract 4.1.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.0.3
Found AVX2
Found AVX
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.3
* **Commit Number**: 
Installed with pacman from the official Arch repository (not the AUR)
* **Platform**: Linux NickArch 5.4.3-arch1-1 #1 SMP PREEMPT Fri, 13 Dec 2019 09:39:02 +0000 x86_64 GNU/Linux

### Current Behavior: 
Sample Image link: https://imgur.com/a/TNH3tOx

Tesseract misreads certain characters (e.g. F as the yen symbol, or E 
sometimes as '='). The following command whitelists the characters that 
appear on the pages and almost completely eliminates that problem:

$ tesseract 205c.tif 205c --psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&

However, since the images are formatted like a table, tesseract will not 
recognize the smaller spaces in the third column. To fix that issue, I can 
run the following command.

$ tesseract 205c.tif 205c --psm 6 -c tosp_min_sane_kn_sp=0.0

This command completely fixes the spacing problem. However, it does not 
whitelist the characters, so there are many more recognition errors. I 
therefore need both -c settings to apply together. I tried to do this with 
a config file:

config_file:
tosp_min_sane_kn_sp 0.0
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&

Then I run

$ tesseract 205c.tif 205c --psm 6 config_file

Tesseract always ignores one of these options, no matter what I do. 
Maybe I'm doing it wrong, but I've followed the format of other config 
files and the documented command line options. I've also tried running the 
command with more than one -c option. In both cases I cannot get both 
config variables to work together.
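
One thing that may be worth ruling out for the multiple -c attempt is shell quoting. The whitelist value ends in `&`, which an unquoted POSIX shell treats as a control operator: the command line is cut off at that point and run in the background, so the `&` itself and anything typed after it on the same line never reach tesseract. The sketch below only checks which arguments are actually passed. `run_tesseract` is a hypothetical helper name, and this is an assumption to test, not a confirmed explanation of the behavior above.

```shell
# Hypothetical helper: builds the argument list with each -c value
# single-quoted, prints it one argument per line, then runs tesseract
# only if it is installed and the image file exists.
run_tesseract() {
  image=$1
  outbase=$2
  # Single quotes keep the trailing '&' as a literal character.
  set -- "$image" "$outbase" --psm 6 \
    -c 'tosp_min_sane_kn_sp=0.0' \
    -c 'tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&'
  # Show exactly what tesseract would receive.
  printf '%s\n' "$@"
  if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
    tesseract "$@"
  fi
}

run_tesseract 205c.tif 205c
```

If the printed list shows both -c values intact and one setting is still ignored, quoting can be excluded as the cause.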

### Expected Behavior:
Both config variables should take effect when they are set together. From the help output:

$ tesseract --help-extra
"-c VAR=VALUE                         Set value for config variables.
                                      Multiple -c arguments are allowed."

### Suggested Fix:
I'm not even sure if this is a bug, but it definitely seems like it to me. 
I don't think I have the expertise to look into why this isn't working. 
Maybe I'm wrong here.
