Hi,

$ tesseract 205c.tif 205c --psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&

As far as I know, tessedit_char_whitelist works with Tesseract 3 but not with
Tesseract 4.
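
One more thing that may be worth ruling out: in the command above, the
whitelist ends with an unquoted '&', which a POSIX shell treats as "run in the
background" rather than as part of the value. Below is a minimal sketch,
assuming a POSIX shell and the file names from the report (the variable name
`whitelist` is only for illustration), that quotes the value and passes both
-c settings in one call. I have not verified whether this explains the ignored
option.

```shell
#!/bin/sh
# Single quotes keep the trailing '&' as a literal character instead of
# letting the shell interpret it as the background operator.
whitelist='ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&'

# Build the argument list once so it can be inspected before running.
set -- 205c.tif 205c --psm 6 \
    -c "tosp_min_sane_kn_sp=0.0" \
    -c "tessedit_char_whitelist=${whitelist}"

# Print one argument per line to check what tesseract would receive.
printf '%s\n' "$@"

# Run tesseract only if it is installed and the sample image is present.
if command -v tesseract >/dev/null 2>&1 && [ -f 205c.tif ]; then
    tesseract "$@"
fi
```

If the last printed line ends in '&', the whole whitelist reached the argument
list intact.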

On Mon, Dec 23, 2019 at 3:10 PM Nicholas Rees <[email protected]>
wrote:

> Below is a bug report that I'm considering filing. However, I'm not
> entirely sure that it's a bug, and I'd like someone who knows more about
> this to check it so that I'm not wasting anyone's time.
>
> The following is the bug report that I'll post if you think it's right.
>
>
> ### Environment
>
> * **Tesseract Version**:
> tesseract 4.1.0
> leptonica-1.78.0
> libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff
> 4.1.0 : zlib 1.2.11 : libwebp 1.0.3
> Found AVX2
> Found AVX
> Found SSE
> Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2
> libzstd/1.4.3
> * **Commit Number**:
> From pacman Arch repository (NOT THE AUR)
> * **Platform**: Linux NickArch 5.4.3-arch1-1 #1 SMP PREEMPT Fri, 13 Dec
> 2019 09:39:02 +0000 x86_64 GNU/Linux
>
> ### Current Behavior:
> Sample Image link: https://imgur.com/a/TNH3tOx
>
> Tesseract misreads certain characters (e.g. F as the yen symbol, or E
> sometimes as '='). The following command whitelists the characters that
> appear on the pages and almost completely eliminates that problem:
>
> $ tesseract 205c.tif 205c --psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&
>
> However, since the images are formatted like a table, tesseract will not
> recognize the smaller spaces in the third column. To fix that issue, I can
> run the following command.
>
> $ tesseract 205c.tif 205c --psm 6 -c tosp_min_sane_kn_sp=0.0
>
> This command completely fixes the spacing problem. However, it does not
> whitelist the characters, so there are many more recognition errors. I
> therefore need to apply both -c settings together. I tried to do this with
> a config file:
>
> config_file:
> tosp_min_sane_kn_sp 0.0
> tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&
>
> Then I run
>
> $ tesseract 205c.tif 205c --psm 6 config_file
>
> Tesseract will always ignore one of these options no matter what I do.
> Maybe I'm doing it wrong, but I've followed what other config files have
> shown and other command line options. However, I've also tried running the
> command with more than one -c option. In both cases I cannot get both
> config variables to work together.
>
> ### Expected Behavior:
> $ tesseract --help-extra
> "-c VAR=VALUE      Set value for config variables.
>                    Multiple -c arguments are allowed."
>
> ### Suggested Fix:
> I'm not even sure if this is a bug, but it definitely seems like it to me.
> I don't think I have the expertise to look into why this isn't working.
> Maybe I'm wrong here.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a552cd6a-2c06-4d79-80ec-a973aaecf2fa%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/a552cd6a-2c06-4d79-80ec-a973aaecf2fa%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>


-- 
Thanks & regards,
Ashwini
