Hi again,

This page
<https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch>
says that unicharset_extractor is buggy, so I wrote a Python program to
generate the unicharset instead.  Does the attached file look right, and
should it work with Tesseract 4.0?
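
For reference, here is a minimal sketch of the idea (not the script
itself): it takes the transcription after the "#" on each Wordstr line,
as in the box lines quoted below, instead of the first token, and works
out a property code for each character.  The function names are
hypothetical, and the property bit values (alpha 1, lower 2, upper 4,
digit 8, punctuation 0x10) are my reading of the unicharset(5) man page,
so treat them as an assumption.  The keyword is matched
case-insensitively because I believe Tesseract's own examples spell it
"WordStr".

```python
import unicodedata

# Property bits as I understand them from unicharset(5); assumed unchanged in 4.0.
ALPHA, LOWER, UPPER, DIGIT, PUNCT = 0x1, 0x2, 0x4, 0x8, 0x10


def wordstr_text(line):
    """Return the transcription from a 'Wordstr ... #text' box line, or None."""
    head, sep, text = line.rstrip("\n").partition("#")
    if not sep or not head.lower().startswith("wordstr"):
        return None
    return text


def char_properties(ch):
    """Compute the property bitmask for a single character."""
    props = 0
    if ch.isalpha():
        props |= ALPHA
    if ch.islower():
        props |= LOWER
    if ch.isupper():
        props |= UPPER
    if ch.isdigit():
        props |= DIGIT
    if unicodedata.category(ch).startswith("P"):
        props |= PUNCT
    return props


def collect_chars(box_lines):
    """Collect the sorted set of non-space characters found in Wordstr lines."""
    chars = set()
    for line in box_lines:
        text = wordstr_text(line)
        if text:
            chars.update(c for c in text if not c.isspace())
    return sorted(chars)


# Sample lines copied from the box file quoted in this thread.
box_lines = [
    "Wordstr 68 102 1326 1205 0 #COMPASSED PERUVIANS",
    "68 102 1326 1205 0",
    "Wordstr 68 662 1260 465 0 #BIMINI'S",
    "68 662 1260 465 0",
]
chars = collect_chars(box_lines)
for ch in chars:
    # Print the character and its property code in hex.
    print(ch, format(char_properties(ch), "x"))
```

On the four sample lines this collects 15 distinct characters, with the
apostrophe sorted first; upper-case letters come out as 5 and the
apostrophe as 10.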

Thanks,
Adam




On 13/09/2019 18:39, Shree Devi Kumar wrote:
> Yes, I also noticed this problem recently.
> 
> My workaround is to create the unicharset from the training text/ground
> truth files rather than from box files.
> 
> Look at the help for unicharset_extractor.
> 
> On Fri, Sep 13, 2019, 22:08 J Adam Funk <a.f...@sheffield.ac.uk
> <mailto:a.f...@sheffield.ac.uk>> wrote:
> 
>     Hi,
> 
>     I'm using Tesseract 4.0.0 (Ubuntu package version 4.0.0-2) and
>     trying to set up training data. I have a Python tool that draws
>     random words into an image (using PIL) and saves the resulting *.box
>     and *.tif files, using the box file format that has one line of
>     text per entry. I'm now trying to work through the training process,
>     and the unicharset extraction is treating the "Wordstr" at the
>     beginning of each line as the text.  My box files look like this,
>     which I think follows the examples at
>     <https://github.com/tesseract-ocr/tesseract/issues/2357#issuecomment-477239316>:
> 
>     Wordstr 68 102 1326 1205 0 #COMPASSED PERUVIANS
>     68 102 1326 1205 0
>     Wordstr 68 662 1260 465 0 #BIMINI'S
>     68 662 1260 465 0
> 
>     and the resulting unicharset file contains only the characters of
>     "Wordstr" rather than those of the transcriptions, so I get this:
> 
>     9
>     NULL 0 Common 0
>     Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined# Joined [4a 6f 69 6e 65 64 ]a
>     |Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1# Broken
>     W 5 0,255,0,255,0,0,0,0,0,0 Latin 3 0 3 W# W [57 ]A
>     o 3 0,255,0,255,0,0,0,0,0,0 Latin 4 0 4 o# o [6f ]a
>     r 3 0,255,0,255,0,0,0,0,0,0 Latin 5 0 5 r# r [72 ]a
>     d 3 0,255,0,255,0,0,0,0,0,0 Latin 6 0 6 d# d [64 ]a
>     s 3 0,255,0,255,0,0,0,0,0,0 Latin 7 0 7 s# s [73 ]a
>     t 3 0,255,0,255,0,0,0,0,0,0 Latin 8 0 8 t# t [74 ]a
> 
>     What am I doing wrong?
> 
>     Thanks,
>     Adam
> 
>     -- 
>     You received this message because you are subscribed to the Google
>     Groups "tesseract-ocr" group.
>     To unsubscribe from this group and stop receiving emails from it,
>     send an email to tesseract-ocr+unsubscr...@googlegroups.com
>     <mailto:tesseract-ocr+unsubscr...@googlegroups.com>.
>     To view this discussion on the web visit
>     
> https://groups.google.com/d/msgid/tesseract-ocr/60ec5ff4-2125-4342-bb9e-feae4dfa91fc%40googlegroups.com
>     
> <https://groups.google.com/d/msgid/tesseract-ocr/60ec5ff4-2125-4342-bb9e-feae4dfa91fc%40googlegroups.com?utm_medium=email&utm_source=footer>.

64
NULL 0 Common 0
' 10 Common 1
0 0 Common 2
1 0 Common 3
2 0 Common 4
3 0 Common 5
4 0 Common 6
5 0 Common 7
6 0 Common 8
7 0 Common 9
8 0 Common 10
9 0 Common 11
A 5 Latin 12
B 5 Latin 13
C 5 Latin 14
D 5 Latin 15
E 5 Latin 16
F 5 Latin 17
G 5 Latin 18
H 5 Latin 19
I 5 Latin 20
J 5 Latin 21
K 5 Latin 22
L 5 Latin 23
M 5 Latin 24
N 5 Latin 25
O 5 Latin 26
P 5 Latin 27
Q 5 Latin 28
R 5 Latin 29
S 5 Latin 30
T 5 Latin 31
U 5 Latin 32
V 5 Latin 33
W 5 Latin 34
X 5 Latin 35
Y 5 Latin 36
Z 5 Latin 37
a 3 Latin 38
b 3 Latin 39
c 3 Latin 40
d 3 Latin 41
e 3 Latin 42
f 3 Latin 43
g 3 Latin 44
h 3 Latin 45
i 3 Latin 46
j 3 Latin 47
k 3 Latin 48
l 3 Latin 49
m 3 Latin 50
n 3 Latin 51
o 3 Latin 52
p 3 Latin 53
q 3 Latin 54
r 3 Latin 55
s 3 Latin 56
t 3 Latin 57
u 3 Latin 58
v 3 Latin 59
w 3 Latin 60
x 3 Latin 61
y 3 Latin 62
z 3 Latin 63
