Hi again,

This page <https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch> says that unicharset_extractor is buggy, so I wrote a Python program to do it instead. Does the attached file look right, and should it work with Tesseract 4.0?
Thanks,
Adam

On 13/09/2019 18:39, Shree Devi Kumar wrote:
> Yes, I also noticed this problem recently.
>
> My workaround is to create the unicharset from the training text/ground
> truth files rather than from box files.
>
> Look at the help for unicharset_extractor
>
> On Fri, Sep 13, 2019, 22:08 J Adam Funk <a.f...@sheffield.ac.uk
> <mailto:a.f...@sheffield.ac.uk>> wrote:
>
>     Hi,
>
>     I'm using tesseract 4.0.0 (Ubuntu package version 4.0.0-2) and
>     trying to set up training data. I have a Python tool that puts
>     random words in an image (using PIL) and saves the resulting *.box
>     and *.tif files, using the line-of-text per line of box file format.
>     I'm now trying to work through the training process, and the
>     unicharset is treating the "Wordstr" at the beginning as the
>     string. My box files look like this, which I think follows the
>     examples at
>     <https://github.com/tesseract-ocr/tesseract/issues/2357#issuecomment-477239316>:
>
>     Wordstr 68 102 1326 1205 0 #COMPASSED PERUVIANS
>     68 102 1326 1205 0
>     Wordstr 68 662 1260 465 0 #BIMINI'S
>     68 662 1260 465 0
>
>     and the resulting unicharset file is treating "Wordstr" as the text,
>     so I get this:
>
>     9
>     NULL 0 Common 0
>     Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined# Joined [4a 6f 69 6e 65 64 ]a
>     |Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1# Broken
>     W 5 0,255,0,255,0,0,0,0,0,0 Latin 3 0 3 W# W [57 ]A
>     o 3 0,255,0,255,0,0,0,0,0,0 Latin 4 0 4 o# o [6f ]a
>     r 3 0,255,0,255,0,0,0,0,0,0 Latin 5 0 5 r# r [72 ]a
>     d 3 0,255,0,255,0,0,0,0,0,0 Latin 6 0 6 d# d [64 ]a
>     s 3 0,255,0,255,0,0,0,0,0,0 Latin 7 0 7 s# s [73 ]a
>     t 3 0,255,0,255,0,0,0,0,0,0 Latin 8 0 8 t# t [74 ]a
>
>     What am I doing wrong?
>
>     Thanks,
>     Adam

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1b150639-4ec7-5294-7633-161ce98a5703%40sheffield.ac.uk.
64
NULL 0 Common 0
' 10 Common 1
0 0 Common 2
1 0 Common 3
2 0 Common 4
3 0 Common 5
4 0 Common 6
5 0 Common 7
6 0 Common 8
7 0 Common 9
8 0 Common 10
9 0 Common 11
A 5 Latin 12
B 5 Latin 13
C 5 Latin 14
D 5 Latin 15
E 5 Latin 16
F 5 Latin 17
G 5 Latin 18
H 5 Latin 19
I 5 Latin 20
J 5 Latin 21
K 5 Latin 22
L 5 Latin 23
M 5 Latin 24
N 5 Latin 25
O 5 Latin 26
P 5 Latin 27
Q 5 Latin 28
R 5 Latin 29
S 5 Latin 30
T 5 Latin 31
U 5 Latin 32
V 5 Latin 33
W 5 Latin 34
X 5 Latin 35
Y 5 Latin 36
Z 5 Latin 37
a 3 Latin 38
b 3 Latin 39
c 3 Latin 40
d 3 Latin 41
e 3 Latin 42
f 3 Latin 43
g 3 Latin 44
h 3 Latin 45
i 3 Latin 46
j 3 Latin 47
k 3 Latin 48
l 3 Latin 49
m 3 Latin 50
n 3 Latin 51
o 3 Latin 52
p 3 Latin 53
q 3 Latin 54
r 3 Latin 55
s 3 Latin 56
t 3 Latin 57
u 3 Latin 58
v 3 Latin 59
w 3 Latin 60
x 3 Latin 61
y 3 Latin 62
z 3 Latin 63
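The program that produced the attached file is not included in this message. As a rough illustration only, a file in the same simplified layout (character, hex properties, script, id) could be built from the ground-truth text with something like the sketch below. Everything in it is an assumption: the function names are hypothetical, the property bits are assumed to be 1=alpha, 2=lower, 4=upper, 8=digit, 0x10=punctuation, and the script is guessed crudely from the Unicode character name. Under that assumed bit layout digits come out as 8, whereas the attached file lists 0 for them.

```python
import unicodedata


def box_line_text(line):
    """Return the text after '#' on a line-level box entry, or None if absent."""
    _, sep, text = line.partition("#")
    return text.rstrip("\n") if sep else None


def char_properties(ch):
    """Property bitmask for one character.

    Assumed bit layout: 1=alpha, 2=lower, 4=upper, 8=digit, 16=punctuation.
    """
    props = 0
    if ch.isalpha():
        props |= 1
    if ch.islower():
        props |= 2
    if ch.isupper():
        props |= 4
    if ch.isdigit():
        props |= 8
    if unicodedata.category(ch).startswith("P"):
        props |= 16
    return props


def char_script(ch):
    """Crude script guess: 'Latin' for Latin letters, otherwise 'Common'."""
    if ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN"):
        return "Latin"
    return "Common"


def build_unicharset(texts):
    """Return the lines of a simplified unicharset for an iterable of strings."""
    # Whitespace is skipped; entry 0 is the NULL placeholder.
    chars = sorted({ch for text in texts for ch in text if not ch.isspace()})
    lines = [str(len(chars) + 1), "NULL 0 Common 0"]
    for uid, ch in enumerate(chars, start=1):
        # Properties are written in hexadecimal, as in the attached file.
        lines.append(f"{ch} {char_properties(ch):x} {char_script(ch)} {uid}")
    return lines


if __name__ == "__main__":
    box_lines = [
        "Wordstr 68 102 1326 1205 0 #COMPASSED PERUVIANS",
        "68 102 1326 1205 0",
        "Wordstr 68 662 1260 465 0 #BIMINI'S",
        "68 662 1260 465 0",
    ]
    texts = [t for t in map(box_line_text, box_lines) if t is not None]
    print("\n".join(build_unicharset(texts)))
```

Taking the characters from the text after the `#` rather than from the start of each box line follows the workaround suggested in the quoted reply (build the unicharset from the ground-truth text, not from the box files as such).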