Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

zdenko podobny Fri, 04 Jul 2014 11:47:25 -0700

First of all - the source code is documentation...

Next - it is just your expectation that it is wrong ;-)
Did changing the values manually bring any improvement to OCR? I would not be
surprised if those values are not used by the current version of tesseract.



Zdenko


On Fri, Jul 4, 2014 at 7:02 PM, Albrecht Hilker <[email protected]>
wrote:

>
> > Can you please explain why you think that
> "unicharset_extractor.exe produces wrong and incomplete files"?
>
> Because this is definitely wrong:
>
> 90
> NULL 0 NULL 0
> A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # A [41 ]A
> B 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # B [42 ]A
> C 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # C [43 ]A
> D 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # D [44 ]A
> E 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # E [45 ]A
> F 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # F [46 ]A
> G 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # G [47 ]A
> H 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # H [48 ]A
> I 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # I [49 ]A
> J 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # J [4a ]A
> K 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # K [4b ]A
> L 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # L [4c ]A
> M 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # M [4d ]A
> N 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # N [4e ]A
> O 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # O [4f ]A
> P 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # P [50 ]A
> Q 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # Q [51 ]A
> R 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # R [52 ]A
> S 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # S [53 ]A
> T 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # T [54 ]A
> U 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # U [55 ]A
> V 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # V [56 ]A
> W 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # W [57 ]A
> X 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # X [58 ]A
> Y 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # Y [59 ]A
> Z 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # Z [5a ]A
> a 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0   # a [61 ]a
> b 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0      # b [62 ]a
> c 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0      # c [63 ]a
> d 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0      # d [64 ]a
>
>
>
> 1.)
> The column "other_case" should contain the ID of the other-case letter.
> For the lowercase letters, it points correctly to the uppercase letters.
> But the uppercase letters all have a value of -1, which is wrong.
> They should hold the ID of the corresponding lowercase letter.
>
> 2.)
> The script name is always NULL.
> It should be LATIN or COMMON
>
> 3.)
> None of the min / max values are filled in.
> They are only ever 0, 255 or 32767.
> That is 10 columns without real data!
>
> 4.)
> The last column "normed_form" is missing.
> The '#' starts a comment.
> But when this unicharset is read, the '#' is misinterpreted as the
> "normed_form".
> This column should mostly contain the same letter as the first column.
>
>
>
> Here you see a unicharset extracted from a traineddata file with all
> columns filled correctly:
>
> A 5 52,68,216,255,100,216,0,17,98,231 Latin 2 0 15 A    # A [41 ]A
> B 5 62,68,216,255,91,227,0,27,106,227 Latin 23 0 102 B  # B [42 ]A
>
> etc..
>
> a 3 58,65,186,200,85,164,0,26,97,185 Latin 15 0 2 a     # a [61 ]a
> b 3 58,64,216,255,87,180,0,25,100,200 Latin 102 0 23 b  # b [62 ]a
>
>
> Result:
> The unicharset_extractor tool is very buggy.
> I have to edit all by hand.
>
>
> So my question remains:
>
> Where do I find detailed documentation of the unicharset file?
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/40473df4-df33-45f1-a593-2348f15b6b0b%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/40473df4-df33-45f1-a593-2348f15b6b0b%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>
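
For anyone following the thread, the column layout described in the quoted
message can be sketched as a small Python parser. This is only an
illustration built from the sample rows above, not the reader that tesseract
itself uses. The names UnicharEntry, parse_unichar_line and is_placeholder
are hypothetical, the order "other_case direction mirror" after the script
is my reading of the rows, and treating the properties field as hexadecimal
is an assumption. Short rows such as "NULL 0 NULL 0" are not handled.

```python
from dataclasses import dataclass
from typing import List, Optional

# Metric values seen in the extractor output quoted above (no real data).
PLACEHOLDER_METRICS = [0, 255, 0, 255, 0, 32767, 0, 32767, 0, 32767]


@dataclass
class UnicharEntry:
    unichar: str
    properties: int             # assumed to be a hexadecimal bitmask
    metrics: List[int]          # the 10 comma-separated min/max values
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: Optional[str]  # None when the column is absent


def parse_unichar_line(line: str) -> UnicharEntry:
    """Parse one full row; everything after the 8th field is ignored."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected at least 7 fields: " + line)
    metrics = [int(v) for v in fields[2].split(",")]
    if len(metrics) != 10:
        raise ValueError("expected 10 min/max values: " + line)
    # Assumption: a bare '#' in the 8th position starts the comment, so the
    # normed_form column is treated as missing. A row whose normed_form
    # really is '#' would be misread by this sketch.
    normed = fields[7] if len(fields) > 7 and fields[7] != "#" else None
    return UnicharEntry(
        unichar=fields[0],
        properties=int(fields[1], 16),
        metrics=metrics,
        script=fields[3],
        other_case=int(fields[4]),
        direction=int(fields[5]),
        mirror=int(fields[6]),
        normed_form=normed,
    )


def is_placeholder(entry: UnicharEntry) -> bool:
    """True when the row shows the symptoms listed in points 1 to 4."""
    return (
        entry.metrics == PLACEHOLDER_METRICS
        and entry.script == "NULL"
        and entry.normed_form is None
    )


extracted = parse_unichar_line(
    "A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # A [41 ]A"
)
filled = parse_unichar_line(
    "A 5 52,68,216,255,100,216,0,17,98,231 Latin 2 0 15 A    # A [41 ]A"
)
```

Running is_placeholder over every row of a file is a quick way to see how
many entries would have to be edited by hand.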

