[tesseract-ocr] unicharset_extractor issue

Stevan Cakic Sun, 25 Aug 2019 11:02:02 -0700

Hi,

I'm trying to create a .traineddata file for digits (example .tif image):


[image: test.png]
First, I run this command:
*tesseract eng.strangelabelmachinefont.exp0.tif eng.strangelabelmachinefont.exp0 batch.nochop makebox*
After that, I run this:
*tesseract eng.strangelabelmachinefont.exp0.tif eng.strangelabelmachinefont.exp0 box.train*

So far, everything is OK. The file *eng.strangelabelmachinefont.exp0.box* is created with this content:
1 13 17 34 61 0
8 51 15 81 61 0
3 97 14 125 59 0
5 141 13 170 58 0
0 184 13 216 58 0
3 231 13 261 58 0
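
For reference, each line of a Tesseract box file has the form `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. A minimal Python sketch for sanity-checking the box file above before training might look like this; the `Box` type and the `parse_box_line` helper are hypothetical names used for illustration and are not part of Tesseract:

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    # A usable box must have positive width and height.
    if not (left < right and bottom < top):
        raise ValueError(f"degenerate box: {line!r}")
    return Box(parts[0], left, bottom, right, top, page)


# The six lines from the .box file shown above.
sample = """\
1 13 17 34 61 0
8 51 15 81 61 0
3 97 14 125 59 0
5 141 13 170 58 0
0 184 13 216 58 0
3 231 13 261 58 0
"""

boxes = [parse_box_line(line) for line in sample.splitlines()]
text = "".join(box.symbol for box in boxes)
print(text)  # the symbols in reading order
```

All six lines parse cleanly, so the box file itself appears to be well formed.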

I have a problem when calling this command:
*unicharset_extractor eng.strangelabelmachinefont.exp0.box*
When I call the above command, a unicharset file is created with this content:
8
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
1 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0
8 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 8 # 8 [38 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 3 # 3 [33 ]0
5 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 5 # 5 [35 ]0
0 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 0 # 0 [30 ]0
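
According to the unicharset(5) man page, the second field of each entry is a hexadecimal bit mask of character properties. The sketch below decodes that field for a few of the entries above. The bit assignments are taken from my reading of the man page and should be treated as an assumption, and `decode_properties` is a hypothetical helper, not a Tesseract API:

```python
from typing import List

# Bit assignments as described in unicharset(5); treat them as an assumption.
PROPERTY_BITS = {
    0x01: "alpha",
    0x02: "lower",
    0x04: "upper",
    0x08: "digit",
    0x10: "punctuation",
}


def decode_properties(field: str) -> List[str]:
    """Decode the hexadecimal properties field of a unicharset entry."""
    mask = int(field, 16)
    return [name for bit, name in PROPERTY_BITS.items() if mask & bit]


for symbol, field in [("Joined", "7"), ("|Broken|0|1", "15"), ("1", "8")]:
    print(symbol, decode_properties(field))
```

Under that reading, every digit entry carries only the digit flag (8), which is what one would expect for a digits-only training set.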

The problem appears when I run the next command:
*shapeclustering -F font_properties unicharset file_name.tr*

I get tons of errors, mostly about bad format in the .tr file:

Reading unicharset ...
Bad format in tr file, reading fontname, unichar
Bad box coordinates in boxfile string! 0 Common 0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined     # Joined [4a 6f 69 6e 65 64 ]a

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1       # Broken

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 8 # 8 [38 ]0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 3 # 3 [33 ]0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 5 # 5 [35 ]0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 0 # 0 [30 ]0

Bad format in tr file, reading box coords
Reading eng.strangelabelmachinefont.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.263473
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
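
Each "Bad box coordinates" string in the log is a line of the unicharset file with its first token removed, which suggests that shapeclustering read `unicharset` as if it were a .tr file. The Tesseract 3.x training documentation passes the unicharset through a `-U` flag (`shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr`). Below is a minimal Python sketch that builds the command line in that form, assuming the documented usage also applies to this 5.0.0-alpha build. `build_shapeclustering_cmd` is a hypothetical helper, and the command is only printed, not executed:

```python
from typing import List, Sequence


def build_shapeclustering_cmd(
    font_properties: str, unicharset: str, tr_files: Sequence[str]
) -> List[str]:
    """Build a shapeclustering argument list with an explicit -U flag.

    Assumption: the '-F <font_properties> -U <unicharset> <tr files...>'
    form from the Tesseract 3.x training documentation.
    """
    if not tr_files:
        raise ValueError("at least one .tr file is required")
    return ["shapeclustering", "-F", font_properties, "-U", unicharset, *tr_files]


cmd = build_shapeclustering_cmd(
    "font_properties",
    "unicharset",
    ["eng.strangelabelmachinefont.exp0.tr"],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the file names match the ones on disk.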

I read this article, but it didn't help me:
https://www.systutorials.com/docs/linux/man/5-unicharset/
My configuration: Windows 10 x64, using tesseract-ocr-w64-v5.0.0-alpha.20190708.
This number is recognized correctly with pytesseract:

pytesseract.image_to_string(Image.open(image_path), lang="eng", 
config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

But my goal is to create a dataset that recognizes digits in situations like this one, for example:

[image: 369490.png]
I also tried some algorithms to remove these horizontal lines, but the results were no better, so it seems better to create a custom dataset.
Does anyone have any suggestions? Is this a problem with my version of Tesseract, or do I have to do something manually with the unicharset file?
Thanks.

Best Regards,
Stevan
