Hi, I'm trying to create a .traineddata file for digits (example .tif file):
[image: test.png]

First, I run this command:

    tesseract eng.strangelabelmachinefont.exp0.tif .strangelabelmachinefont.exp0 batch.nochop makebox

After that, I run this:

    tesseract eng.strangelabelmachinefont.exp0.tif eng.strangelabelmachinefont.exp0 box.train

So far, everything is OK. I can see that the file eng.strangelabelmachinefont.exp0.box was created with this content:

    1 13 17 34 61 0
    8 51 15 81 61 0
    3 97 14 125 59 0
    5 141 13 170 58 0
    0 184 13 216 58 0
    3 231 13 261 58 0

I run into a problem after calling this command:

    unicharset_extractor eng.strangelabelmachinefont.exp0.box

It creates a unicharset file with this content:

    8
    NULL 0 Common 0
    Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
    |Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
    1 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0
    8 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 8 # 8 [38 ]0
    3 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 3 # 3 [33 ]0
    5 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 5 # 5 [35 ]0
    0 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 0 # 0 [30 ]0

The problem appears when I run the next command:

    shapeclustering -F font_properties unicharset file_name.tr

I get tons of errors, mostly about a bad format in the .tr file:

    Reading unicharset ...
    Bad format in tr file, reading fontname, unichar
    Bad box coordinates in boxfile string! 0 Common 0
    Bad format in tr file, reading box coords
    Bad box coordinates in boxfile string! 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
    Bad format in tr file, reading box coords
    Bad box coordinates in boxfile string! 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
    Bad format in tr file, reading box coords
    Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0
    Bad format in tr file, reading box coords
    Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 8 # 8 [38 ]0
    Bad format in tr file, reading box coords
    Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 3 # 3 [33 ]0
    Bad format in tr file, reading box coords
    Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 5 # 5 [35 ]0
    Bad format in tr file, reading box coords
    Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 0 # 0 [30 ]0
    Bad format in tr file, reading box coords
    Reading eng.strangelabelmachinefont.exp0.tr ...
    Building master shape table
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 1 2 3 4
    Stopped with 0 merged, min dist 0.263473
    Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0

I read this article: https://www.systutorials.com/docs/linux/man/5-unicharset/ but it didn't help me.
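To rule out a malformed box file or unicharset as the cause, a minimal sanity check along these lines can be run on the two files listed above. It is only a sketch under stated assumptions: each box line is taken to hold six whitespace-separated fields (character, left, bottom, right, top, page), and the first unicharset line is taken to be the entry count. The names `Box`, `parse_box_lines` and `unicharset_count_matches` are hypothetical helpers for this illustration, not part of Tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One box-file entry (hypothetical helper type, not a Tesseract API)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file text, assuming six whitespace-separated fields per line."""
    boxes = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(f"line {lineno}: expected 6 fields, got {len(fields)}")
        left, bottom, right, top, page = (int(v) for v in fields[1:])
        if left >= right or bottom >= top:
            raise ValueError(f"line {lineno}: empty or inverted box")
        boxes.append(Box(fields[0], left, bottom, right, top, page))
    return boxes


def unicharset_count_matches(text: str) -> bool:
    """True if the count on the first line equals the number of entry lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return int(lines[0]) == len(lines) - 1


# Contents copied from the post above.
BOX_TEXT = """\
1 13 17 34 61 0
8 51 15 81 61 0
3 97 14 125 59 0
5 141 13 170 58 0
0 184 13 216 58 0
3 231 13 261 58 0
"""

UNICHARSET_TEXT = """\
8
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
1 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0
8 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 8 # 8 [38 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 3 # 3 [33 ]0
5 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 5 # 5 [35 ]0
0 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 0 # 0 [30 ]0
"""

boxes = parse_box_lines(BOX_TEXT)
print("".join(b.char for b in boxes))             # characters in file order
print(unicharset_count_matches(UNICHARSET_TEXT))  # declared count vs. entries
```

Both files pass this check, so under these assumptions neither one looks malformed by itself.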
My configuration: Windows 10 x64, using tesseract-ocr-w64-v5.0.0-alpha.20190708.

This number is recognized well with pytesseract:

    from PIL import Image
    import pytesseract

    # image_path points to the test image shown at the top of this post
    pytesseract.image_to_string(
        Image.open(image_path),
        lang="eng",
        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

But my goal is to create a dataset that recognizes digits in situations like this one:

[image: 369490.png]

I also tried some algorithms to remove these horizontal lines, but the results were no better, so I think creating a custom dataset is the better option.

Does anyone have any suggestions? Is this a problem with my version of Tesseract, or do I have to change something manually in the unicharset file?

Thanks.

Best Regards,
Stevan

