[tesseract-ocr] OCR on base64 text

Theo Cushion Thu, 23 May 2019 00:45:21 -0700

Hi,

I'm attempting to train tesseract to recognise base64 text set in a 
specific font (Consolas). I'm generating a page of sample data and the 
training files like so:


```
# Generate a page of sample text
$ dd if=/dev/urandom count=880 bs=1 | base64 --wrap=42 > base64_sample.txt 

# Generate the tif file in the same format as the images I am attempting to decode
$ text2image --fonts_dir=. --font=Consolas --text=base64_sample.txt \
    --outputbase=eng.consolas.exp0 --degrade_image=false --margin=10 \
    --ptsize=38 --resolution=70 --xsize=860 --ysize=1200 --leading=4
# -> generates eng.consolas.exp0.box eng.consolas.exp0.tif 

$ tesseract eng.consolas.exp0.tif eng.consolas.exp0 box.train 
# -> generates eng.consolas.exp0.tr 

$ unicharset_extractor --output_unicharset eng.unicharset \
    eng.consolas.exp0.box
# -> generates eng.unicharset 

$ echo 'consolas 0 0 0 1 0 0' > eng.font_properties

$ shapeclustering -F eng.font_properties -U eng.unicharset \
    eng.consolas.exp0.tr
# -> generates shapetable 
$ mv shapetable eng.shapetable 

$ mftraining -F eng.font_properties -U eng.unicharset -O eng.unicharset \
    eng.consolas.exp0.tr
# -> generates shapetable 
# -> generates pffmtable 
# -> generates inttemp 
$ mv shapetable eng.shapetable 
$ mv pffmtable eng.pffmtable 
$ mv inttemp eng.inttemp 

$ cntraining eng.consolas.exp0.tr 
# -> generates normproto 
$ mv normproto eng.normproto 

$ combine_tessdata eng. 
Combining tessdata files 
Output en.traineddata created successfully. 
Version string:4.0.0-beta.1 
1:unicharset:size=3693, offset=192 
3:inttemp:size=345347, offset=3885 
4:pffmtable:size=464, offset=349232 
5:normproto:size=7862, offset=349696 
13:shapetable:size=1084, offset=357558 
23:version:size=12, offset=358642 
# -> generates eng.traineddata
```

I then decode real data using:

```
tesseract base64.png base64 --tessdata-dir ./ -l eng
```
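
As a quick structural sanity check on the decoded output, something like the following sketch could flag lines that cannot be valid base64. The helper name `check_base64_lines` is hypothetical, and it assumes the output is in `base64.txt` with lines 42 characters wide (matching the `--wrap=42` used for the sample page). A shorter final line, blank lines, or a trailing page separator would be counted as bad, so treat the numbers as a rough indicator only.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): count OCR output lines that
# contain characters outside the base64 alphabet, and lines whose length
# differs from the expected wrap width.
#   $1 = OCR text file, $2 = expected line width
check_base64_lines() {
    awk -v width="$2" '
        # Any character that is not A-Z, a-z, 0-9, +, / or = is an OCR error.
        $0 ~ "[^A-Za-z0-9+/=]" { bad_chars++ }
        # A line of the wrong length means a character was dropped or inserted
        # (or it is a short final line, which this sketch does not special-case).
        length($0) != width    { bad_length++ }
        END { printf "bad_chars=%d bad_length=%d\n", bad_chars, bad_length }
    ' "$1"
}

# Example usage (assumes base64.txt was produced by the command above):
#   check_base64_lines base64.txt 42
```

This only catches structurally impossible output. It cannot detect substitutions within the alphabet, such as a capital letter read as its lowercase form.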

However, when I decode a document with very similar dimensions that uses 
the same font, I still get errors (e.g. incorrect capitalisation or 
entirely wrong letters). How can I improve the accuracy?
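
To put numbers on the two kinds of error, a rough sketch along these lines could compare the OCR output against a known-good copy of the text. The helper names and the file names `truth.txt` and `ocr.txt` are hypothetical. It assumes both files have the same length, i.e. characters were substituted but never inserted or dropped, and that any trailing blank lines or page separator have been stripped from the OCR output first.

```shell
#!/bin/sh
# Hypothetical helper: count byte positions at which two files differ.
# cmp exits non-zero when the files differ, so its status is ignored.
count_mismatches() {
    { cmp -l "$1" "$2" || true; } | wc -l | tr -d ' '
}

# Hypothetical helper: split substitution errors into case-only errors
# (e.g. "J" read as "j") and all other errors (e.g. "G" read as "6").
#   $1 = ground-truth text, $2 = OCR output of the same page
ocr_error_summary() {
    truth=$1
    ocr=$2
    tmpdir=$(mktemp -d)
    # Lowercase both files; mismatches that survive are not case errors.
    tr '[:upper:]' '[:lower:]' < "$truth" > "$tmpdir/truth.lower"
    tr '[:upper:]' '[:lower:]' < "$ocr" > "$tmpdir/ocr.lower"
    total=$(count_mismatches "$truth" "$ocr")
    other=$(count_mismatches "$tmpdir/truth.lower" "$tmpdir/ocr.lower")
    rm -rf "$tmpdir"
    echo "total=$total case_only=$((total - other)) other=$other"
}

# Example usage (file names are placeholders):
#   ocr_error_summary truth.txt ocr.txt
```

A byte-wise comparison is used because it needs nothing beyond standard tools. If the OCR output drops or inserts characters, every later position will mismatch, and an alignment-based diff would be needed instead.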

Thanks,

Theo
