Hi, I'm attempting to train Tesseract to recognise base64 text in a specific font (Consolas). I generate a page of sample text and the training data as follows:
```
# Generate a page of sample text
$ dd if=/dev/urandom count=880 bs=1 | base64 --wrap=42 > base64_sample.txt

# Generate the tif file in the same format as the image I am attempting to decode
$ text2image --fonts_dir=. --font=Consolas --text=base64_sample.txt --outputbase=eng.consolas.exp0 --degrade_image=false --margin=10 --ptsize=38 --resolution=70 --xsize=860 --ysize=1200 --leading=4
# -> generates eng.consolas.exp0.box eng.consolas.exp0.tif

$ tesseract eng.consolas.exp0.tif eng.consolas.exp0 box.train
# -> generates eng.consolas.exp0.tr

$ unicharset_extractor --output_unicharset eng.unicharset eng.consolas.exp0.box
# -> generates eng.unicharset

$ echo 'consolas 0 0 0 1 0 0' > eng.font_properties

$ shapeclustering -F eng.font_properties -U eng.unicharset eng.consolas.exp0.tr
# -> generates shapetable
$ mv shapetable eng.shapetable

$ mftraining -F eng.font_properties -U eng.unicharset -O eng.unicharset eng.consolas.exp0.tr
# -> generates shapetable
# -> generates pffmtable
# -> generates inttemp
$ mv shapetable eng.shapetable
$ mv pffmtable eng.pffmtable
$ mv inttemp eng.inttemp

$ cntraining eng.consolas.exp0.tr
# -> generates normproto
$ mv normproto eng.normproto

$ combine_tessdata eng.
Combining tessdata files
Output en.traineddata created successfully.
Version string:4.0.0-beta.1
1:unicharset:size=3693, offset=192
3:inttemp:size=345347, offset=3885
4:pffmtable:size=464, offset=349232
5:normproto:size=7862, offset=349696
13:shapetable:size=1084, offset=357558
23:version:size=12, offset=358642
# -> generates eng.traineddata
```

I then decode real data using:

```
tesseract base64.png base64 --tessdata-dir ./ -l eng
```

However, when I decode a document that has very similar dimensions and uses the same font, I still get errors (e.g. incorrect capitalisation or entirely wrong letters).

How can I improve the accuracy?

Thanks,
Theo
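One thing that may be worth checking in the setup above is how often each base64 character actually occurs in `base64_sample.txt`. `unicharset_extractor` only sees the characters present in the box file, and 880 random bytes encode to 1176 characters, so each of the 64 symbols gets on the order of 18 examples on average. The following is a minimal sketch, not part of the Tesseract tooling: `char_counts` and `distinct_chars` are hypothetical helper names, and it assumes the sample is plain ASCII base64.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Tesseract) for inspecting a base64
# training sample. Assumes the file contains only ASCII characters.

# Print "count character" pairs, least frequent characters first.
char_counts() {
    # Drop newlines, put one character per line, then tally them.
    # LC_ALL=C keeps the sort order byte-based and reproducible.
    tr -d '\n' < "$1" | fold -w 1 | LC_ALL=C sort | uniq -c | LC_ALL=C sort -n
}

# Print the number of distinct characters in the file.
distinct_chars() {
    tr -d '\n' < "$1" | fold -w 1 | LC_ALL=C sort -u | wc -l | tr -d ' '
}
```

For example, `char_counts base64_sample.txt | head` lists the rarest characters first, and `distinct_chars base64_sample.txt` would print 65 when all 64 alphabet symbols and the `=` padding are present. If some characters turn out to be rare, a longer sample page is one thing to try, though this is only a guess at a possible contributing factor.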

