Hi Joe,

Good question.

Gibberish is fine, though try to make it look somewhat plausible. I used exactly that approach when training Ancient Greek (which, once you count all the diacritics, has a few hundred characters too). I wrote a shell script that outputs random words from the wordlist, ensuring that each character is used at least 5 times, and that also adds punctuation and uppercase letters in somewhat plausible (though nonsense) placements. I'll attach it, though it's probably somewhat script-specific. I'm happy to answer any questions about how it works. I'm also attaching a little C program called "isupper", which it needs.

I then fed the resulting text file into my lazytrain program [1], but there are several other programs that will create a box and image file from a text (see [2]).

Best of luck, and let us know how you get on.

Nick

1. http://www.dur.ac.uk/nick.white/tools/
2. http://code.google.com/p/tesseract-ocr/wiki/AddOns
/*
 * Copyright 2012 Nick White <[email protected]>
 *
 * Permission to use, copy, modify, and/or distribute this software for any
 * purpose with or without fee is hereby granted, provided that the above
 * copyright notice and this permission notice appear in all copies.
 *
 * Version 1.0
 */

#define usage "isupper c\n\n" \
              "Returns 0 if c is uppercase, and 1 otherwise.\n"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

int main(int argc, char *argv[]) {
	wchar_t w;

	if(argc != 2) {
		fputs(usage, stdout);
		return 2;
	}

	if (0 == setlocale(LC_CTYPE, "")) {
		fputs("Error: Locale is invalid.\n", stderr);
		return 2;
	}

	if(mbtowc(&w, argv[1], MB_CUR_MAX) == -1) {
		fprintf(stderr, "Error: Conversion of %s to wide character failed.\n", argv[1]);
		return 2;
	}

	return !iswupper(w);
}
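Because isupper exits with status 0 for an uppercase character, a shell script can use it directly as an "if" condition. The following is a minimal sketch of that calling convention, assuming the program above has been compiled to an executable named isupper somewhere on the PATH. The function name count_upper is made up for illustration and is not taken from makegarbage.sh.

```shell
#!/bin/sh
# count_upper: print how many of the given arguments start with an
# uppercase character according to isupper (exit status 0 = uppercase).
# Assumption: an "isupper" command is available on the PATH.
# count_upper itself is a hypothetical helper, not part of the original script.
count_upper() {
    count=0
    for ch in "$@"; do
        if isupper "$ch"; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

For example, "count_upper A b C" prints 2. Since mbtowc converts only the first character of its argument, passing a whole word tests that word's first letter.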
makegarbage.sh
Description: Bourne shell script
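The script itself is not reproduced here, so the following is only a rough sketch of the approach described in the message (random words from a wordlist, with every character used at least a minimum number of times). It is not the original makegarbage.sh. The function name make_garbage, its arguments, and the ten-words-per-line layout are all assumptions, and the punctuation and uppercase steps are left out. It also assumes an awk whose length() and substr() work on characters rather than bytes (for example gawk in a UTF-8 locale), which matters for text such as polytonic Greek.

```shell
#!/bin/sh
# make_garbage WORDLIST [MIN]
# Hypothetical sketch: print random words from WORDLIST (one word per line)
# until every character found in the list has been used at least MIN times
# (default 5). Not the original makegarbage.sh.
make_garbage() {
    wordlist=$1
    min=${2:-5}
    awk -v min="$min" '
        # Index every word by the characters it contains.
        NF {
            words[++n] = $1
            len = length($1)
            for (i = 1; i <= len; i++) {
                c = substr($1, i, 1)
                if (last[c] != n) {          # record each word once per character
                    byc[c, ++cnt[c]] = n
                    last[c] = n
                }
            }
        }
        END {
            srand()
            for (c in cnt) {
                # Draw random words containing c until c has been used min times.
                while (used[c] < min) {
                    w = words[byc[c, int(rand() * cnt[c]) + 1]]
                    len = length(w)
                    for (i = 1; i <= len; i++)
                        used[substr(w, i, 1)]++
                    line = (line == "" ? w : line " " w)
                    if (++perline == 10) { print line; line = ""; perline = 0 }
                }
            }
            if (line != "") print line
        }
    ' "$wordlist"
}
```

A call such as "make_garbage greek.words 5 > training.txt" (file names hypothetical) would write the nonsense text to a file. Drawing only words that contain a still-underused character guarantees the loop terminates, at the cost of the output being a little less uniformly random than picking from the whole list each time.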

