Replying to myself: looking at the FAQ, perhaps the unicharambigs file is the way to go for simple replacements?
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#the-unicharambigs-file
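
Until I have tried that, here is a rough sketch of the external Python fallback I mentioned. It is only a heuristic and not anything Tesseract provides: it splits the recognized path on "/" and swaps capital I for lowercase l only when the result is a segment I expect to see. The names KNOWN_SEGMENTS, fix_segment and fix_path are made up for illustration, and the vocabulary would have to come from the actual project tree.

```python
# Hypothetical vocabulary of path segments expected in the screenshots.
# In practice this could be built by walking the real source tree.
KNOWN_SEGMENTS = {"list", "collections4", "commons", "apache", "org", "main", "java", "src"}


def fix_segment(segment):
    """Swap capital I for lowercase l if that turns the segment into a known one."""
    if segment in KNOWN_SEGMENTS or "I" not in segment:
        return segment
    candidate = segment.replace("I", "l")
    # Only accept the substitution when it is confirmed by the vocabulary,
    # so legitimate capitals (e.g. "Iterator") are left alone.
    return candidate if candidate in KNOWN_SEGMENTS else segment


def fix_path(path):
    """Apply fix_segment to every '/'-separated segment of an OCR'd path."""
    return "/".join(fix_segment(s) for s in path.split("/"))


if __name__ == "__main__":
    print(fix_path("src/main/java/org/apache/commons/collections4/Iist/LazyListjava"))
```

This only handles the I/l confusion; the missing dot in "LazyListjava" would need a separate rule.
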
On Saturday, June 25, 2016 at 10:37:45 AM UTC-4, Titus Barik wrote:
>
> Thanks! Applying the simple 3x linear scaling to the image improved the
> results of recognition dramatically. The output is now:
>
> Java -
> commons-collections4/src/main/java/org/apache/commons/collections4/Iist/LazyListjava
> - Eclipse
>
> In the word "/list/", it is actually being recognized as a capital I
> (eye), not a lowercase l (el). This is not a huge problem, but I'm
> wondering if there's a simple way to correct for these sorts of issues in
> tesseract. With a user dictionary or user patterns? Otherwise, I can just
> fix these on a case-by-case basis using an external Python script.
>
> On Friday, June 24, 2016 at 4:01:42 AM UTC-4, Stef wrote:
>>
>> You could try and scale up the image before OCR. See section "Scale text
>> up" here <https://stb-tester.com/blog/2014/04/14/improving-ocr-accuracy>.
>>
>> Stef

