I don't know how Wikisource handles this internally, but at the very least Tesseract reports a confidence value (also called a confidence score or level) for each word it recognizes, and it can do the same at character level. To get at those values you normally need the hOCR output rather than the plain-text result.
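As a rough illustration of the idea, here is a minimal sketch that reads the word confidences out of an hOCR page and summarizes them. It assumes Tesseract-style hOCR, where each word is a span with class `ocrx_word` and the confidence appears as `x_wconf` in the `title` attribute. The names `WordConfParser` and `page_score` and the threshold of 60 are my own choices for the example, not anything Wikisource or Tesseract defines.

```python
import re
from html.parser import HTMLParser


class WordConfParser(HTMLParser):
    """Collect (word, confidence) pairs from ocrx_word spans in hOCR."""

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, confidence)
        self._conf = None    # confidence of the word currently open, if any
        self._text = []
        self._depth = 0      # nesting depth of tags inside the open word

    def handle_starttag(self, tag, attrs):
        if self._conf is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word span
            return
        a = dict(attrs)
        if "ocrx_word" in (a.get("class") or "").split():
            m = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", a.get("title") or "")
            if m:
                self._conf = float(m.group(1))
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._conf is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._conf is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._conf))
        self._conf = None


def page_score(hocr, threshold=60.0):
    """Return (mean word confidence, words below threshold) for one page.

    The threshold is an arbitrary assumption; tune it on your own scans.
    """
    parser = WordConfParser()
    parser.feed(hocr)
    if not parser.words:
        return None, []
    mean = sum(conf for _, conf in parser.words) / len(parser.words)
    low = [word for word, conf in parser.words if conf < threshold]
    return mean, low
```

Pages could then be ranked by their mean confidence, lowest first, to decide which ones to proofread or re-OCR. The hOCR itself can be produced with something like `tesseract page.png out hocr`.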
Cheers,

On Tue, 12 Mar 2019 at 17:27, Lars Aronsson (<[email protected]>) wrote:

> If you have a large digitization project, such as Wikisource,
> with many pages and books of scanned images and OCR text
> (originating from different sources and times),
> how do you assess the OCR quality and determine which pages
> are in most need of improved OCR or proofreading?
>
> Is spell checking (and a normal dictionary) the only useful tool?
> Would you count the number of spelling errors, or the ratio
> of errors to correct words? Has anyone done this?
>
>
> --
> Lars Aronsson ([email protected])
> Project Runeberg - free Nordic literature - http://runeberg.org/
>
> _______________________________________________
> Wikisource-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
