I don't know how Wikisource handles this internally, but at the very
least Tesseract reports a confidence value (also called a confidence
score or level) for each word it recognizes, and it can report one per
character as well. To read those values you normally need the hOCR
output rather than the plain-text result.
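
As a rough sketch of how that could be used to rank pages: hOCR stores
the word confidence as `x_wconf` in the `title` attribute of each
`ocrx_word` span (Tesseract writes such a file with something like
`tesseract page.png page hocr`). The class and function names below, the
threshold of 60 and the sample markup are all my own assumptions for
illustration, not anything Wikisource actually runs.

```python
import re
from html.parser import HTMLParser


class WordConfidenceParser(HTMLParser):
    """Collect (word, confidence) pairs from hOCR 'ocrx_word' spans."""

    def __init__(self):
        super().__init__()
        self.words = []     # list of (text, confidence) tuples
        self._conf = None   # confidence of the word span currently open
        self._text = []     # text fragments seen inside that span
        self._depth = 0     # nested spans inside the word (e.g. char boxes)

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._conf is not None:
            # A span nested inside a word span; remember it so its end tag
            # does not close the word prematurely.
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", attrs.get("title") or "")
            if match:
                self._conf = float(match.group(1))
                self._text = []

    def handle_data(self, data):
        if self._conf is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._conf is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._conf))
        self._conf = None


def page_confidence(hocr_text, threshold=60.0):
    """Return (mean word confidence, words scoring below threshold).

    The threshold is an arbitrary assumption; tune it on your own scans.
    """
    parser = WordConfidenceParser()
    parser.feed(hocr_text)
    parser.close()
    confidences = [conf for _, conf in parser.words]
    mean = sum(confidences) / len(confidences) if confidences else 0.0
    low = [word for word, conf in parser.words if conf < threshold]
    return mean, low


# Hand-written hOCR fragment, made up for this example.
sample = """
<span class='ocr_line' title='bbox 0 0 100 20'>
<span class='ocrx_word' title='bbox 0 0 40 20; x_wconf 96'>Project</span>
<span class='ocrx_word' title='bbox 45 0 100 20; x_wconf 41'>Runcbcrg</span>
</span>
"""

mean, low = page_confidence(sample)
```

Sorting pages by the mean, or by the share of low-confidence words,
would give a first ordering of which pages to proofread. It only works
for pages where the hOCR file was kept, so text from older or unknown
OCR sources would still need something like the dictionary check.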

cheers,

On Tue, 12 Mar 2019 at 17:27, Lars Aronsson (<l...@aronsson.se>)
wrote:

> If you have a large digitization project, such as Wikisource,
> with many pages and books of scanned images and OCR text
> (originating from different sources and times),
> how do you assess the OCR quality and determine which pages
> are in most need of improved OCR or proofreading?
>
> Is spell checking (and a normal dictionary) the only useful tool?
> Would you count the number of spelling errors, or the ratio
> of errors to correct words? Has anyone done this?
>
>
> --
>    Lars Aronsson (l...@aronsson.se)
>    Project Runeberg - free Nordic literature - http://runeberg.org/
>
>
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>
