Re: [Wikisource-l] Assessing OCR quality

scann Tue, 12 Mar 2019 13:39:47 -0700

I don't know how it works inside Wikisource, but at the very least
Tesseract reports a confidence value (also called a confidence score or
level) for how well it recognized each word, and it can report one at the
character level too. To read those values you normally need the hOCR
output.


cheers,

On Tue, 12 Mar 2019 at 17:27, Lars Aronsson (<[email protected]>)
wrote:

> If you have a large digitization project, such as Wikisource,
> with many pages and books of scanned images and OCR text
> (originating from different sources and times),
> how do you assess the OCR quality and determine which pages
> are in most need of improved OCR or proofreading?
>
> Is spell checking (and a normal dictionary) the only useful tool?
> Would you count the number of spelling errors, or the ratio
> of errors to correct words? Has anyone done this?
>
>
> --
>    Lars Aronsson ([email protected])
>    Project Runeberg - free Nordic literature - http://runeberg.org/
>
>
>
> _______________________________________________
> Wikisource-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>

