sorry *10 to 20* minutes per page

On Mon, Aug 24, 2020 at 12:43 PM J Hayes <[email protected]> wrote:
>
> yeah, as we know OCR is a pain point.
> i have some success, using the google ocr button to get a better result
> but i have also done hundreds of 2 column unzip edits, which can take
> me 1 minutes per page.
>
> we have requested an improved OCR at wishlist, which would take a
> comparison of proofread page versus text layer to drive an AI improved
> text layer. but no support. maybe we should propose to internet
> archive?
>
> cheers
>
> On Sat, Aug 22, 2020 at 6:12 PM Lars Aronsson <[email protected]> wrote:
> >
> > Apparently, Brewster Kahle wrote (via Federico Leva - Nemo):
> > > <http://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/>
> > >
> > > Take for example, this newspaper from 1847. The images
> > > <https://archive.org/details/sim_frederick-douglass-paper_1847-12-03_1_1>
> > > are not that great, but a person can read them:
> > >
> > > The problem is our computers’ optical character recognition tech gets
> > > it wrong
> > > <https://archive.org/stream/sim_frederick-douglass-paper_1847-12-03_1_1/sim_frederick-douglass-paper_1847-12-03_1_1_djvu.txt>,
> > > and the columns get confused.
> >
> > In my experience, working with ABBYY Finereader Professional,
> > you always need to manually check columns / zoning.
> > For just a few years of one newspaper, this might be a reasonable
> > manual work. But the problem is the same for centuries of
> > thousands of newspapers.
> >
> > When I scanned encyclopedias (printed in 2 columns in 20
> > volumes x 800 pages), I manually checked columns in the OCR
> > program.
> >
> > For Wikisource, we would need a way for the OCR program to
> > indicate how the zones (columns) are identified in the image,
> > and let the wiki user modify these zones before sending
> > each zone to the OCR program. It would be reasonable for
> > the WMF to fund a developer (or team of developers) to create
> > such a solution. There is already some solution for marking
> > parts of a picture, right? This needs to work within pages of
> > a PDF or Djvu file.
> >
> >
> > --
> > Lars Aronsson ([email protected])
> > Linköping, Sweden
> >
> > Project Runeberg - free Nordic literature - http://runeberg.org/
> >
> >
> >
> > _______________________________________________
> > Wikisource-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikisource-l
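
A rough sketch of the zone workflow described in the quoted message: the OCR program proposes zones, the user adjusts them, and each zone is then sent to OCR on its own. Everything here is hypothetical. `Zone`, `propose_columns`, `reading_order`, `ocr_page` and the fake OCR callable are illustrative names, not part of ABBYY FineReader, MediaWiki or any Wikisource tool, and the equal-width column guess only stands in for real layout detection.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Zone:
    """A rectangular region of a page image, in pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int


def propose_columns(width: int, height: int, n_columns: int) -> List[Zone]:
    """Stand-in for the OCR program's zone detection: equal-width columns."""
    step = width // n_columns
    return [Zone(i * step, 0, (i + 1) * step, height) for i in range(n_columns)]


def reading_order(zones: List[Zone]) -> List[Zone]:
    """Order zones left to right, then top to bottom (simple column layouts only)."""
    return sorted(zones, key=lambda z: (z.left, z.top))


def ocr_page(zones: List[Zone], ocr_zone: Callable[[Zone], str]) -> str:
    """Send each zone to the OCR callable separately and join the results."""
    return "\n\n".join(ocr_zone(z) for z in reading_order(zones))


# Zones proposed for a 2000 x 3000 pixel page with two columns.
zones = propose_columns(2000, 3000, 2)

# A user correction: the gutter is assumed to sit at x = 1040, not x = 1000.
zones = [replace(zones[0], right=1040), replace(zones[1], left=1040)]


# Fake OCR engine for demonstration; a real one would crop the image to the zone.
def fake_ocr(zone: Zone) -> str:
    return f"text from x={zone.left}..{zone.right}"


page_text = ocr_page(zones, fake_ocr)
```

Because each column is recognised separately and the results are joined in reading order, lines from neighbouring columns are never interleaved, which is the "columns get confused" failure discussed above. Sorting by left edge is only adequate for plain column layouts; a newspaper page with headlines spanning several columns would need a smarter ordering.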
