sorry *10 to 20* minutes per page

On Mon, Aug 24, 2020 at 12:43 PM J Hayes <[email protected]> wrote:
>
> yeah, as we know OCR is a pain point.
> i have some success, using the google ocr button to get a better result
> but i have also done hundreds of 2 column unzip edits, which can take
> me 1 minutes per page.
>
> we have requested an improved OCR at wishlist, which would take a
> comparison of proofread page versus text layer to drive an AI improved
> text layer. but no support. maybe we should propose to internet
> archive?
>
> cheers
>
> On Sat, Aug 22, 2020 at 6:12 PM Lars Aronsson <[email protected]> wrote:
> >
> > Apparently, Brewster Kahle wrote (via Federico Leva - Nemo):
> > > <http://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/>
> > >
> > > Take for example, this newspaper from 1847. The images
> > > <https://archive.org/details/sim_frederick-douglass-paper_1847-12-03_1_1>
> > > are not that great, but a person can read them:
> > >
> > > > The problem is our computers’ optical character recognition tech gets
> > > it wrong
> > > <https://archive.org/stream/sim_frederick-douglass-paper_1847-12-03_1_1/sim_frederick-douglass-paper_1847-12-03_1_1_djvu.txt>,
> > > and the columns get confused.
> >
> > In my experience, working with ABBYY Finereader Professional,
> > you always need to manually check columns / zoning.
> > For just a few years of one newspaper, this might be a reasonable
> > manual work. But the problem is the same for centuries of
> > thousands of newspapers.
> >
> > When I scanned encyclopedias (printed in 2 columns in 20
> > volumes x 800 pages), I manually checked columns in the OCR
> > program.
> >
> > For Wikisource, we would need a way for the OCR program to
> > indicate how the zones (columns) are identified in the image,
> > and let the wiki user modify these zones before sending
> > each zone to the OCR program. It would be reasonable for
> > the WMF to fund a developer (or team of developers) to create
> > such a solution. There is already some solution for marking
> > parts of a picture, right? This needs to work within pages of
> > a PDF or Djvu file.
> >
> >
> > --
> >    Lars Aronsson ([email protected])
> >    Linköping, Sweden
> >
> >    Project Runeberg - free Nordic literature - http://runeberg.org/
> >
> >
> >
> > _______________________________________________
> > Wikisource-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikisource-l