>There is no good OCR for languages like Malayalam.

Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
API (which is usable from a Wikisource gadget
<https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
will do OCR on Tamil. I can't vouch for these being "good", but they do
exist.
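
For anyone who wants to try the Vision API directly rather than through the
gadget, here is a rough sketch of what a request with a language hint might
look like. It is only an illustration: the function names
`build_ocr_request` and `run_ocr` and the `api_key` parameter are made up
for this example, and it assumes you have your own API key. I have not
checked how the gadget itself calls the API.

```python
import base64
import json
import urllib.request

# Public REST endpoint of the Cloud Vision API.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes: bytes, language_hint: str) -> dict:
    """Build the JSON body for one OCR request (hypothetical helper)."""
    return {
        "requests": [
            {
                # Image bytes are sent base64-encoded.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                # DOCUMENT_TEXT_DETECTION is the feature meant for dense text.
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # Language hint, e.g. "ta" for Tamil.
                "imageContext": {"languageHints": [language_hint]},
            }
        ]
    }


def run_ocr(image_bytes: bytes, language_hint: str, api_key: str) -> str:
    """Send the request and return the recognized text (needs network access)."""
    body = json.dumps(build_ocr_request(image_bytes, language_hint)).encode("utf-8")
    request = urllib.request.Request(
        f"{VISION_ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # An image with no detected text has no fullTextAnnotation in its response.
    return result["responses"][0].get("fullTextAnnotation", {}).get("text", "")
```

Calling `run_ocr(open("page.jpg", "rb").read(), "ta", api_key)` would then
return whatever text the service recognizes on that page scan; how good the
result is for these scripts is exactly the open question.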

On Sun, Dec 2, 2018 at 2:54 AM Shiju Alex <[email protected]> wrote:

> Hi
>
> Here are the answers
>
> > What does "converted to Unicode" mean? Converted from what exactly? Do
> > you maybe mean "converted via OCR (Optical character recognition) from
> > images in file formats (JPG, PNG, images in a PDF) which don't allow
> > marking text to a file format which allows marking text in those files?
>
>
> There is no good OCR for languages like Malayalam. So each scanned image is
> manually typed and proofread. For example, see the 7th page of this book
> <http://idb.ub.uni-tuebingen.de/opendigi/CiXIV125b#p=7&tab=transcript>.
> You can see the scan image on the right and the transcribed text for that
> page on the left in the *Transcript* tab. This was done for 136 books,
> and the total number of pages in these books is close to 25,700.
>
> > What would you want the script to do exactly? Pull the files from the
> > Tuebingen Digital Library and then mass-upload these files to Commons?
>
>
> Yes, this is what is required. We will handle the Unicode migration
> separately.
>
>
> Shiju Alex
>
> On Sun, Dec 2, 2018 at 4:07 PM Andre Klapper <[email protected]>
> wrote:
>
> > Hi,
> >
> > Great! Some questions below for better understanding what's wanted:
> >
> > On Sun, 2018-12-02 at 15:22 +0530, Shiju Alex wrote:
> > > Recently Tuebingen University
> > > <https://uni-tuebingen.de/en/university.html> (with support from the
> > > German Research Foundation) ran a project titled *Gundert Legacy
> > > project* to digitize close to 137,000 pages from *850 public domain
> > > books*.
> > >
> > > All these public domain books are in the South Indian languages
> > > *Malayalam, Kannada, Tamil, Tulu, and Telugu*. Of these, 293 books are
> > > in Malayalam, 187 in Kannada, 25 in Tamil, 4 in Telugu and Tulu.
> > >
> > > Also there was a separate sub-project, run as part of this project, to
> > > convert 136 titles in Malayalam to Malayalam Unicode. The number of
> > > pages that were converted to Unicode is close to *25,700*. The Unicode
> > > conversion project was run only for Malayalam. For the other languages
> > > it is just the scanning of books.
> >
> > What does "converted to Unicode" mean? Converted from what exactly? Do
> > you maybe mean "converted via OCR (Optical character recognition) from
> > images in file formats (JPG, PNG, images in a PDF) which don't allow
> > marking text to a file format which allows marking text in those files?
> >
> > > The project is complete now and the results of the project are
> > > available in the Hermann Gundert Portal
> > > https://www.gundert-portal.de/?language=en which was released on
> > > Nov 20. A news report is available here:
> > > <https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mission-all-set-to-get-a-digital-makeover/articleshow/66633108.cms>
> > >
> > > To view the books in each language you can navigate through the various
> > > links in the portal. For example, Malayalam books are available here:
> > > https://www.gundert-portal.de/?page=malayalam
> > >
> > > Now we need to upload these scans to Wikimedia Commons and the Unicode
> > > text to Malayalam Wikisource (25,700 Unicode-converted pages).
> > >
> > > The first priority is the scans that are converted to Unicode. Is it
> > > possible to write a script to migrate the scans from the Tuebingen
> > > Digital Library to Wikimedia Commons? (I can share the exact details of
> > > the books converted to Unicode if needed.)
> >
> > What would you want the script to do exactly? Pull the files from the
> > Tuebingen Digital Library and then mass-upload these files to Commons?
> > OCR (identifying letters in pure images and converting those letters to
> > text which could be marked and copied)? Something else?
> >
> > To convert image files available on Wikimedia Commons to recognized
> > text, see https://tools.wmflabs.org/ws-google-ocr/ for example. There
> > is also https://phabricator.wikimedia.org/T120788 for more info/tools.
> >
> > > All the digitized files are heavy; the size ranges from 100 MB to
> > > 1.5 GB depending on the number of pages in the book. So managing this
> > > manually is going to be a big challenge.
> > >
> > > Can someone help with this?
> >
> > Cheers,
> > andre
> > --
> > Andre Klapper | Bugwrangler / Developer Advocate
> > https://blogs.gnome.org/aklapper/
> >
> >
> >
> > _______________________________________________
> > Wikitech-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l