Re: [Wikitech-l] Book scans from Tuebingen Digital Library to Wikimedia Commons

Shiju Alex Mon, 03 Dec 2018 08:06:54 -0800

> Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading
> ? I think the folks at commons are more likely to be able to give you
> the help you need than wikitech-l would be.

Thank you. I was not aware of this option. Let me try this.

Shiju Alex



On Mon, Dec 3, 2018 at 1:55 PM bawolff <[email protected]> wrote:

> Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading
> ? I think the folks at commons are more likely to be able to give you
> the help you need than wikitech-l would be.
>
> --
> Brian
>
> On Mon, Dec 3, 2018 at 5:22 AM Shiju Alex <[email protected]>
> wrote:
> >
> > >
> > > Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
> > > API (which is usable from a Wikisource gadget
> > > <https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
> > > will do OCR on Tamil. I can't vouch for these being "good", but they do
> > > exist.
> >
> >
> > The request in this post is not to create OCR for any language or
> > script, but to migrate certain public domain book scans from the Tuebingen
> > Digital Library to Wikimedia Commons.
> >
> > There is also another task of migrating the *already proofread Unicode text*
> > to Wikisource. But before we can take up the Unicode migration, the scans
> > need to be on Commons.
> >
> > I am making this request only because of the huge number of pages that we
> > need to handle. If it were just a few hundred pages, volunteers would have
> > done it manually.
> >
> >
> > Shiju
> >
> >
> > On Mon, Dec 3, 2018 at 10:01 AM Ryan Kaldari <[email protected]> wrote:
> >
> > > >There is no good OCR for languages like Malayalam.
> > >
> > > Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
> > > API (which is usable from a Wikisource gadget
> > > <https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
> > > will do OCR on Tamil. I can't vouch for these being "good", but they do
> > > exist.
> > >
> > > On Sun, Dec 2, 2018 at 2:54 AM Shiju Alex <[email protected]>
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > Here are the answers
> > > >
> > > > > What does "converted to Unicode" mean? Converted from what exactly? Do
> > > > > you maybe mean "converted via OCR (Optical character recognition) from
> > > > > images in file formats (JPG, PNG, images in a PDF) which don't allow
> > > > > marking text to a file format which allows marking text in those files"?
> > > >
> > > >
> > > > There is no good OCR for languages like Malayalam. So each scanned image
> > > > is manually typed and proofread. For example, see the 7th page of this book
> > > > <http://idb.ub.uni-tuebingen.de/opendigi/CiXIV125b#p=7&tab=transcript>.
> > > > You can see the scan image on the right and the transcribed text for that
> > > > page on the left in the *Transcript* tab. This has been done for 136 books,
> > > > totalling close to 25,700 pages.
> > > >
> > > > > What would you want the script to do exactly? Pull the files from the
> > > > > Tuebingen Digital Library and then mass-upload these files to Commons?
> > > >
> > > >
> > > > Yes, this is what is required. We will handle the Unicode migration
> > > > separately.
> > > >
> > > >
> > > > Shiju Alex
> > > >
> > > > On Sun, Dec 2, 2018 at 4:07 PM Andre Klapper <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Great! Some questions below for better understanding what's wanted:
> > > > >
> > > > > On Sun, 2018-12-02 at 15:22 +0530, Shiju Alex wrote:
> > > > > > > Recently Tuebingen University
> > > > > > > <https://uni-tuebingen.de/en/university.html> (with support from the
> > > > > > > German Research Foundation) ran a project titled the *Gundert Legacy
> > > > > > > project* to digitize close to 137,000 pages from *850 public domain
> > > > > > > books*.
> > > > > > >
> > > > > > > All these public domain books are in the South Indian languages
> > > > > > > *Malayalam, Kannada, Tamil, Tulu, and Telugu*. Of these, 293 books are
> > > > > > > in Malayalam, 187 in Kannada, 25 in Tamil, and 4 in Telugu and Tulu.
> > > > > > >
> > > > > > > There was also a separate sub-project, run as part of this project, to
> > > > > > > convert 136 Malayalam titles to Malayalam Unicode. Close to *25,700*
> > > > > > > pages were converted to Unicode. The Unicode conversion was run only
> > > > > > > for Malayalam; for the other languages, the books were only scanned.
> > > > >
> > > > > > What does "converted to Unicode" mean? Converted from what exactly? Do
> > > > > > you maybe mean "converted via OCR (Optical character recognition) from
> > > > > > images in file formats (JPG, PNG, images in a PDF) which don't allow
> > > > > > marking text to a file format which allows marking text in those files"?
> > > > >
> > > > > > > The project is now complete and its results are available in the
> > > > > > > Hermann Gundert Portal https://www.gundert-portal.de/?language=en
> > > > > > > which was released on Nov 20. A news report is available here:
> > > > > > > <https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mission-all-set-to-get-a-digital-makeover/articleshow/66633108.cms>
> > > > > > >
> > > > > > > To view the books in each language you can navigate through the various
> > > > > > > links in the portal. For example, Malayalam books are available here:
> > > > > > > https://www.gundert-portal.de/?page=malayalam
> > > > > >
> > > > > > > Now we need to upload these scans to Wikimedia Commons and the Unicode
> > > > > > > text to Malayalam Wikisource (25,700 Unicode-converted pages).
> > > > > > >
> > > > > > > The first priority is the scans that have been converted to Unicode.
> > > > > > > Is it possible to write a script to migrate the scans from the
> > > > > > > Tuebingen Digital Library to Wikimedia Commons? (I can share the exact
> > > > > > > details of the books converted to Unicode if needed.)
> > > > >
> > > > > > What would you want the script to do exactly? Pull the files from the
> > > > > > Tuebingen Digital Library and then mass-upload these files to Commons?
> > > > > > OCR (identify letters in pure images and converting those letters to
> > > > > > text which could be marked and copied)? Something else?
> > > > > >
> > > > > > To convert image files available on Wikimedia Commons to recognized
> > > > > > text, see https://tools.wmflabs.org/ws-google-ocr/ for example. There
> > > > > > is also https://phabricator.wikimedia.org/T120788 for more info/tools.
> > > > >
> > > > > > > All the digitized files are large, ranging from 100 MB to 1.5 GB
> > > > > > > depending on the number of pages in the book, so managing this
> > > > > > > manually is going to be a big challenge.
> > > > > > >
> > > > > > > Can someone help with this?
> > > > >
> > > > > Cheers,
> > > > > andre
> > > > > --
> > > > > Andre Klapper | Bugwrangler / Developer Advocate
> > > > > https://blogs.gnome.org/aklapper/
> > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Wikitech-l mailing list
> > > > > [email protected]
> > > > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
