Thank you. I was not aware of this option. Let me try this.

Shiju Alex

On Mon, Dec 3, 2018 at 1:55 PM bawolff <[email protected]> wrote:

> Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading ?
> I think the folks at Commons are more likely to be able to give you
> the help you need than wikitech-l would be.
>
> --
> Brian
>
> On Mon, Dec 3, 2018 at 5:22 AM Shiju Alex <[email protected]> wrote:
>
> > > Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
> > > API (which is usable from a Wikisource gadget
> > > <https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
> > > will do OCR on Tamil. I can't vouch for these being "good", but they do
> > > exist.
> >
> > The request in this post is not for creating an OCR for any language
> > script, but to migrate certain Public Domain book scans from Tuebingen
> > digital library to Wikimedia Commons.
> >
> > Also there is another task of migrating *already proofread Unicode text* to
> > Wikisource. But to take up the Unicode migration, first the scans need to be
> > in Commons.
> >
> > I am making this request only because of the huge number of pages that we
> > need to handle. If it was just a few hundred pages, volunteers would have
> > done it manually.
> >
> > Shiju
> >
> > On Mon, Dec 3, 2018 at 10:01 AM Ryan Kaldari <[email protected]> wrote:
> >
> > > > There is no good OCR for languages like Malayalam.
> > >
> > > Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
> > > API (which is usable from a Wikisource gadget
> > > <https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
> > > will do OCR on Tamil. I can't vouch for these being "good", but they do
> > > exist.
> > >
> > > On Sun, Dec 2, 2018 at 2:54 AM Shiju Alex <[email protected]> wrote:
> > >
> > > > Hi
> > > >
> > > > Here are the answers
> > > >
> > > > > What does "converted to Unicode" mean? Converted from what exactly? Do
> > > > > you maybe mean "converted via OCR (Optical character recognition) from
> > > > > images in file formats (JPG, PNG, images in a PDF) which don't allow
> > > > > marking text to a file format which allows marking text in those files"?
> > > >
> > > > There is no good OCR for languages like Malayalam. So each scanned image is
> > > > manually typed and proofread. For example, see the 7th page of this book
> > > > <http://idb.ub.uni-tuebingen.de/opendigi/CiXIV125b#p=7&tab=transcript>.
> > > > You can see the scan image on the right and the transcribed text for that
> > > > page on the left in the *Transcript* tab. This is done for 136 books, and
> > > > the total number of pages in these books is close to 25,700.
> > > >
> > > > > What would you want the script to do exactly? Pull the files from the
> > > > > Tuebingen Digital Library and then mass-upload these files to Commons?
> > > >
> > > > Yes, this is what is required. Unicode migration we will handle separately.
> > > >
> > > > Shiju Alex
> > > >
> > > > On Sun, Dec 2, 2018 at 4:07 PM Andre Klapper <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Great!
> > > > > Some questions below for better understanding what's wanted:
> > > > >
> > > > > On Sun, 2018-12-02 at 15:22 +0530, Shiju Alex wrote:
> > > > > > Recently Tuebingen University
> > > > > > <https://uni-tuebingen.de/en/university.html> (with
> > > > > > the support from German Research Foundation) ran a project titled
> > > > > > *Gundert Legacy project* to digitize close to 137,000 pages from
> > > > > > *850 public domain books*.
> > > > > >
> > > > > > All these public domain books are in the South Indian languages
> > > > > > *Malayalam, Kannada, Tamil, Tulu, and Telugu*. Of these, 293 books are in
> > > > > > Malayalam, 187 in Kannada, 25 in Tamil, 4 in Telugu and Tulu.
> > > > > >
> > > > > > Also there was a separate sub-project which was run as part of this
> > > > > > project to convert 136 titles in Malayalam to Malayalam Unicode. The
> > > > > > number of pages that were converted to Unicode is close to *25,700*
> > > > > > pages. The Unicode conversion project was run only for Malayalam. For
> > > > > > the other languages it is just the scanning of books
> > > > >
> > > > > What does "converted to Unicode" mean? Converted from what exactly? Do
> > > > > you maybe mean "converted via OCR (Optical character recognition) from
> > > > > images in file formats (JPG, PNG, images in a PDF) which don't allow
> > > > > marking text to a file format which allows marking text in those files"?
> > > > >
> > > > > > The project is complete now and the results of the project are
> > > > > > available in the Hermann Gundert Portal
> > > > > > https://www.gundert-portal.de/?language=en which
> > > > > > was released on Nov 20. A news report is available here.
> > > > > > <https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mission-all-set-to-get-a-digital-makeover/articleshow/66633108.cms>
> > > > > >
> > > > > > To view the books in each language you can navigate through the various
> > > > > > links in the portal. For example, Malayalam books are available here:
> > > > > > https://www.gundert-portal.de/?page=malayalam
> > > > > >
> > > > > > Now we need to upload these scans to Wikimedia Commons and Unicode text
> > > > > > to Malayalam Wikisource (25,700 Unicode converted pages)
> > > > > >
> > > > > > The first priority is for the scans that are converted to Unicode. Is it
> > > > > > possible to write a script to migrate the scans from Tuebingen Digital
> > > > > > library to Wikimedia Commons? (I can share the exact details of books
> > > > > > converted to Unicode if needed)
> > > > >
> > > > > What would you want the script to do exactly? Pull the files from the
> > > > > Tuebingen Digital Library and then mass-upload these files to Commons?
> > > > > OCR (identify letters in pure images and convert those letters to
> > > > > text which could be marked and copied)? Something else?
> > > > >
> > > > > To convert image files available on Wikimedia Commons to recognized
> > > > > text, see https://tools.wmflabs.org/ws-google-ocr/ for example. There
> > > > > is also https://phabricator.wikimedia.org/T120788 for more info/tools.
> > > > >
> > > > > > All the digitized files are heavy and the size ranges from 100 MB to
> > > > > > 1.5 GB depending on the number of pages in the books. So manually
> > > > > > managing this is going to be a big challenge.
> > > > > >
> > > > > > Can someone help with this?
> > > > >
> > > > > Cheers,
> > > > > andre
> > > > > --
> > > > > Andre Klapper | Bugwrangler / Developer Advocate
> > > > > https://blogs.gnome.org/aklapper/

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
