I don't think there is a need for an either-or approach. Development of
effective OCR for Indian languages should be encouraged (Sankarshan gave a
comprehensive overview of the developments in an earlier post), but for the
immediate need typing is still an effective option.

I read "cheaper" in the context of the humongous amounts spent on OCR by
various agencies, especially the GoI, over the last decade, for which we are
yet to see effective results. Putting together a list of all the OCR
developments in Indian languages and the expenses incurred (a status paper?),
especially by the GoI, could also be useful in thinking about future work on
this.

Best,
Vishnu


On 21 August 2013 12:35, Dhaval S. Vyas <dsv...@gmail.com> wrote:

> When we say "cheaper", are we considering the recurring cost of human labour
> (for which the future is uncertain) or just taking into account the initial
> one-off cost of software development?
>
> Regards,
> Dhaval
> On 21 Aug 2013 08:01, "Pavanaja U B" <pavan...@vishvakannada.com> wrote:
>
>> I second Tejaswini. Those who are working on Kannada OCR development also
>> say the same.
>>
>> Regards,
>> Pavanaja
>>
>> *From:* wikimediaindia-l-boun...@lists.wikimedia.org [mailto:
>> wikimediaindia-l-boun...@lists.wikimedia.org] *On Behalf Of *Tejaswini
>> Niranjana
>> *Sent:* 21 August 2013 11:24
>> *To:* Wikimedia India Community list
>> *Subject:* Re: [Wikimediaindia-l] Indic print material digitization
>> workshop query
>>
>> Colleagues working in Bangla say that in their experience it is faster,
>> cheaper, and less error-prone to create digital texts by typing them in.
>> Once there is a larger body of digitised texts, and OCR technology for
>> Indian languages also improves, OCR could become the preferred option.
>>
>> Tejaswini
>>
>> On 19 August 2013 22:38, Aarti K. Dwivedi <ellydwivedi2...@gmail.com>
>> wrote:
>>
>> Hi Everyone,
>>
>> In my opinion, it is always better to OCR the documents. I agree that it
>> is error-prone, but there is a Google Summer of Code project being done by
>> AnkurIndia whose aim is to improve the quality of OCR for Indian scripts.
>> https://www.google-melange.com/gsoc/project/google/gsoc2013/knoxxs/5001
>>
>> So, maybe not immediately but in a short time, OCR is worth it. I am not
>> aware of whether any Wikisource in Indian languages is as vast as the
>> French, English or Italian Wikisource. But we should have one because we
>> have quite a lot of text.
>>
>> Thank You,
>> Aarti
>>
>> On Mon, Aug 19, 2013 at 10:28 PM, Ashwin Baindur <
>> ashwin.bain...@gmail.com> wrote:
>>
>> Whether to OCR or not to OCR is a significant issue! When we OCR a page
>> of text, the result is often error-prone and loses formatting, and fixing
>> it requires crowd-sourced correction. Many of us know about Project
>> Gutenberg. The site provides plain vanilla etexts. But what most people do
>> not know is that one of the very first crowd-sourcing initiatives,
>> "Distributed Proofreaders", provides a huge volunteer community correcting
>> OCR pages of text submitted to Project Gutenberg. In fact, I was a
>> Distributed Proofreader before coming to Wikipedia and that was my first
>> crowd-sourced experience.
>>
>> http://www.pgdp.net/c/
>>
>> I've also done digitisation in a government archive for five years. We
>> took a conscious decision to OCR the text and allow the uncorrected layer
>> to exist rather than take the pains to correct it. The material was used so
>> infrequently that it made good sense for the end-user to proof-read it
>> himself should he desire to do so. So the real challenge in digitisation is
>> not OCR, or rather, not just OCR, but the creation of an error-free,
>> proof-read text layer behind the PDF or other formatted archive document.
>>
>> Ashwin Baindur
>>
>> On Mon, Aug 19, 2013 at 10:12 PM, Sumana Harihareswara <
>> suma...@wikimedia.org> wrote:
>>
>> On 08/19/2013 02:52 AM, L. Shyamal wrote:
>> > Re-posting a now outdated query from meta
>> >
>> > http://meta.wikimedia.org/wiki/Talk:India_Access_To_Knowledge/Events/Bangalore/Digitization_workshop_18August2013
>> >
>> > Now that the workshop has already been conducted, I think those who
>> > attended could comment on whether it covered Indic language OCR-ing.
>> > If it did, it would be worthwhile to document the OCR software used
>> > on the meta pages or elsewhere, such as Wikisource. Most of the more
>> > experienced editors here will be fairly familiar with the use of
>> > scanners for creating PDF documents and uploading them to places like
>> > the Internet Archive, but the experience or knowledge of OCRs and their
>> > success rates is a bit wanting for Indic languages (fonts).
>> >
>> > best wishes
>> > Shyamal
>> > en:User:Shyamal
>>
>> I looked at the talk page on Meta - thank you, Shyamal!
>>
>> For those who do not know: OCR means Optical Character Recognition.
>> When we want to get archival documents onto the web, it's nice to have
>> photos of them, but it's even better to OCR them so that people can
>> clearly read, copy, excerpt, translate, and remix the text.
>>
>> Is there a central list of the problems that OCR software (especially
>> open source OCR software) has with text written in Indic languages?  If
>> so, I could help encourage people to fix those problems, as volunteers,
>> via a Google Summer of Code/Outreach Program for Women internship, via a
>> grant-funded project (such as https://meta.wikimedia.org/wiki/Grants:IEG
>> ), or via some other method.
>>
>> People who would like to make Wikisource more easily useful for Indic
>> languages might want to contribute to the Wikisource vision development
>> project that's going on right now:
>>
>> https://wikisource.org/wiki/Wikisource_vision_development
>>
>> The ProofreadPage extension (part of the Wikisource technology stack) is
>> being worked on right now in Aarti K. Dwivedi's Google Summer of Code
>> internship.  http://aartindi.blogspot.in/  She might be interested in
>> knowing about these issues, so I am cc'ing her.
>>
>> Also - just because people on this list might be interested! - if you
>> have an old historical map that you'd like to vectorize to get it onto
>> OpenStreetMap, try out the new "Map polygon and feature extractor" tool:
>> https://github.com/NYPL/map-vectorizer
>>
>> --
>> Sumana Harihareswara
>> Engineering Community Manager
>> Wikimedia Foundation
>>
>> _______________________________________________
>> Wikimediaindia-l mailing list
>> Wikimediaindia-l@lists.wikimedia.org
>> To unsubscribe from the list / change mailing preferences visit
>> https://lists.wikimedia.org/mailman/listinfo/wikimediaindia-l
>>
>> --
>> Warm regards,
>>
>> Ashwin Baindur
>> ------------------------------------------------------
>>
>> --
>> Aarti K. Dwivedi
>>
>> --
>> Tejaswini Niranjana, PhD
>> Lead Researcher - Higher Education Innovation and Research Applications
>> (HEIRA)
>> Senior Fellow - Centre for the Study of Culture and Society (CSCS)
>> Visiting Professor - Tata Institute of Social Sciences (TISS)
>>
>> Advisor, Access to Knowledge Programme, Centre for Internet and Society
>> Visiting Faculty - Centre for Contemporary Studies, Indian Institute
>> of Science (CCS-IISc)
>>
>> t: 91-80-41202302
>> http://heira.in
>> www.cscs.res.in
>>
