Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-21 Thread billinghurst
Apologies for not replying earlier.

I have managed to get the attention of WMF staff and they have pushed this
to the right section within WMF to talk to Google.

I suggest that we give them a week to get their heads around the issues and
to ask questions.

This falls into the important, though not screamingly urgent, category.

We should have a Phabricator ticket for this so that we can track it better.

--Billinghurst

On Mon, 22 Feb 2016 03:13 Bodhisattwa Mandal 
wrote:

> Hi,
>
> Of course, I am aware that Google's goal does not match with ours. But I
> am talking about possibility of any negotiation in this matter because we
> don't have other options but to use the Google OCR tool. If we had other
> better OCR options, I would not raise the issue.
>
> By the way, we are not using Cloud Vision API for the script now, so still
> we are doing it without paying any money, but this shows that may be in
> near future, we have to pay them. I am just being cautious in advance.
>
> There may or may not be any negotiation, either way, we will utilise the
> Google OCR fully as far as we can. We will find other ways to do it.
>
> Regards,
> On Feb 21, 2016 9:04 PM, "Mathieu Stumpf Guntz" <
> psychosl...@culture-libre.org> wrote:
>
>>
>>
>> On 20/02/2016 13:07, Federico Leva (Nemo) wrote:
>>
>>> Bodhisattwa Mandal, 19/02/2016 18:02:
>>>
>>>> And when we were getting some hope, Google announced that they will
>>>> charge for doing OCR using their drive.
>>>> https
>>>>
>>>
>>> Makes sense.
>>>
>>>
>>>> Is there any chance that WMF will go for negotiation with Google so that
>>>> we can do the mass OCR free of charge?
>>>>
>>>
>>> What makes you think that Google's goals may match with ours?
>>>
>> Just asking will produce more certainty than speculating on matching
>> agendas. :)
>>
>>>
>>> Nemo
___
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-21 Thread Nicolas VIGNERON
2016-02-21 22:58 GMT+01:00 Federico Leva (Nemo) :

> Nicolas VIGNERON, 21/02/2016 21:39:
>
>> Maybe there is other alternatives but no one has pointed even the
>> beginning of an option.
>>
>
> It would be easier to point options if we had answers to the basic
> question "for which languages exactly do we need another OCR?".
>
> https://lists.wikimedia.org/pipermail/wikisource-l/2016-February/002712.html
>
> If people on this list are unable/unwilling to answer, can someone suggest
> where else/how to get/build an answer?
>

It seems to me that the answer has already been given:
ideally all the Wikisources need an OCR, and in particular the Indic
languages have no free OCR (AFAIK). Bodhisattwa pointed to
http://wiki.wikimedia.in/List_of_Indian_language_wiki_projects last
December on the community wishlist, and you gave the other half of the
answer yesterday.

Meanwhile, I agree with your very last part: we should put this somewhere
public (on oldwikisource? on meta?) to have a broader view and gather
more insight on the subject (not only Indic).

Regards, ~nicolas


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-21 Thread Federico Leva (Nemo)

Nicolas VIGNERON, 21/02/2016 21:39:

> Maybe there is other alternatives but no one has pointed even the
> beginning of an option.


It would be easier to point options if we had answers to the basic 
question "for which languages exactly do we need another OCR?".

https://lists.wikimedia.org/pipermail/wikisource-l/2016-February/002712.html

If people on this list are unable/unwilling to answer, can someone 
suggest where else/how to get/build an answer?


Nemo



Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-21 Thread Nicolas VIGNERON
2016-02-21 21:21 GMT+01:00 Andrea Zanni :

>
> On Sun, Feb 21, 2016 at 9:01 PM, Federico Leva (Nemo) 
> wrote:
>
>> This is not true, of course. There is always an alternative, the question
>> is which alternative is worth pursuing.
>
>
>
> Nemo, please be reasonable.
>

+1. We need to be realistic.

Right now, and as far as I know, the only known alternatives are:
- do nothing (it is indeed an alternative, but a very bad one),
- create a new OCR from scratch (probably the best option in the long run,
but something that will take years at least and a huge amount of resources
nobody has; not even FineReader, backed by a big professional company which
has existed for 27 years and has more than 2000 employees).

Maybe there are other alternatives, but no one has pointed to even the
beginning of an option.

Regards, ~nicolas


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-21 Thread Bodhisattwa Mandal
Hi,

Another alternative is to develop a good-quality open-source OCR or to
improve existing ones. Many in India and Bangladesh tried to create an OCR
with government funds, but no one knows what happened to those projects.
We approached some of them, but either they were reluctant to show the
results or they did not bother. WMF and WMIN were also approached to
develop the OCR, but we were told that they have no infrastructure or
expertise to run such a project.

Besides, developing a new OCR will take a lot of time, and we can't postpone
our Wikisource projects until then. We have already waited a long time
for a good-quality OCR. A few months ago, we were typing every page of a
novel word by word, and that was our only way of proofreading. :-) But
that's in the past now.

We always hope to get better alternatives and if we find any, we will
definitely try to pursue it.

Regards,
On Feb 22, 2016 1:32 AM, "Federico Leva (Nemo)"  wrote:

> Bodhisattwa Mandal, 21/02/2016 17:13:
>
>> we don't have other options but to use the Google OCR tool.
>>
>
> This is not true, of course. There is always an alternative, the question
> is which alternative is worth pursuing.
>
> Nemo


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-21 Thread Andrea Zanni
On Sun, Feb 21, 2016 at 9:01 PM, Federico Leva (Nemo) 
wrote:

> This is not true, of course. There is always an alternative, the question
> is which alternative is worth pursuing.



Nemo, please be reasonable.
If members from Indic communities say that there is an issue, and they have
known that for years and tried to cope with it in different ways, and now
they found that the Google OCR is finally available and working, we (who
don't know the problem first-hand) should just shut up and listen. As a
community, we should presume not just good faith but sometimes also *the
best knowledge*, in things we don't even have the basic literacy to understand.

I personally don't find any ethical issue in the WMF talking to Google
about this: *language equity* (meaning a fundamental equality between
languages) is a value per se, a value we should treasure as an
international community.

So, statements like yours are probably said in good faith and spirit (I
presume that because I know you and your rare, precious dedication to
Wikimedia) but are not easy for others to understand.
In the end, they are not helpful and feel harsh and negative.
So please abstain or try to elaborate your point in a constructive,
empathetic way.
Thanks.

Aubrey


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-21 Thread Federico Leva (Nemo)

Bodhisattwa Mandal, 21/02/2016 17:13:

> we don't have other options but to use the Google OCR tool.


This is not true, of course. There is always an alternative, the 
question is which alternative is worth pursuing.


Nemo



Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-21 Thread Bodhisattwa Mandal
Hi,

Of course, I am aware that Google's goals do not match ours. But I am
talking about the possibility of negotiation in this matter because we
have no option other than the Google OCR tool. If we had other, better
OCR options, I would not have raised the issue.

By the way, we are not using the Cloud Vision API for the script now, so we
are still doing it without paying any money, but this shows that in the
near future we may have to pay them. I am just being cautious in advance.

Whether or not there is any negotiation, we will use the Google OCR as
fully as we can. We will find other ways to do it.

Regards,
On Feb 21, 2016 9:04 PM, "Mathieu Stumpf Guntz" <
psychosl...@culture-libre.org> wrote:

>
>
> On 20/02/2016 13:07, Federico Leva (Nemo) wrote:
>
>> Bodhisattwa Mandal, 19/02/2016 18:02:
>>
>>> And when we were getting some hope, Google announced that they will
>>> charge for doing OCR using their drive.
>>> https 
>>>
>>
>> Makes sense.
>>
>>
>>> Is there any chance that WMF will go for negotiation with Google so that
>>> we can do the mass OCR free of charge?
>>>
>>
>> What makes you think that Google's goals may match with ours?
>>
> Just asking will produce more certainty than speculating on matching
> agendas. :)
>
>>
>> Nemo


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-21 Thread Mathieu Stumpf Guntz



On 20/02/2016 13:07, Federico Leva (Nemo) wrote:

> Bodhisattwa Mandal, 19/02/2016 18:02:
>
>> And when we were getting some hope, Google announced that they will
>> charge for doing OCR using their drive.
>> https
>
> Makes sense.
>
>> Is there any chance that WMF will go for negotiation with Google so that
>> we can do the mass OCR free of charge?
>
> What makes you think that Google's goals may match with ours?

Just asking will produce more certainty than speculating on matching
agendas. :)

> Nemo



Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-20 Thread Federico Leva (Nemo)
Trying to answer myself with some cleaning. So, which existing or 
potential Wikisources lack OCR with FineReader (and Tesseract?) but are 
interested in Google's, beyond Bengali?
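
A comparison like the one in the lists below can be sketched as a plain set difference over two one-language-per-line lists. This is only an illustration under assumptions, not the procedure actually used to produce the lists: the function names are hypothetical and the sample data is a tiny excerpt of the names that appear in this message.

```python
def parse_languages(text):
    """Return the set of non-empty, stripped lines of a one-per-line list."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare(finereader_text, google_text):
    """Return (only in FineReader, only in Google), each sorted by name."""
    finereader = parse_languages(finereader_text)
    google = parse_languages(google_text)
    return sorted(finereader - google), sorted(google - finereader)


# Tiny excerpts for illustration only; the real lists are much longer.
finereader_sample = """Abkhaz
Bashkir
English
Udmurt
Ukrainian
"""
google_sample = """Bengali
Bashkir
English
Udmurt Ukrainian
"""

only_finereader, only_google = compare(finereader_sample, google_sample)
print(only_finereader)  # ['Abkhaz', 'Udmurt', 'Ukrainian']
print(only_google)      # ['Bengali', 'Udmurt Ukrainian']
```

Exact string matching is sensitive to how each vendor spells a name: two entries fused on one line ("Udmurt Ukrainian") or variant spellings ("Ojibway" and "Ojibwa", "Ossetian" and "Ossetic") end up on both sides of the difference, so the result still needs a manual check.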


== Languages supported by FineReader, not Google ==

Abkhaz
Adyghe
Agul
Altaic
Avar
Blackfoot
Bugotu
Buryat
Chamorro
Chukchee
Corsican
Crow
Dargwa
Dungan
Dutch (Belgium)
Eskimo (Cyrillic, Latin)
Even
Evenki
Frisian
Friulian
Gagauz
German (Luxemburg)
German (old spelling)
Hani
Ido
Ingush
Interlingua
Jingpo
Kabardian
Kalmyk
Karachay-balkar
Kasub
Kawa
Khakass
Khanty
Korean (Hangul)
Koryak
Kpelle
Kumyk
Kurdish
Lak
Lezgi
Luba
Malinke
Mansi
Mari
Maya
Miao
Moldavian
Mordvin
Nenets
Nivkh
Nogay
Norwegian (Nynorsk)
Occidental
Ojibway
Ossetian
Provencal
Rhaeto-romanic
Rwanda
Sami (Lappish)
Selkup
Somali
Sorbian
Sotho
Sunda
Tabasaran
Tagalog
Tok Pisin
Tun
Turkmen (Latin)
Tuvinian
Udmurt
Uighur (Cyrillic, Latin)
Ukrainian
Yakut

== Languages supported by Google, not FineReader ==

Acehnese
Acholi
Adangme
Akan
Algonquinian
Amharic
Ancient Greek
Araucanian/Mapuche
Assamese
Asturian
Athabaskan
Balinese
Bambara
Bantu
Batak
Bengali
Bikol
Bislama
Bosnian
Burmese
Cherokee
Chinese (Mandarin; Hong Kong)
Choctaw
Cree
Creek
Dhivehi
Duala
Dzonkha
Efik
Ewe
Filipino
Fon
Fulah
Ga
Gayo
Georgian
Gilbertese
Gothic
Gujarati
Haitian Creole
Herero
Hiligaynon
Hindi
Iban
Igbo
Iloko
Javanese
Kabyle
Kachin
Kalaallisut
Kamba
Kannada
Kanuri
Khasi
Khmer
Kinyarwanda
Komi
Kosraean
Kuanyama
Lao
Lingala
Low German
Lozi
Luba-Katanga
Luo
Madurese
Malayalam
Mandingo
Manx
Marathi
Marshallese
Mende
Middle English
Middle High German
Mongo
Navajo
Ndonga
Nepali
Niuean
Northern Sotho
North Ndebele
Nyankole
Nyasa Tonga
Nzima
Ojibwa
Old English
Old French
Old High German
Old Norse
Old Provencal
Oriya
Ossetic
Pampanga
Pangasinan
Pashto
Persian
Punjabi (Gurmukhi)
Romansh
Sakha
Sango
Sanskrit
Scots
Sinhala
Songhai
Southern Sotho
Sundanese
Tamil
Telugu
Temne
Tibetan
Tigirinya
Tsonga
Udmurt Ukrainian
Urdu
Venda
Votic
Western Frisian
Yoruba

== Languages supported by FineReader (full list) ==

Abkhaz
Adyghe
Afrikaans
Agul
Albanian
Altaic
Arabic
Armenian
Avar
Aymara
Azerbaijani
Azerbaijani (Cyrillic; old orthography)
Bashkir
Basque
Belarusian
Bemba
Blackfoot
Breton
Bugotu
Bulgarian
Buryat
Catalan
Cebuano
Chamorro
Chechen
Chinese (Simplified; Mandarin)
Chinese (Traditional; Mandarin)
Chukchee
Chuvash
Corsican
Crimean Tatar
Croatian
Crow
Czech
Dakota
Danish
Dargwa
Dungan
Dutch
Dutch (Belgium)
English
Eskimo (Cyrillic, Latin)
Esperanto
Estonian
Even
Evenki
Faroese
Fijian
Finnish
French
Frisian
Friulian
Gagauz
Galician
Ganda
German
German (Luxemburg)
German (old spelling)
Greek
Guarani
Hani
Hausa
Hawaiian
Hebrew
Hungarian
Icelandic
Ido
Indonesian
Ingush
Interlingua
Irish
Italian
Japanese
Jingpo
Kabardian
Kalmyk
Karachay-balkar
Karakalpak
Kasub
Kawa
Kazakh
Khakass
Khanty
Kikuyu
Kirghiz
Kongo
Korean
Korean (Hangul)
Koryak
Kpelle
Kumyk
Kurdish
Lak
Latin
Latvian
Lezgi
Lithuanian
Luba
Macedonian
Malagasy
Malay
Malinke
Maltese
Mansi
Maori
Mari
Maya
Miao
Minangkabau
Mohawk
Moldavian
Mongol
Mordvin
Nahuatl
Nenets
Nivkh
Nogay
Norwegian (Bokmål)
Norwegian (Nynorsk)
Nyanja
Occidental
Occitan
Ojibway
Ossetian
Papiamento
Polish
Portuguese (Brazil)
Portuguese (Portugal)
Provencal
Quechua
Rhaeto-romanic
Romanian
Romany
Rundi
Russian
Russian (old spelling)
Rwanda
Sami (Lappish)
Samoan
Scottish Gaelic
Selkup
Serbian (Cyrillic)
Serbian (Latin)
Shona
Slovak
Slovenian
Somali
Sorbian
Sotho
Spanish
Sunda
Swahili
Swazi
Swedish
Tabasaran
Tagalog
Tahitian
Tajik
Tatar
Thai
Tok Pisin
Tongan
Tswana
Tun
Turkish
Turkmen (Cyrillic)
Turkmen (Latin)
Tuvinian
Udmurt
Uighur (Cyrillic, Latin)
Ukrainian
Uzbek
Uzbek (Cyrillic)
Vietnamese
Welsh
Wolof
Xhosa
Yakut
Yiddish
Zapotec
Zulu

== Languages supported by Google (full list) ==

Acehnese
Acholi
Adangme
Afrikaans
Akan
Albanian
Algonquinian
Amharic
Ancient Greek
Arabic
Araucanian/Mapuche
Armenian
Assamese
Asturian
Athabaskan
Aymara
Azerbaijani
Azerbaijani (Cyrillic; old orthography)
Balinese
Bambara
Bantu
Bashkir
Basque
Batak
Belarusian
Bemba
Bengali
Bikol
Bislama
Bosnian
Breton
Bulgarian
Burmese
Catalan
Cebuano
Chechen
Cherokee
Chinese (Mandarin; Hong Kong)
Chinese (Simplified; Mandarin)
Chinese (Traditional; Mandarin)
Choctaw
Chuvash
Cree
Creek
Crimean Tatar
Croatian
Czech
Dakota
Danish
Dhivehi
Duala
Dutch
Dzonkha
Efik
English
Esperanto
Estonian
Ewe
Faroese
Fijian
Filipino
Finnish
Fon
French
Fulah
Ga
Galician
Ganda
Gayo
Georgian
German
Gilbertese
Gothic
Greek
Guarani
Gujarati
Haitian Creole
Hausa
Hawaiian
Hebrew
Herero
Hiligaynon
Hindi
Hungarian
Iban
Icelandic
Igbo
Iloko
Indonesian
Irish
Italian
Japanese
Javanese
Kabyle
Kachin
Kalaallisut
Kamba
Kannada
Kanuri
Karakalpak
Kazakh
Khasi
Khmer
Kikuyu
Kinyarwanda
Kirghiz
Komi
Kongo
Korean
Kosraean
Kuanyama
Lao
Latin
Latvian
Lingala
Lithuanian
Low German
Lozi
Luba-Katanga
Luo
Macedonian
Madurese
Malagasy
Malay
Malayalam
Maltese
Mandingo
Manx
Maori
Marathi
Marshallese
Mende
Middle English
Middle High German
Minangkabau
Mohawk
Mongo
Mongol
Nahuatl
Navajo
Ndonga
Nepali
Niuean
Northern Sotho
North Ndebele
Norwegian (Bokmål)
Nyanja
Nyankole
N

Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-20 Thread Federico Leva (Nemo)

Bodhisattwa Mandal, 19/02/2016 18:02:

> And when we were getting some hope, Google announced that they will
> charge for doing OCR using their drive.
> https

Makes sense.

> Is there any chance that WMF will go for negotiation with Google so that
> we can do the mass OCR free of charge?

What makes you think that Google's goals may match with ours?

Nemo



Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-02-19 Thread Bodhisattwa Mandal
Hi,

The OCR4Wikisource script is evolving rapidly. More than 150,000 pages
have already been OCRed in both the Tamil and Bengali Wikisources using the
OCR4Wikisource script. The idea and the tool have proved to be a
game-changer for Indic Wikisource projects.

And when we were getting some hope, Google announced that they will charge
for doing OCR using their drive.
https://cloud.google.com/vision/

Is there any chance that WMF will negotiate with Google so that we can do
the mass OCR free of charge? I remember Asaf once said that this
possibility could be pursued. I think now is the time to do that.

Regards,


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-01-06 Thread Andrea Zanni
Yeah!
I'm really happy that the BUB tool is resurrecting, and for the new OCR
script. Thanks everyone!

Aubrey

On Tue, Jan 5, 2016 at 9:53 PM, Asaf Bartov  wrote:

> On Tue, Jan 5, 2016 at 10:29 AM, Bodhisattwa Mandal <
> bodhisattwa.rg...@gmail.com> wrote:
>
>> Hi,
>>
>> I am happy to inform, that Shrinivasan has created a python script to
>> automate the process in Linux system. This scripts upload the PDF files to
>> Google Drive, download the OCRed text and split, merge the text files
>> properly to fit as the PDF file. We have just tested the script for small
>> files in Kannad and Bengali Wikisource and it was successful. We are going
>> to test the script for using different types and sizes of files and in
>> other Indic languages in next few days.
>>
>> The script is in https://github.com/tshrinivasan/OCR4wikisource
>>
>
> Fantastic news!
>
>A.


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-01-05 Thread Asaf Bartov
On Tue, Jan 5, 2016 at 10:29 AM, Bodhisattwa Mandal <
bodhisattwa.rg...@gmail.com> wrote:

> Hi,
>
> I am happy to inform, that Shrinivasan has created a python script to
> automate the process in Linux system. This scripts upload the PDF files to
> Google Drive, download the OCRed text and split, merge the text files
> properly to fit as the PDF file. We have just tested the script for small
> files in Kannad and Bengali Wikisource and it was successful. We are going
> to test the script for using different types and sizes of files and in
> other Indic languages in next few days.
>
> The script is in https://github.com/tshrinivasan/OCR4wikisource
>

Fantastic news!

   A.


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-01-05 Thread Rohit Dua
Hi

Just wanted to introduce the BUB tool on Tool Labs. It downloads books
from Google Books and some other libraries and then uploads them to the
Internet Archive for OCR. (After that, Tpt's ia-upload tool can be used for
the Commons upload.)
The tool was down for a long time, but it's getting ready again (a few more
fixes are needed).

Hope it'll be useful for the community again.

-
Rohit
On 6 Jan 2016 01:22, "Federico Leva (Nemo)"  wrote:

> Bodhisattwa Mandal, 01/12/2015 16:35:
>
>> 3) I cannot tell about other Indic languages, but I can say that Bengali
>> is not included in FineReader version of IA.
>>
>
> Ok, thanks for answering my question. What other languages are we
> interested in that are missing?
> See http://www.abbyy.com/support/finereader/11/rl/ for the list.
>
> Nemo


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-01-05 Thread Federico Leva (Nemo)

Bodhisattwa Mandal, 01/12/2015 16:35:

> 3) I cannot tell about other Indic languages, but I can say that Bengali
> is not included in FineReader version of IA.


Ok, thanks for answering my question. What other languages are we 
interested in that are missing?

See http://www.abbyy.com/support/finereader/11/rl/ for the list.

Nemo



Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-01-05 Thread Mathieu Stumpf Guntz

Great, thank you for the news and congratulations on this achievement. :)

On 05/01/2016 19:29, Bodhisattwa Mandal wrote:

> Hi,
>
> I am happy to inform, that Shrinivasan has created a python script to
> automate the process in Linux system. This scripts upload the PDF files to
> Google Drive, download the OCRed text and split, merge the text files
> properly to fit as the PDF file. We have just tested the script for small
> files in Kannad and Bengali Wikisource and it was successful. We are going
> to test the script for using different types and sizes of files and in
> other Indic languages in next few days.
>
> The script is in https://github.com/tshrinivasan/OCR4wikisource
>
> Regards,
> Bodhisattwa


   

Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2016-01-05 Thread Bodhisattwa Mandal
Hi,

I am happy to inform you that Shrinivasan has created a Python script to
automate the process on Linux systems. The script uploads the PDF files to
Google Drive, downloads the OCRed text, and splits and merges the text
files so that they match the PDF file. We have just tested the script on
small files for the Kannada and Bengali Wikisources and it was successful.
We are going to test the script with different types and sizes of files and
in other Indic languages in the next few days.

The script is in https://github.com/tshrinivasan/OCR4wikisource
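
For readers curious about the merge step, here is a rough sketch of joining per-page OCR text files back into one file in page order. It is not taken from OCR4wikisource: the `page_<n>.txt` naming and both function names are assumptions made for illustration, and the Google Drive upload and download are left out entirely.

```python
import re
from pathlib import Path


def page_number(path):
    """Use the first run of digits in the file name as the page number."""
    match = re.search(r"\d+", path.stem)
    return int(match.group()) if match else 0


def merge_page_texts(text_dir, output_file):
    """Concatenate per-page text files in numeric page order.

    Returns the file names in the order they were merged.
    """
    # Sort numerically so that page_10 comes after page_2, not before it.
    pages = sorted(Path(text_dir).glob("*.txt"), key=page_number)
    with open(output_file, "w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8").rstrip("\n"))
            out.write("\n")
    return [page.name for page in pages]
```

Sorting on the numeric page number rather than on the raw file name is the point of the sketch: a plain alphabetical sort would place `page_10.txt` before `page_2.txt`. Writing the merged file outside `text_dir` keeps it from being picked up as a page on a later run.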

Regards,
Bodhisattwa


On 2 December 2015 at 17:21, Tobias Schönberg  wrote:

> I think it is important for non-technical readers of this list to separate
> the 2 issues in the discussion.
>
> 1) OCR-Integration
> This is something WMF can help with, because they can make the connection
> between an OCR service and Mediawiki easier and automate certain steps.
>
> 2) OCR
> WMF is not programming an OCR-software and it would probably be a bad idea
> to reinvent the wheel. It would be far better if editors reached out to
> existing ORC-software projects. Starting a discussion or filing a bug is an
> important first step in improving the situation.
> Tesseract-OCR (https://github.com/tesseract-ocr) for example is an
> open-source project that works on OCR (No bugs filed for e.g. Bengali). The
> mailing list (https://groups.google.com/forum/#!forum/tesseract-ocr)
> contains discussions about e.g. Bengali (
> https://groups.google.com/forum/#!searchin/tesseract-ocr/Bengali). So I
> think the situation might not be good, but is certainly on its way of
> getting better.
> Maybe WMF-India can fund a developer to work on Tesseract-OCR. Another
> idea would be, to reach out to local universities. Maybe a few
> informatics-students can improve the situation.
>
> -Tobias
>
>
> 2015-12-01 19:51 GMT+01:00 ViswaPrabha (വിശ്വപ്രഭ) 
> :
>
>> From that page which, Alex has linked:
>> "On the other hand, using the service for converting document formats
>> *is* SaaSS, because it's something you could have done by running a
>> suitable program (free, one hopes) in your own computer."
>>
>> Hundreds among us have burnt their hands in developing a successful
>> 'free' OCR tool for Indic languages without any real luck until now.
>> Until such a tool appears on the horizon, the Google facility is just
>> okay to be used.
>>
>> Especially so, because we are anyway dealing with 'free' input and output
>> material.
>>
>> -Viswaprabha
>>
>>
>>
>> On 1 December 2015 at 21:49, Bodhisattwa Mandal <
>> bodhisattwa.rg...@gmail.com> wrote:
>>
>>> Hi Alex,
>>>
>>> Of course, building free OCR can be the only permanent solution, but WMF
>>> is not interested in building new OCR right now. The language engineering
>>> team said at the conference that, they don't have the infrastructure and
>>> expertise to build such software. That's why, we have to rely on Google
>>> OCR, knowing very well about its profit making intentions. It's just a
>>> temporary solution but right now, its the only best possible alternative
>>> for us.
>>>
>>> Regards
>>> Bodhisattwa
>>> On 1 Dec 2015 21:12, "Alex Brollo"  wrote:
>>>
>>>> ... nevertheless I found very interesting this about "SaaSS":
>>>> https://www.gnu.org/philosophy/who-does-that-server-really-serve.html
>>>>
>>>> So, to build a true, excellent and independent "wikisource multilingual
>>>> OCR service" would be a better solution.
>>>>
>>>> Alex
>>>>
>>>> 2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal <
>>>> bodhisattwa.rg...@gmail.com>:
>>>>
>>>>> Hi Nemo,
>>>>>
>>>>> Thanks for your interest. You can find the list of Google OCR
>>>>> supported languages in the following link -
>>>>>
>>>>> https://support.google.com/drive/answer/176692?hl=en
>>>>>
>>>>> Regards,
>>>>> Bodhisattwa
>>>>>
>>>>>> Thanks for posting about the topic. Which indic languages are we
>>>>>> talking about exactly? Are they included in the recent FineReader
>>>>>> versions now used by Internet Archive?
>>>>>>
>>>>>> Nemo

Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-02 Thread Tobias Schönberg
I think it is important for non-technical readers of this list to separate
the 2 issues in the discussion.

1) OCR-Integration
This is something WMF can help with, because they can make the connection
between an OCR service and Mediawiki easier and automate certain steps.

2) OCR
WMF is not programming OCR software, and it would probably be a bad idea
to reinvent the wheel. It would be far better if editors reached out to
existing OCR software projects. Starting a discussion or filing a bug is an
important first step in improving the situation.
Tesseract-OCR (https://github.com/tesseract-ocr), for example, is an
open-source OCR project (no bugs have been filed for e.g. Bengali). The
mailing list (https://groups.google.com/forum/#!forum/tesseract-ocr)
contains discussions about e.g. Bengali (
https://groups.google.com/forum/#!searchin/tesseract-ocr/Bengali). So I
think the situation might not be good, but it is certainly on its way to
getting better.
Maybe WMF-India can fund a developer to work on Tesseract-OCR. Another idea
would be to reach out to local universities. Maybe a few
informatics students can improve the situation.
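
For anyone who wants to see how Tesseract currently handles a Bengali page before filing a bug, a rough sketch follows. It assumes Tesseract and its Bengali language data (language code `ben`) are installed locally. The file names and the helper `tesseract_command` are hypothetical and only illustrate the basic command-line form `tesseract IMAGE OUTPUTBASE -l LANG`.

```python
import shlex


def tesseract_command(image_path, output_base, lang="ben"):
    """Build the argument list for a basic Tesseract CLI run.

    Tesseract writes the recognised text to OUTPUTBASE.txt.
    The default language "ben" (Bengali) is an assumption for this thread.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


# Hypothetical file names, for illustration only.
cmd = tesseract_command("page_001.png", "page_001")
print(shlex.join(cmd))
```

To actually run the OCR, the list can be passed to `subprocess.run(cmd, check=True)`. The result can then be compared with the scanned page and any systematic errors reported to the Tesseract project.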

-Tobias


2015-12-01 19:51 GMT+01:00 ViswaPrabha (വിശ്വപ്രഭ) :

> From that page, which Alex has linked:
> "On the other hand, using the service for converting document formats *is*
> SaaSS, because it's something you could have done by running a suitable
> program (free, one hopes) in your own computer."
>
> Hundreds among us have burnt their hands in developing a successful 'free'
> OCR tool for Indic languages without any real luck until now.
> Until such a tool appears on the horizon, the Google facility is just okay
> to be used.
>
> Especially so, because we are anyway dealing with 'free' input and output
> material.
>
> -Viswaprabha
>
>
>
> On 1 December 2015 at 21:49, Bodhisattwa Mandal <
> bodhisattwa.rg...@gmail.com> wrote:
>
>> Hi Alex,
>>
>> Of course, building a free OCR would be the only permanent solution, but WMF
>> is not interested in building a new OCR right now. The language engineering
>> team said at the conference that they don't have the infrastructure and
>> expertise to build such software. That's why we have to rely on Google
>> OCR, knowing very well its profit-making intentions. It's just a
>> temporary solution, but right now it's the best available alternative
>> for us.
>>
>> Regards
>> Bodhisattwa
>> On 1 Dec 2015 21:12, "Alex Brollo"  wrote:
>>
>>> ... nevertheless I found this about "SaaSS" very interesting:
>>> https://www.gnu.org/philosophy/who-does-that-server-really-serve.html
>>>
>>> So building a true, excellent and independent "wikisource multilingual
>>> OCR service" would be a better solution.
>>>
>>> Alex
>>>
>>> 2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal <
>>> bodhisattwa.rg...@gmail.com>:
>>>
>>>> Hi Nemo,
>>>>
>>>> Thanks for your interest. You can find the list of Google OCR supported
>>>> languages in the following link -
>>>>
>>>> https://support.google.com/drive/answer/176692?hl=en
>>>>
>>>> Regards,
>>>> Bodhisattwa
>>>>
>>>>> Thanks for posting about the topic. Which Indic languages are we
>>>>> talking about exactly? Are they included in the recent FineReader versions
>>>>> now used by Internet Archive?
>>>>>
>>>>> Nemo
>>>>


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-01 Thread വിശ്വപ്രഭ
From that page, which Alex has linked:
"On the other hand, using the service for converting document formats *is*
SaaSS, because it's something you could have done by running a suitable
program (free, one hopes) in your own computer."

Hundreds among us have burnt their hands in developing a successful 'free'
OCR tool for Indic languages without any real luck until now.
Until such a tool appears on the horizon, the Google facility is just okay
to be used.

Especially so, because we are anyway dealing with 'free' input and output
material.

-Viswaprabha



On 1 December 2015 at 21:49, Bodhisattwa Mandal  wrote:

> Hi Alex,
>
> Of course, building a free OCR would be the only permanent solution, but WMF
> is not interested in building a new OCR right now. The language engineering
> team said at the conference that they don't have the infrastructure and
> expertise to build such software. That's why we have to rely on Google
> OCR, knowing very well its profit-making intentions. It's just a
> temporary solution, but right now it's the best available alternative
> for us.
>
> Regards
> Bodhisattwa
> On 1 Dec 2015 21:12, "Alex Brollo"  wrote:
>
>> ... nevertheless I found this about "SaaSS" very interesting:
>> https://www.gnu.org/philosophy/who-does-that-server-really-serve.html
>>
>> So building a true, excellent and independent "wikisource multilingual
>> OCR service" would be a better solution.
>>
>> Alex
>>
>> 2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal <
>> bodhisattwa.rg...@gmail.com>:
>>
>>> Hi Nemo,
>>>
>>> Thanks for your interest. You can find the list of Google OCR supported
>>> languages in the following link -
>>>
>>> https://support.google.com/drive/answer/176692?hl=en
>>>
>>> Regards,
>>> Bodhisattwa
>>>
>>>> Thanks for posting about the topic. Which Indic languages are we talking
>>>> about exactly? Are they included in the recent FineReader versions now used
>>>> by Internet Archive?
>>>>
>>>> Nemo
>>>


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-01 Thread Bodhisattwa Mandal
Hi Alex,

Of course, building a free OCR would be the only permanent solution, but WMF is
not interested in building a new OCR right now. The language engineering team
said at the conference that they don't have the infrastructure and
expertise to build such software. That's why we have to rely on Google
OCR, knowing very well its profit-making intentions. It's just a
temporary solution, but right now it's the best available alternative
for us.

Regards
Bodhisattwa
On 1 Dec 2015 21:12, "Alex Brollo"  wrote:

> ... nevertheless I found this about "SaaSS" very interesting:
> https://www.gnu.org/philosophy/who-does-that-server-really-serve.html
>
> So building a true, excellent and independent "wikisource multilingual
> OCR service" would be a better solution.
>
> Alex
>
> 2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal  >:
>
>> Hi Nemo,
>>
>> Thanks for your interest. You can find the list of Google OCR supported
>> languages in the following link -
>>
>> https://support.google.com/drive/answer/176692?hl=en
>>
>> Regards,
>> Bodhisattwa
>>
>>> Thanks for posting about the topic. Which Indic languages are we talking
>>> about exactly? Are they included in the recent FineReader versions now used
>>> by Internet Archive?
>>>
>>> Nemo
>>


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-01 Thread Alex Brollo
... nevertheless I found this about "SaaSS" very interesting:
https://www.gnu.org/philosophy/who-does-that-server-really-serve.html

So building a true, excellent and independent "wikisource multilingual OCR
service" would be a better solution.

Alex

2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal :

> Hi Nemo,
>
> Thanks for your interest. You can find the list of Google OCR supported
> languages in the following link -
>
> https://support.google.com/drive/answer/176692?hl=en
>
> Regards,
> Bodhisattwa
>
>> Thanks for posting about the topic. Which Indic languages are we talking
>> about exactly? Are they included in the recent FineReader versions now used
>> by Internet Archive?
>>
>> Nemo
>


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-01 Thread Bodhisattwa Mandal
Hi Nemo,

1) Indic languages are basically all languages of the Indian subcontinent, like
Hindi, Sanskrit, Urdu, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Malayalam, Bengali, Odia, Assamese etc.

2) My specific interest is in the Bengali language.

3) I cannot speak for other Indic languages, but I can say that Bengali is
not included in the FineReader version used by IA.

Regards,
Bodhisattwa
On 1 Dec 2015 20:57, "Bodhisattwa Mandal" 
wrote:

> Hi Nemo,
>
> Please follow this link also,
>
>
> http://cis-india.org/a2k/blogs/googles-optical-character-recognition-software-now-works-with-all-south-asian-languages
>
> Regards,
> Bodhisattwa
> On 1 Dec 2015 20:29, "Federico Leva (Nemo)"  wrote:
>
>> Thanks for posting about the topic. Which Indic languages are we talking
>> about exactly? Are they included in the recent FineReader versions now used
>> by Internet Archive?
>>
>> Nemo
>>


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-01 Thread Bodhisattwa Mandal
Hi Nemo,

Please follow this link also,

http://cis-india.org/a2k/blogs/googles-optical-character-recognition-software-now-works-with-all-south-asian-languages

Regards,
Bodhisattwa
On 1 Dec 2015 20:29, "Federico Leva (Nemo)"  wrote:

> Thanks for posting about the topic. Which Indic languages are we talking
> about exactly? Are they included in the recent FineReader versions now used
> by Internet Archive?
>
> Nemo
>


Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-01 Thread Federico Leva (Nemo)

Bodhisattwa Mandal, 01/12/2015 16:06:

> Thanks for your interest. You can find the list of Google OCR supported
> languages in the following link -
>
> https://support.google.com/drive/answer/176692?hl=en


Yes, but that's very generic; for instance, they don't say what level of
support they have. Most importantly, I was asking which languages are

a) Indic,
b) interesting for you, AND
c) not supported by Internet Archive (FineReader).

Nemo



Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-01 Thread Bodhisattwa Mandal
Hi Nemo,

Thanks for your interest. You can find the list of Google OCR supported
languages in the following link -

https://support.google.com/drive/answer/176692?hl=en

Regards,
Bodhisattwa

> Thanks for posting about the topic. Which Indic languages are we talking
> about exactly? Are they included in the recent FineReader versions now used
> by Internet Archive?
>
> Nemo



Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-01 Thread Federico Leva (Nemo)
Thanks for posting about the topic. Which Indic languages are we talking
about exactly? Are they included in the recent FineReader versions now
used by Internet Archive?


Nemo



[Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

2015-12-01 Thread Bodhisattwa Mandal
Hi,

For a long time, Indic-language Wikisource projects depended entirely
on manual proofreading, which cost a lot of time and a lot of
energy. Recently Google released OCR software for more
than 20 Indic languages, along with other Asian languages. This
software is far better and more accurate than previous OCRs. But it
has many limitations. Uploading the same large file twice (once
to Google OCR and once to Commons) is not an easy solution
for most contributors, as Internet connections are very slow in
India. If we develop a tool that can feed the PDF or
DjVu files already uploaded to Commons directly to Google OCR, uploading
them twice can be avoided.
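
A minimal sketch of the first half of such a tool, in Python: asking the Commons MediaWiki API for the direct URL of a file that is already uploaded, so that it does not have to be uploaded a second time. The file title, the sample response and the helper names are hypothetical. How the file would then be handed to Google OCR is left out on purpose, since that interface has not been settled in this thread.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def imageinfo_query_url(file_title):
    """Build a MediaWiki API URL that asks Commons for a file's direct URL."""
    params = {
        "action": "query",
        "titles": file_title,
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }
    return COMMONS_API + "?" + urlencode(params)


def extract_file_url(api_response):
    """Return the first file URL found in a parsed imageinfo response, or None."""
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            return info["url"]
    return None


# Hypothetical title and response, for illustration only (no network access).
query_url = imageinfo_query_url("File:Example book.djvu")
sample_response = {
    "query": {
        "pages": {
            "12345": {
                "title": "File:Example book.djvu",
                "imageinfo": [{"url": "https://upload.wikimedia.org/example.djvu"}],
            }
        }
    }
}
file_url = extract_file_url(sample_response)
print(query_url)
print(file_url)
```

With the file URL in hand, the tool could fetch the PDF or DjVu file on the server side and pass it on to the OCR service, so the contributor's slow connection is used only once.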

This was proposed in the 2015 community wishlist. Now that voting
on the wishlist has started, the proposal needs your
support. Please follow the link -

https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Wikisource#Tool_to_use_Google_OCRs_in_Indic_language_Wikisource

FYI, this proposal was also accepted as a highest priority need at the
2015 Wikisource Conference in Vienna.
(https://etherpad.wikimedia.org/p/wscon2015needs)

Regards
-- 
Bodhisattwa Mandal
Administrator, Bengali Wikipedia

''Imagine a world in which every single person on the planet is given
free access to the sum of all human knowledge.''
