Re: [Wikisource-l] OCR as a service?

2015-07-12 Thread Alex Brollo
I explored the abbyy.gz files, the full XML output from the ABBYY OCR engine
running at Internet Archive, and I was astonished by the amount of data
they contain - they are stored at the XCA_Extended detail level (as documented at
http://www.abbyy-developers.com/en:tech:features:xml ).

This is something Wikisource's best developers should explore; comparing that
data with the little that is kept in the mapped text layer of DjVu files is
impressive and should be inspiring.
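
To give a sense of the per-character data involved, here is a minimal Python sketch that reads character boxes and confidences from an ABBYY-style XML fragment. The sample fragment, its coordinates and confidence values, and the helper name `iter_chars` are assumptions made for illustration, not taken from a real Internet Archive file. The `charParams` element and its `l`, `t`, `r`, `b` and `charConfidence` attributes follow the ABBYY XML output as I understand it, and the code matches on the local tag name so it does not depend on the exact namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of ABBYY FineReader XML output;
# the coordinates and confidence values are made up for illustration.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
  <page width="1000" height="1500">
    <block blockType="Text">
      <text><par><line baseline="120" l="100" t="90" r="160" b="125">
        <formatting lang="ItalianStandard">
          <charParams l="100" t="90" r="118" b="120" charConfidence="98">C</charParams>
          <charParams l="120" t="95" r="135" b="120" charConfidence="41">i</charParams>
          <charParams l="137" t="95" r="150" b="120" charConfidence="87">o</charParams>
        </formatting>
      </line></par></text>
    </block>
  </page>
</document>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_chars(xml_text):
    """Yield (char, (l, t, r, b), confidence) for every charParams element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local_name(elem.tag) != "charParams":
            continue
        box = tuple(int(elem.get(k)) for k in ("l", "t", "r", "b"))
        conf = elem.get("charConfidence")
        yield (elem.text or "", box, int(conf) if conf is not None else None)


chars = list(iter_chars(SAMPLE))
word = "".join(c for c, _, _ in chars)
# Characters the engine itself was unsure about (threshold chosen arbitrarily).
doubtful = [c for c, _, conf in chars if conf is not None and conf < 50]
```

None of this position and confidence information survives in a plain text layer, which is the contrast drawn above.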

But it is static data produced with standard settings... nothing like
a service with simple, shared, deep-learning features for difficult and
ancient texts. I tried the "ancient Italian" Tesseract dictionary, with very
poor results.

So Asaf, I can't wait for good news from you. :-)

Alex

2015-07-12 12:50 GMT+02:00 Andrea Zanni :

>
>
> On Sun, Jul 12, 2015 at 11:25 AM, Asaf Bartov 
> wrote:
>
>> On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni 
>> wrote:
>>
>>> uh, that sounds very interesting.
>>> Right now, we mainly use OCR from djvu from Internet Archive (that means
>>> ABBYY Finereader, which is very nice).
>>>
>>
>> Yes, the output is generally good.  But as far as I can tell, the
>> archive's Open Library API does not offer a way to retrieve the OCR output
>> programmatically, and certainly not for an arbitrary page rather than the
>> whole item.  What I'm working on requires the ability to OCR a single page
>> on demand.
>>
> True.
> I've recently met Giovanni, a new (Italian) guy who's now working with
> the Internet Archive and Open Library.
> We discussed a number of possible partnerships/projects; this is
> definitely one to bring up.
>
> But if we manage to do it directly in the Wikimedia world it's even
> better.
>
> Aubrey
>
>
___
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l


Re: [Wikisource-l] Category browser

2015-07-12 Thread Arnd
Kategorie:Fertig is correct, but it contains both indexes and pages.
Thus, I get an error when updating the Wikidata item.

> The 'index_root' is the category in which Indexes are put when they're
> validated (i.e. proofread by at least two people).
>
> Perhaps for German it's actually Kategorie:Korrigiert? Or is that what
> precedes Fertig?
>
> If the correct site link is added to
> https://www.wikidata.org/wiki/Q15634466 then the tool will pick it up
> from there.
>
> —sam.
>
> PS And '2' below is the 'root category', i.e. the topmost category of all.
>
>
> On 12/07/15 20:18, Arnd wrote:
>> Nicolas, 1 and 3 are fine; for 2 and 4 the semantics are not clear to
>> me. What do they mean? Arnd
>>
>>>
>>>
>>> 2015-07-12 13:48 GMT+02:00 Arnd:
>>>
>>>> Hi all, what is required to have "de" there as well? Arnd
>>>
>>>
>>> Arnd, could you confirm that this is right:
>>>
>>> 'cat_label'  => 'Kategorie',
>>> 'cat_root'   => '!Hauptkategorie',
>>> 'index_ns'   => 104,
>>> 'index_root' => 'Fertig',
>>>
>>> I'm not sure about the last one, since it's not linked on Q15634466.
>>> 
>>>
>>> Cdlt, ~nicolas
>>>
>>>


Re: [Wikisource-l] Category browser

2015-07-12 Thread billinghurst
Sam,

Would you mind checking the code that generates the output? It seems to break on
apostrophes; for example, the line
  Mrs. Caudle's curtain lectures [Download EPUB]
gives a link
  https://en.wikisource.org/wiki/Mrs._Caudle (wrong)
rather than
  https://en.wikisource.org/wiki/Mrs._Caudle%27s_curtain_lectures

Thanks. Regards, Billinghurst
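
A minimal Python sketch of the encoding step that would produce the expected link. The helper name `page_url` is hypothetical, and the idea that the tool truncates the title because the apostrophe is written out unencoded is a guess, not something confirmed from the tool's source.

```python
from urllib.parse import quote

BASE = "https://en.wikisource.org/wiki/"


def page_url(title):
    """Build a wiki page URL from a title, percent-encoding unsafe characters."""
    # MediaWiki URLs use underscores in place of spaces.
    slug = title.replace(" ", "_")
    # quote() leaves letters, digits and "_.-~" alone and encodes the rest,
    # so the apostrophe becomes %27; "/" and ":" are kept for subpages
    # and namespace prefixes.
    return BASE + quote(slug, safe="/:")


url = page_url("Mrs. Caudle's curtain lectures")
```

If the link is also written into HTML, the attribute value would need escaping as well (for instance with `html.escape`), since an apostrophe ends a single-quoted attribute.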

On Fri, Jul 10, 2015 at 10:31 PM Sam Wilson  wrote:

> Two things about http://tools.wmflabs.org/ws-cat-browser/ —
>
> 1. I've changed the ownership of this tool, and it's now at
> https://github.com/wikisource/ws-cat-browser
>
> 2. It's slightly multi-lingual now. At least, it allows browsing of the
> Italian categories now. All the UI text is still in English I'm afraid.
> I'd like to add languages, but need to know the names of 'validated
> works' and root-level categories (e.g. for French it's perhaps
> "Catégorie:100%", but I'm not really sure; that might be old).
>
> Thanks,
> Sam.
>
>


Re: [Wikisource-l] Category browser

2015-07-12 Thread Sam Wilson
The 'index_root' is the category in which Indexes are put when they're 
validated (i.e. proofread by at least two people).


Perhaps for German it's actually Kategorie:Korrigiert? Or is that what 
precedes Fertig?


If the correct site link is added to 
https://www.wikidata.org/wiki/Q15634466 then the tool will pick it up 
from there.
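
A minimal Python sketch of how site links for an item can be read through the Wikidata API (`wbgetentities` with `props=sitelinks`). How the tool itself does this is not stated in the thread; the function names and the sample response below are assumptions for illustration, and the titles in the sample are placeholders rather than real data.

```python
import json
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"
ITEM = "Q15634466"


def sitelinks_url(item_id):
    """URL for a wbgetentities request that returns only the item's sitelinks."""
    params = {"action": "wbgetentities", "ids": item_id,
              "props": "sitelinks", "format": "json"}
    return API + "?" + urlencode(params)


def wikisource_categories(response, item_id):
    """Map language code -> page title for every '<lang>wikisource' sitelink."""
    links = response["entities"][item_id].get("sitelinks", {})
    suffix = "wikisource"
    return {site[:-len(suffix)]: link["title"]
            for site, link in links.items()
            if site.endswith(suffix) and site != suffix}


# Hypothetical response body, shaped like the API's JSON; titles are placeholders.
SAMPLE = json.loads("""{"entities": {"Q15634466": {"sitelinks": {
  "itwikisource": {"site": "itwikisource", "title": "Categoria:Esempio"},
  "commonswiki": {"site": "commonswiki", "title": "Category:Example"}
}}}}""")

found = wikisource_categories(SAMPLE, ITEM)
```

Under this reading, a wiki with no `<lang>wikisource` entry on the item simply does not appear in the result, which would match the behaviour described for "de".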


—sam.

PS And '2' below is the 'root category', i.e. the topmost category of all.


On 12/07/15 20:18, Arnd wrote:
> Nicolas, 1 and 3 are fine; for 2 and 4 the semantics are not clear to
> me. What do they mean? Arnd
>
>> 2015-07-12 13:48 GMT+02:00 Arnd:
>>
>>> Hi all, what is required to have "de" there as well? Arnd
>>
>> Arnd, could you confirm that this is right:
>>
>> 'cat_label'  => 'Kategorie',
>> 'cat_root'   => '!Hauptkategorie',
>> 'index_ns'   => 104,
>> 'index_root' => 'Fertig',
>>
>> I'm not sure about the last one, since it's not linked on Q15634466.
>>
>> Cdlt, ~nicolas




Re: [Wikisource-l] Category browser

2015-07-12 Thread Arnd
Nicolas, 1 and 3 are fine; for 2 and 4 the semantics are not clear to me.
What do they mean? Arnd

>
>
> 2015-07-12 13:48 GMT+02:00 Arnd:
>
>> Hi all, what is required to have "de" there as well? Arnd
>
>
> Arnd, could you confirm that this is right:
>
> 'cat_label'  => 'Kategorie',
> 'cat_root'   => '!Hauptkategorie',
> 'index_ns'   => 104,
> 'index_root' => 'Fertig',
>
> I'm not sure about the last one, since it's not linked on Q15634466.
> 
>
> Cdlt, ~nicolas
>
>


Re: [Wikisource-l] Category browser

2015-07-12 Thread Nicolas VIGNERON
2015-07-12 13:48 GMT+02:00 Arnd :

>  Hi all, what is required to have "de" there as well? Arnd
>

Arnd, could you confirm that this is right:

'cat_label'  => 'Kategorie',
'cat_root'   => '!Hauptkategorie',
'index_ns'   => 104,
'index_root' => 'Fertig',

I'm not sure about the last one, since it's not linked on Q15634466.
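
For readers unfamiliar with the four settings, a small Python sketch of how they might combine into full category titles. This is only an illustration: the `SITES` structure and the `category_title` helper are hypothetical, and how the tool actually consumes these keys is not shown in the thread.

```python
# Values as proposed in the thread for German Wikisource; whether
# 'index_root' should be 'Fertig' was still an open question.
SITES = {
    "de": {
        "cat_label": "Kategorie",        # localized name of the Category namespace
        "cat_root": "!Hauptkategorie",   # topmost category of all
        "index_ns": 104,                 # namespace number given for Index pages
        "index_root": "Fertig",          # category holding validated indexes
    },
}


def category_title(lang, key):
    """Return the full category title, e.g. 'Kategorie:Fertig'."""
    site = SITES[lang]
    return f"{site['cat_label']}:{site[key]}"


root_category = category_title("de", "cat_root")
validated_category = category_title("de", "index_root")
```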


Cdlt, ~nicolas


Re: [Wikisource-l] Category browser

2015-07-12 Thread Sam Wilson



On 12/07/15 19:48, Arnd wrote:
> Hi all, what is required to have "de" there as well? Arnd


Good question!

An addition to https://www.wikidata.org/wiki/Q15634466 is all.

I'm afraid I don't know more about that Item. ricordisamoa pointed it out.

It'd be great to get all Wikisources added there. :-)

—sam.


Re: [Wikisource-l] Category browser

2015-07-12 Thread Arnd
Hi all, what is required to have "de" there as well? Arnd

>
>
> On 12/07/15 17:29, Nicolas VIGNERON wrote:
>>
>>
>> 2015-07-12 4:59 GMT+02:00 Sam Wilson:
>>
>> It only re-runs the script weekly, or when I hit 'go'. I've hit
>> go... and it's found another loop! This one on br:
>>
>> (
>> [0] => Jezuz-Krist_en_Breiz-Izel - Rummad:Contes_bretons
>> [1] => Rummad:Contes_bretons - Rummad:Levrioù
>> [2] => Rummad:Levrioù - Rummad:Pennrummad
>> [3] => Rummad:Levrioù - Rummad:Rummadoù
>> [4] => Rummad:Rummadoù - Rummad:Rummadoù
>> )
>>
>> Wow, nasty loop ! I've completely deleted Rummad:Rummadoù
>>
>> Cdlt, ~nicolas
>
> Oh, brilliant nicolas! Thank you. :) It's re-running now.
>
> Does anyone know whether Czech Wikisource has any validated,
> categorized works? The silly tool is telling me no.
>
> The script now is fully clever eh, and is automatically figuring
> things out for itself. It now knows about:
>
>  1. br 
>  2. ca 
>  3. cs 
>  4. da 
>  5. es 
>  6. fa 
>  7. fr 
>  8. is 
>  9. it 
> 10. no 
> 11. pt 
> 12. sv 
> 13. vec 
>
> (Or, at least, is pretending to. Gosh I feel like a
> linguistically-backward Australian sometimes. I mean, I *am* one of
> those...)
>
> -sam.
>
>
>


[Wikisource-l] Another category loop (ES)

2015-07-12 Thread Sam Wilson

Does anyone mind that I keep posting these things? This time it's on es:

[0] => Pedagogía_Tolteca - Categoría:ES-P
[1] => Pedagogía_Tolteca - Categoría:Ensayos
[2] => Pedagogía_Tolteca - Categoría:Ensayos_de_Guillermo_Marín_Ruiz
[3] => Pedagogía_Tolteca - Categoría:Historia_de_México
[4] => Categoría:Historia_de_México - Categoría:Historia_por_países
[5] => Categoría:Historia_por_países - Categoría:Historia
[6] => Categoría:Historia - Categoría:Ciencias_humanísticas
[7] => Categoría:Historia - Categoría:Documentos_históricos
[8] => Categoría:Documentos_históricos - Categoría:Principal
[9] => Categoría:Documentos_históricos - Categoría:Índice_de_documentos_históricos
[10] => Categoría:Índice_de_documentos_históricos - Categoría:Índice_de_documentos_históricos
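
A minimal Python sketch of how a loop like the one in the listing above can be detected by walking from a work up through its parent categories. It is not the tool's actual code; the function name `find_cycle` and the `parents` mapping are assumptions, and only the edges are taken from the listing.

```python
def find_cycle(parents, start):
    """Return the first path from `start` that revisits a category, or None.

    `parents` maps a page or category to the categories it belongs to.
    """
    def visit(node, path):
        if node in path:
            # Walking upwards brought us back to a category already on the path.
            return path[path.index(node):] + [node]
        for parent in parents.get(node, []):
            found = visit(parent, path + [node])
            if found:
                return found
        return None

    return visit(start, [])


# Edges taken from the Spanish listing above (child -> parent categories).
PARENTS = {
    "Pedagogía_Tolteca": ["Categoría:ES-P", "Categoría:Ensayos",
                          "Categoría:Ensayos_de_Guillermo_Marín_Ruiz",
                          "Categoría:Historia_de_México"],
    "Categoría:Historia_de_México": ["Categoría:Historia_por_países"],
    "Categoría:Historia_por_países": ["Categoría:Historia"],
    "Categoría:Historia": ["Categoría:Ciencias_humanísticas",
                           "Categoría:Documentos_históricos"],
    "Categoría:Documentos_históricos": [
        "Categoría:Principal",
        "Categoría:Índice_de_documentos_históricos"],
    "Categoría:Índice_de_documentos_históricos": [
        "Categoría:Índice_de_documentos_históricos"],
}

cycle = find_cycle(PARENTS, "Pedagogía_Tolteca")
```

Here the offending category is its own parent, so the reported cycle is that category twice. The sketch re-walks shared ancestors and has no memoization, which is fine for a small example but would need a visited set on a full category tree.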


I could blunder in and post in English on their Scriptorium/Pub/Cafe, 
but that seems rather rude. :)


—sam.



Re: [Wikisource-l] Category browser

2015-07-12 Thread Sam Wilson



On 12/07/15 17:29, Nicolas VIGNERON wrote:
> 2015-07-12 4:59 GMT+02:00 Sam Wilson:
>
>> It only re-runs the script weekly, or when I hit 'go'. I've hit
>> go... and it's found another loop! This one on br:
>>
>> (
>> [0] => Jezuz-Krist_en_Breiz-Izel - Rummad:Contes_bretons
>> [1] => Rummad:Contes_bretons - Rummad:Levrioù
>> [2] => Rummad:Levrioù - Rummad:Pennrummad
>> [3] => Rummad:Levrioù - Rummad:Rummadoù
>> [4] => Rummad:Rummadoù - Rummad:Rummadoù
>> )
>
> Wow, nasty loop! I've completely deleted Rummad:Rummadoù
>
> Cdlt, ~nicolas


Oh, brilliant nicolas! Thank you. :) It's re-running now.

Does anyone know whether Czech Wikisource has any validated, categorized 
works? The silly tool is telling me no.


The script now is fully clever eh, and is automatically figuring things 
out for itself. It now knows about:


1. br 
2. ca 
3. cs 
4. da 
5. es 
6. fa 
7. fr 
8. is 
9. it 
10. no 
11. pt 
12. sv 
13. vec 

(Or, at least, is pretending to. Gosh I feel like a 
linguistically-backward Australian sometimes. I mean, I *am* one of 
those...)


-sam.



Re: [Wikisource-l] OCR as a service?

2015-07-12 Thread Andrea Zanni
On Sun, Jul 12, 2015 at 11:25 AM, Asaf Bartov  wrote:

> On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni 
> wrote:
>
>> uh, that sounds very interesting.
>> Right now, we mainly use OCR from djvu from Internet Archive (that means
>> ABBYY Finereader, which is very nice).
>>
>
> Yes, the output is generally good.  But as far as I can tell, the
> archive's Open Library API does not offer a way to retrieve the OCR output
> programmatically, and certainly not for an arbitrary page rather than the
> whole item.  What I'm working on requires the ability to OCR a single page
> on demand.
>
True.
I've recently met Giovanni, a new (Italian) guy who's now working with
the Internet Archive and Open Library.
We discussed a number of possible partnerships/projects; this is
definitely one to bring up.

But if we manage to do it directly in the Wikimedia world it's even better.

Aubrey


>


Re: [Wikisource-l] OCR as a service?

2015-07-12 Thread billinghurst
OCR is available through a JavaScript script. A number of Wikisources have it
enabled as a gadget, though I cannot speak for all the wikis. I presume that
depends on the languages the OCR supports.

The script is listed at
https://wikisource.org/wiki/Wikisource:Shared_Scripts

Regards, Billinghurst

On Sun, Jul 12, 2015 at 7:23 PM Asaf Bartov  wrote:

> On Sat, Jul 11, 2015 at 8:44 AM, Nicolas VIGNERON <
> vigneron.nico...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm not a techie, so I'm not sure what OCR-as-a-service is, but you
>> should ask Tpt and Phe, who have OCR stuff on Tool Labs (to find out what is
>> behind tools like http://tools.wmflabs.org/phetools/ocr.php ).
>>
>
> Thanks for the pointer!  I don't see any documentation on how to feed
> images to it, though, and no pointer to the source code to figure it out on
> my own.  Help?
>
> A.
> --
> Asaf Bartov
> Wikimedia Foundation 
>
> Imagine a world in which every single human being can freely share in the
> sum of all knowledge. Help us make it a reality!
> https://donate.wikimedia.org


Re: [Wikisource-l] Category browser

2015-07-12 Thread Nicolas VIGNERON
2015-07-12 4:59 GMT+02:00 Sam Wilson :

>  It only re-runs the script weekly, or when I hit 'go'. I've hit go... and
> it's found another loop! This one on br:
>
> (
> [0] => Jezuz-Krist_en_Breiz-Izel - Rummad:Contes_bretons
> [1] => Rummad:Contes_bretons - Rummad:Levrioù
> [2] => Rummad:Levrioù - Rummad:Pennrummad
> [3] => Rummad:Levrioù - Rummad:Rummadoù
> [4] => Rummad:Rummadoù - Rummad:Rummadoù
> )
>
Wow, nasty loop! I've completely deleted Rummad:Rummadoù

Cdlt, ~nicolas


Re: [Wikisource-l] OCR as a service?

2015-07-12 Thread Asaf Bartov
On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni 
wrote:

> uh, that sounds very interesting.
> Right now, we mainly use OCR from djvu from Internet Archive (that means
> ABBYY Finereader, which is very nice).
>

Yes, the output is generally good.  But as far as I can tell, the archive's
Open Library API does not offer a way to retrieve the OCR output
programmatically, and certainly not for an arbitrary page rather than the
whole item.  What I'm working on requires the ability to OCR a single page
on demand.
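
As a rough illustration of "OCR a single page on demand", here is a minimal Python sketch that shells out to the Tesseract command-line tool (the engine mentioned elsewhere in this thread) for one page image. It is an assumption-laden stand-in, not the service under discussion: the function names and the file name are hypothetical, and it presumes `tesseract` and the requested language data are installed locally.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the argument list for OCR-ing one image with the tesseract CLI."""
    # 'stdout' as the output base makes tesseract print the text instead of
    # writing a file; '-l' selects the language data to use.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_page(image_path, lang="eng"):
    """Run tesseract on a single page image and return the recognised text."""
    result = subprocess.run(tesseract_command(image_path, lang),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical file name; requires tesseract and the 'ita' data to be installed.
command = tesseract_command("page_0042.png", lang="ita")
```

Wrapping `ocr_page` behind an HTTP endpoint would give a per-page service, but output quality on old texts would still depend on the language data available.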

> But ideally we could think of a "customizable" OCR software that gets
> trained language by language: that would be extremely useful for
> Wikisources.
>
> (I can also imagine dividing each language by century,
> because languages too change over time ;-)
>

Indeed.

   A.
-- 
Asaf Bartov
Wikimedia Foundation 

Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org


Re: [Wikisource-l] OCR as a service?

2015-07-12 Thread Asaf Bartov
On Sat, Jul 11, 2015 at 8:44 AM, Nicolas VIGNERON <
vigneron.nico...@gmail.com> wrote:

> Hi,
>
> I'm not a techie, so I'm not sure what OCR-as-a-service is, but you
> should ask Tpt and Phe, who have OCR stuff on Tool Labs (to find out what is
> behind tools like http://tools.wmflabs.org/phetools/ocr.php ).
>

Thanks for the pointer!  I don't see any documentation on how to feed
images to it, though, and no pointer to the source code to figure it out on
my own.  Help?

A.
-- 
Asaf Bartov
Wikimedia Foundation 

Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org