Re: [Wikisource-l] OCR as a service?

2015-07-12 Thread Alex Brollo
I explored abbyy gx files, the full XML output from the ABBYY OCR engine running at Internet Archive, and I've been astonished by the amount of data they contain - they are stored at XCA_Extended detail (as documented at http://www.abbyy-developers.com/en:tech:features:xml ). Something that

Re: [Wikisource-l] OCR as a service?

2015-07-12 Thread Asaf Bartov
On Sat, Jul 11, 2015 at 8:44 AM, Nicolas VIGNERON vigneron.nico...@gmail.com wrote: Hi, I'm not a techie so I'm not sure I know what OCR-as-a-service is, but you should ask Tpt and Phe, who have OCR stuff on Tool Labs (to know what is behind tools like

Re: [Wikisource-l] OCR as a service?

2015-07-12 Thread Asaf Bartov
On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni zanni.andre...@gmail.com wrote: uh, that sounds very interesting. Right now, we mainly use OCR from djvu from Internet Archive (that means ABBYY Finereader, which is very nice). Yes, the output is generally good. But as far as I can tell, the

[Wikisource-l] Another category loop (ES)

2015-07-12 Thread Sam Wilson
Does anyone mind that I keep posting these things? This time it's on es: [0] = Pedagogía_Tolteca - Categoría:ES-P [1] = Pedagogía_Tolteca - Categoría:Ensayos [2] = Pedagogía_Tolteca - Categoría:Ensayos_de_Guillermo_Marín_Ruiz [3] = Pedagogía_Tolteca - Categoría:Historia_de_México
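For readers unfamiliar with category loops: a loop exists when following "is a member of" links from a page or category eventually leads back to a category already visited. The script used on the list is not shown in this archive, so the following is only a minimal sketch of one way to detect such a loop with a depth-first search. The function name, the `parents` mapping, and the sample titles are all hypothetical and are not taken from the actual tool.

```python
from typing import Dict, List, Optional


def find_category_loop(parents: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Return a chain of titles that ends where it began, or None.

    `parents` maps a page or category title to the categories it belongs to
    (hypothetical input format). Only loops reachable from `start` are found.
    """
    path: List[str] = []   # titles on the current chain, in order
    on_path = set()        # same titles, for fast membership tests
    finished = set()       # titles already fully explored without a loop

    def visit(node: str) -> Optional[List[str]]:
        if node in on_path:
            # We came back to a title already on the chain: that is a loop.
            return path[path.index(node):] + [node]
        if node in finished:
            return None
        on_path.add(node)
        path.append(node)
        for parent in parents.get(node, []):
            loop = visit(parent)
            if loop is not None:
                return loop
        path.pop()
        on_path.discard(node)
        finished.add(node)
        return None

    return visit(start)


# Hypothetical data: Cat:X and Cat:Y contain each other.
example = {
    "Page_A": ["Cat:X"],
    "Cat:X": ["Cat:Y"],
    "Cat:Y": ["Cat:X"],
}
print(find_category_loop(example, "Page_A"))
```

The sketch uses recursion for brevity; a very deep category tree would need an explicit stack instead.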

Re: [Wikisource-l] Category browser

2015-07-12 Thread Arnd
Hi all, what is required to have de there as well? Arnd On 12/07/15 17:29, Nicolas VIGNERON wrote: 2015-07-12 4:59 GMT+02:00 Sam Wilson s...@samwilson.id.au mailto:s...@samwilson.id.au: It only re-runs the script weekly, or when I hit 'go'. I've hit go... and it's found another

Re: [Wikisource-l] Category browser

2015-07-12 Thread Arnd
Nicolas, 1 and 3 are fine; for 2 and 4 the semantics are not clear to me. What do they mean? Arnd 2015-07-12 13:48 GMT+02:00 Arnd arnd.schroe...@gmail.com mailto:arnd.schroe...@gmail.com: Hi all, what is required to have de there as well? Arnd Arnd, could you confirm that this is right:

Re: [Wikisource-l] Category browser

2015-07-12 Thread Sam Wilson
The 'index_root' is the category in which Indexes are put when they're validated (i.e. proofread by at least two people). Perhaps for German it's actually Kategorie:Korrigiert? Or is that what precedes Fertig? If the correct site link is added to https://www.wikidata.org/wiki/Q15634466 then

Re: [Wikisource-l] Category browser

2015-07-12 Thread Sam Wilson
On 12/07/15 19:48, Arnd wrote: Hi all, what is required to have de there as well? Arnd Good question! An addition to https://www.wikidata.org/wiki/Q15634466 is all. I'm afraid I don't know more about that Item. ricordisamoa pointed it out. It'd be great to get all Wikisources added there.

Re: [Wikisource-l] Category browser

2015-07-12 Thread Nicolas VIGNERON
2015-07-12 4:59 GMT+02:00 Sam Wilson s...@samwilson.id.au: It only re-runs the script weekly, or when I hit 'go'. I've hit go... and it's found another loop! This one on br: ( [0] = Jezuz-Krist_en_Breiz-Izel - Rummad:Contes_bretons [1] = Rummad:Contes_bretons - Rummad:Levrioù

Re: [Wikisource-l] OCR as a service?

2015-07-12 Thread billinghurst
OCR is available via a JavaScript script. A number of Wikisources have it enabled as a gadget, though I cannot speak for all the wikis. I presume it relates to the languages available in the OCR. The script is noted at https://wikisource.org/wiki/Wikisource:Shared_Scripts Regards, Billinghurst On Sun, Jul

Re: [Wikisource-l] Category browser

2015-07-12 Thread Nicolas VIGNERON
2015-07-12 13:48 GMT+02:00 Arnd arnd.schroe...@gmail.com: Hi all, what is required to have de there as well? Arnd Arnd, could you confirm that this is right: 'cat_label' = 'Kategorie', 'cat_root' = '!Hauptkategorie', 'index_ns' = 104, 'index_root' =

Re: [Wikisource-l] Category browser

2015-07-12 Thread Arnd
Kategorie:Fertig is correct, but it contains both indexes and pages. Thus, I get an error when updating the Wikidata item. The 'index_root' is the category in which Indexes are put when they're validated (i.e. proofread by at least two people). Perhaps for German it's actually