Re: [Wikisource-l] Analysed Layout and Text Object (ALTO)

2015-10-05 Thread billinghurst
I don't disagree that this should be part of our long-term vision, and it
would help to have people who can track this and advise the community on
its development and implementation. That said, I don't see how we would
export to this format, or expand to it, in the wiki form.

I am concerned that we have so many basic issues unresolved, and so little
developer time, that even the mundane tasks are not being addressed. :-/

Regards, Billinghurst

On Mon, Oct 5, 2015 at 10:04 PM Federico Leva (Nemo) wrote:

> I'm finding this document quite useful:
>
> http://www.succeed-project.eu/sites/default/files/deliverables/Succeed_600555_WP4_D4.1_RecommendationsOnFormatsAndStandards_v1.1.pdf
>
> See the description of ALTO pasted below, which is a follow-up to
>
> https://lists.wikimedia.org/pipermail/wikisource-l/2014-September/002081.html
>
> We should find a way to convert the transcribed books' HTML to ALTO
> format. :)
>
> Some libraries are apparently using
> http://www.primaresearch.org/tools/Aletheia which seems to be an augmented
> (but unfree?!) version of ScanTailor with a somewhat different purpose.
>
> Nemo
>
> Principles
> ALTO stores layout information and OCR recognized text of pages of any
> kind of printed documents like books, journals and newspapers. ALTO can
> detail technical metadata for describing the layout and content of
> physical resources (text, illustrations, graphics).
> ALTO describes a content page with different views:
> The Description section helps to describe some general settings and
> information of the ALTO file (measurement units, file name, etc.), and
> the production process itself (processing steps, software used, dates
> and actors, etc.)
> The Layout section contains what's on the page. A page is divided into
> several regions (print space; left, right, top and bottom margins). For
> each region, all objects are listed which have been detected inside:
> text blocks, illustrations, graphical elements, composed blocks. Each
> object previously identified is defined by generic attributes: width,
> height, text content (for the String element). Besides, the reading
> order of all the elements can be managed.
> Each ALTO file may also contain a style section where different styles
> (for paragraphs and fonts) are listed.
>
> Use cases
> ALTO is one of the most common formats used by libraries for converting
> text from images. It's used both to deliver digitized contents and to
> preserve these contents. In a delivery perspective, the ability of ALTO
> to store the text content coordinates in a page allows the overlay of
> image and text (multilayer PDF) and highlight search words in a query.
>
> ___
> Wikisource-l mailing list
> Wikisource-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>
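The quoted description says each String element carries its text content
plus position and size, and that this is what allows search words to be
highlighted on the page image. A minimal sketch of that idea in Python
follows. It rests on several assumptions: the sample document is
hand-written with invented coordinates, the element and attribute names
(TextBlock, TextLine, String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) and the
v2 namespace come from the published ALTO schema rather than from the
text above, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample page; the coordinates are invented for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
  </Description>
  <Layout>
    <Page ID="P1" WIDTH="1000" HEIGHT="1500">
      <PrintSpace HPOS="100" VPOS="100" WIDTH="800" HEIGHT="1300">
        <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="800" HEIGHT="40">
          <TextLine HPOS="100" VPOS="100" WIDTH="800" HEIGHT="40">
            <String CONTENT="Free" HPOS="100" VPOS="100" WIDTH="90" HEIGHT="40"/>
            <SP HPOS="190" VPOS="100" WIDTH="10"/>
            <String CONTENT="library" HPOS="200" VPOS="100" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix so any ALTO version can be read."""
    return tag.rsplit("}", 1)[-1]


def words_with_boxes(alto_xml):
    """Return (text, (hpos, vpos, width, height)) per String, in document order."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            box = tuple(
                float(element.get(name, 0))
                for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            words.append((element.get("CONTENT", ""), box))
    return words


def find_boxes(alto_xml, query):
    """Return the boxes of every word equal to `query`, ignoring case."""
    wanted = query.casefold()
    return [
        box
        for text, box in words_with_boxes(alto_xml)
        if text.casefold() == wanted
    ]


if __name__ == "__main__":
    for text, box in words_with_boxes(SAMPLE_ALTO):
        print(text, box)
    print(find_boxes(SAMPLE_ALTO, "library"))
```

A viewer could draw the rectangles returned by find_boxes over the scan
to highlight a hit. Going the other way, from transcribed HTML to ALTO,
would additionally need the coordinates from the original OCR layer, which
this sketch does not attempt.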


Re: [Wikisource-l] Analysed Layout and Text Object (ALTO)

2015-10-05 Thread Alex Brollo
I apologize for using Italian; my aim was to send a personal reply to Nemo.

Being a personal comment, it doesn't deserve an English translation, so
please ignore it.

Alex

2015-10-05 15:05 GMT+02:00 Alex Brollo :

> Interesting; this confirms my old idea that the "heart of Wikisource" is
> nsIndice (the Index namespace), and that the unit of transcription is the
> page in nsPagina (the Page namespace). But it is an isolated opinion: I
> have been contradicted by those (including Wikisourcians of the highest
> international standing) who are convinced that nsIndice and nsPagina are
> merely "proofreading tools".
>
> Obviously the XML structuring of the content, from the little I have seen,
> recalls (is it an evolution of?) the TEI structure. Living inside
> Wikisource, though, I see that the "original sin" of not valuing nsPagina
> risks making things complex, or impossible, besides having wasted
> incredible amounts of energy on "transclusion".
>
> My energy and my enthusiasm are waning
>
> Alex
>
>