Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

David Cuenca Wed, 17 Jul 2013 13:14:05 -0700

I agree with Alex: the XML is not meant for editors to work with directly,
but to improve the text output. If it can be combined with the
Visual Editor to add some pre-formatting and maybe to flag which words
are unclear, that would already be a big improvement.


If, in addition to that, it can be used to compare proofread text with OCR
text for remapping purposes, even better.
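
To make the remapping idea concrete, here is a minimal sketch in Python,
not an existing Proofread extension feature. It assumes the OCR layer can
be reduced to a list of (word, x, y) tuples; the function name
remap_coordinates and the coordinates for the last three words are made up
for illustration. Words are aligned with the standard library's difflib,
and a corrected word inherits the coordinates of the OCR word it replaces.

```python
import difflib

def remap_coordinates(ocr_words, proofread_words):
    """Attach OCR coordinates to proofread words.

    ocr_words: list of (text, x, y) tuples (hypothetical structure).
    proofread_words: list of strings after human correction.
    Returns a list of (word, (x, y)) pairs, with None where no
    coordinates could be carried over.
    """
    ocr_texts = [w[0] for w in ocr_words]
    matcher = difflib.SequenceMatcher(a=ocr_texts, b=proofread_words,
                                      autojunk=False)
    result = [(w, None) for w in proofread_words]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Unchanged words keep their coordinates; one-for-one replacements
        # (a corrected OCR error) inherit the coordinates of the old word.
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result[j] = (proofread_words[j], ocr_words[i][1:])
    return result

# Illustrative data only: x values after the third word are invented.
ocr = [("To", 1, 1), ("be", 5, 1), ("or", 8, 1),
       ("not", 11, 1), ("to", 15, 1), ("bc,", 18, 1)]
proofread = ["To", "be", "or", "not", "to", "be,"]
remapped = remap_coordinates(ocr, proofread)
```

Insertions and deletions made by the proofreader are left without
coordinates here; a real tool would have to decide how to interpolate them.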

Micru

On Wed, Jul 17, 2013 at 3:26 PM, Alex Brollo <[email protected]> wrote:

> Perhaps there's a misinterpretation - I mentioned abbyy.xml, but with no
> plan to import it as is; abbyy.xml is only a surprising data
> container from which to extract anything useful to speed up proofreading
> (and formatting) - nothing more than this.
>
> Just an example: vertical djvu coordinates of lines can be used to get
> font-size; horizontal coordinates of lines can be used to recognize
> centered text; paragraph splitting is valuable; columns can be
> recognized; margins too; with some effort probably poems can pop up.
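
As a rough sketch of the heuristics Alex lists, assuming each OCR line is
available as a bounding box in pixels: the Line class, its field names and
the tolerance value are assumptions for illustration and do not reflect the
actual abbyy.xml schema. Line height stands in for font size, and a line is
treated as centered when its left and right margins are about equal and it
is clearly indented on both sides.

```python
from dataclasses import dataclass

@dataclass
class Line:
    # Hypothetical bounding box of one OCR line, in pixels.
    text: str
    left: int
    top: int
    right: int
    bottom: int

def estimate_font_size(line):
    """Use the height of the line box as a crude font-size proxy."""
    return line.bottom - line.top

def is_centered(line, page_left, page_right, tolerance=10):
    """Guess whether a line is centered between the text-block edges."""
    left_margin = line.left - page_left
    right_margin = page_right - line.right
    # Roughly equal margins, and not simply a full-width line.
    return (abs(left_margin - right_margin) <= tolerance
            and left_margin > tolerance)

# Made-up sample data: a short heading and a full-width body line.
lines = [
    Line("CHAPTER I", 400, 100, 600, 140),
    Line("A full-width line of body text", 100, 200, 900, 224),
]
page_left, page_right = 100, 900

for line in lines:
    print(line.text, estimate_font_size(line),
          is_centered(line, page_left, page_right))
```

The same boxes could feed column and margin detection by clustering the
left and right values, which is left out of this sketch.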
>
> Far from simply importing coordinates, it's a matter of using them as
> best we can; if we discard the data, we lose the information it contains.
>
> Alex
>
>
> 2013/7/17 Lars Aronsson <[email protected]>
>
>> On 07/17/2013 12:57 PM, Alex Brollo wrote:
>>
>>> FineReader OCR stores incredibly detailed information in [...]
>>> abbyy.xml
>>>
>>
>> On the other hand, Wikisource is a wiki that edits wiki text.
>> Sure, you could insert the XML there and let users
>> edit the XML, but that would scare more users away
>> and allow for more mistakes.
>>
>> For example, if proofreading Hamlet,
>>
>>   To be or not to bc, that is the question,
>>
>> anybody can easily spot "bc" and correct that.
>> In the XML version,
>>
>>  <word x=1 y=1>To</word>
>>  <word x=5 y=1>be</word>
>>  <word x=8 y=1>or</word>
>>
>> someone might think that "or" should be a little more
>> to the right, so one user inserts a space between the
>> tag "<word x=8 y=1>" and "or", while another user
>> adjusts the tag to "<word x=9 y=1>". All the tags
>> make it harder to spot the OCR error "bc".
>>
>> Even if you replace manual XML editing with some
>> graphic tool, you get the same ambiguity between
>> adding whitespace and adjusting coordinates.
>>
>> This is a nightmare that we avoid by throwing away
>> all the coordinates and just proofreading the plain text.
>> It is not the perfect system, it's a compromise, in
>> order to get some useful work done.
>>
>>
>> --
>>   Lars Aronsson ([email protected])
>>   Project Runeberg - free Nordic literature - http://runeberg.org/
>>
>>
>>
>> _______________________________________________
>> Wikisource-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>>
>


-- 
Etiamsi omnes, ego non

