Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

Andrea Zanni Fri, 19 Jul 2013 00:20:28 -0700

On Fri, Jul 19, 2013 at 8:13 AM, Thibaut Horel <[email protected]> wrote:


>  I don't see the possibility of directly editing the ABBYY xml file
> happening any time soon. In theory, it should be possible, since that is
> somewhat similar to what Visual Editor is doing: providing a WYSIWYG
> interface to edit structured data (html+rdf in VE's case). But that's a
> (very) long-term plan, and its relevance is not even clear to me. In this
> regard, I agree with what David and Alex said.
>
> Still, there are two things we could do with these xml files:
>
> * extract information beyond the raw text to do some pre-formatting prior
> to the page creation: this could include paragraphs, centered text, etc.
> Some good OCR/layout detection software is even able to detect font
> information, like bold or italic. However, and I could be wrong here, it
> seems to me that the impact of such pre-formatting would be limited: when
> proofreading, most of the time is spent correcting OCR mistakes, the
> formatting can be made on-the-go and has an almost negligible time cost.
>

I still think that doing most of the work automatically (if possible) would
be a good idea. I actually like formatting (e.g. bold, italics) much more
than I like proofreading OCR, but I also think that the less burden we give
our proofreaders the better it is.
I mean, if I'm proofreading a text, and I see the text is already well
formatted, it saves time: if it's formatted badly, I can still correct it,
right?
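
As a rough sketch of the kind of automatic pre-formatting discussed here, the
snippet below guesses centered lines and relative font size from line bounding
boxes. Everything in it is an assumption for illustration: the `Line` record,
the `tolerance` threshold, and the text column bounds `column_left` and
`column_right`; the real abbyy.xml would first have to be parsed into such
records. The `{{center|...}}` wrapper stands in for whatever centering template
the local Wikisource uses.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """Hypothetical record for one OCR line: text plus bounding box in pixels."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def is_centered(line, column_left, column_right, tolerance=0.05):
    """Guess whether a line is centered inside the text column.

    A line counts as centered when its left and right margins are nearly
    equal and both are clearly larger than zero, so that full-width lines
    are not reported as centered.
    """
    width = column_right - column_left
    left_margin = line.left - column_left
    right_margin = column_right - line.right
    balanced = abs(left_margin - right_margin) <= tolerance * width
    indented = min(left_margin, right_margin) > tolerance * width
    return balanced and indented


def relative_heights(lines):
    """Return each line's box height divided by the median height.

    Values well above 1.0 hint at a larger font (for example a heading).
    """
    heights = [line.bottom - line.top for line in lines]
    typical = median(heights)
    return [height / typical for height in heights]


def preformat(lines, column_left, column_right):
    """Build wikitext, wrapping lines that look centered in a template."""
    out = []
    for line in lines:
        if is_centered(line, column_left, column_right):
            out.append("{{center|" + line.text + "}}")
        else:
            out.append(line.text)
    return "\n".join(out)
```

The proofreader would still see plain wikitext and could undo a wrong guess by
deleting the template, which is the point made above: bad pre-formatting can
still be corrected by hand.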


> * import the proofread text back into the xml file. By doing so, we would
> recover the position of words across the page for the proofread text. This
> would allow us to provide PDFs with a curated text layer. Such PDFs would
> be truly and fully searchable, which I think would be highly valuable for
> bibliophiles. This task essentially requires aligning two texts: mapping each
> word in the proofread text to one word in the original ABBYY file (this is not
> entirely accurate since two words are sometimes recognized as a single word
> by the OCR, and vice versa). I have a few ideas on how to properly solve
> this problem: it is actually very similar (and even simpler!) to the
> so-called "phrase alignment" problem found in machine translation and
> natural language processing, and the probabilistic models it uses could
> easily be adapted to our problem. I know that some attempts have been made
> in the past to tackle this problem, but I don't have a clear view of what
> has been tried exactly, and how successful the attempts were. I would
> highly appreciate any information you could have about this.
>
I think Seb35 studied the subject a bit a few years ago, with all the
probabilistic things and Markov chains and funny stuff you all like :-)
(It always amazes me how many mathematicians or the like are involved in
Wikisource. My conclusion is that we like to put order in abstract spaces.)
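
To make the alignment problem concrete, here is a minimal sketch that maps
proofread words back onto OCR word boxes. It is not the probabilistic
phrase-alignment model mentioned above: it swaps in a plain sequence diff from
Python's standard `difflib`, and the `(text, box)` input format and the
`align` and `union_box` helpers are assumptions made up for the example. When
the OCR merged or split words, every proofread word in that span simply gets
the box covering the whole span.

```python
from difflib import SequenceMatcher


def union_box(boxes):
    """Smallest (left, top, right, bottom) box covering all given boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def align(ocr_words, proofread_words):
    """Attach an OCR box to each proofread word.

    ocr_words: list of (text, box) pairs, box = (left, top, right, bottom).
    proofread_words: list of corrected words, in reading order.
    Returns a list of (proofread_word, box_or_None) pairs.
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # OCR noise with no counterpart in the proofread text: drop it.
            continue
        if tag == "insert":
            # Words the OCR missed entirely: no position is known.
            result.extend((word, None) for word in proofread_words[j1:j2])
        elif i2 - i1 == j2 - j1:
            # Identical words, or corrected words in equal number:
            # pair them up one to one.
            for (_, box), word in zip(ocr_words[i1:i2], proofread_words[j1:j2]):
                result.append((word, box))
        else:
            # Words merged or split by the OCR: share the covering box.
            box = union_box([b for _, b in ocr_words[i1:i2]])
            result.extend((word, box) for word in proofread_words[j1:j2])
    return result
```

A diff like this only looks at exact word equality, so it will struggle on
pages with many OCR errors in a row; that is where a probabilistic model of
likely character confusions would presumably do better.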

Aubrey



> Thibaut
>
>
> On 07/17/2013 10:13 PM, David Cuenca wrote:
>
> I agree with Alex, the xml is not about getting editors to work with it,
> but to improve the output of the text. If it can be combined with the
> Visual Editor to add some pre-formatting and maybe signaling which words
> are unclear, that would already be a big improvement.
>
> If in addition to that, it can be used to compare proofread text with ocr
> text for remapping purposes, even better.
>
> Micru
>
> On Wed, Jul 17, 2013 at 3:26 PM, Alex Brollo <[email protected]> wrote:
>
>> Perhaps there's a misinterpretation - I mentioned abbyy.xml but with no
>> project to import it as-is; abbyy.xml is only a surprising data
>> container from which to extract anything useful to speed up proofreading
>> (and formatting) - nothing more than this.
>>
>>  Just an example: vertical djvu coordinates of lines can be used to get
>> font size; horizontal coordinates of lines can be used to recognize
>>  centered text; paragraph splitting is valuable; columns can be
>> recognized; margins too; with some effort probably poems can pop up.
>>
>>  Far from simply importing coordinates, it's a matter of using them as
>> best we can; no data, no information content.
>>
>> Alex
>>
>>
>>  2013/7/17 Lars Aronsson <[email protected]>
>>
>>> On 07/17/2013 12:57 PM, Alex Brollo wrote:
>>>
>>>> FineReader OCR stores incredibly detailed information in [...]
>>>> abbyy.xml
>>>>
>>>
>>> At the other end, Wikisource is a wiki that edits wiki text.
>>> Sure, you could insert the XML there and let users
>>> edit the XML, but that would scare more users away
>>> and allow for more mistakes.
>>>
>>> For example, if proofreading Hamlet,
>>>
>>>   To be or not to bc, that is the question,
>>>
>>> anybody can easily spot "bc" and correct that.
>>> In the XML version,
>>>
>>>  <word x=1 y=1>To</word>
>>>  <word x=5 y=1>be</word>
>>>  <word x=8 y=1>or</word>
>>>
>>> someone might think that "or" should be a little more
>>> to the right, so one user inserts a space between the
>>> tag "<word x=8 y=1>" and "or", while another user
>>> adjusts the tag to "<word x=9 y=1>". All the tags
>>> make it harder to spot the OCR error "bc".
>>>
>>> Even if you replace manual XML editing with some
>>> graphic tool, you get the same ambiguity between
>>> adding whitespace and adjusting coordinates.
>>>
>>> This is a nightmare that we avoid by throwing away
>>> all the coordinates and just proofreading the plain text.
>>> It is not the perfect system, it's a compromise, in
>>> order to get some useful work done.
>>>
>>>
>>> --
>>>   Lars Aronsson ([email protected])
>>>   Project Runeberg - free Nordic literature - http://runeberg.org/
>>>
>>>
>>>
>>> _______________________________________________
>>> Wikisource-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>>>
>>
> --
> Etiamsi omnes, ego non
