On Fri, Jul 19, 2013 at 8:13 AM, Thibaut Horel <[email protected]> wrote:
> I don't see the possibility of directly editing the ABBYY xml file
> happening any time soon. In theory, it should be possible, since that is
> somewhat similar to what Visual Editor is doing: providing a WYSIWYG
> interface to edit structured data (html+rdf in VE's case). But that's a
> (very) long-term plan, and its relevance is not even clear to me. In this
> regard, I agree with what David and Alex said.
>
> Still, there are two things we could do with these xml files:
>
> * extract information beyond the raw text to do some pre-formatting prior
> to the page creation: this could include paragraphs, centered text, etc.
> Some good OCR/layout detection software is even able to detect font
> information, like bold or italic. However, and I could be wrong here, it
> seems to me that the impact of such pre-formatting would be limited: when
> proofreading, most of the time is spent correcting OCR mistakes; the
> formatting can be done on the go and has an almost negligible time cost.
>

I still think that doing most of the work automatically (if possible) would
be a good idea. I actually like formatting (e.g. bold, italics) much more
than I like proofreading OCR, but I also think that the less burden we put on
our proofreaders, the better. I mean, if I'm proofreading a text and I see
that it is already well formatted, it saves time; if it's formatted badly, I
can still correct it, right?

> * import the proofread text back into the xml file. By doing so, we would
> recover the position of words across the page for the proofread text. This
> would allow us to provide PDFs with a curated text layer. Such PDFs would
> be truly and fully searchable, which I think would be highly valuable for
> bibliophiles. This task requires aligning two texts: mapping each word
> in the proofread text to one word in the original ABBYY file (this is not
> entirely accurate since two words are sometimes recognized as a single word
> by the OCR, and vice versa).
> I have a few ideas on how to properly solve
> this problem: it is actually very similar (and even simpler!) to the
> so-called "phrase alignment" problem found in machine translation and
> natural language processing, and the probabilistic models it uses could
> easily be adapted to our problem. I know that some attempts have been made
> in the past to tackle this problem, but I don't have a clear view of what
> has been tried exactly, and how successful the attempts were. I would
> highly appreciate any information you could have about this.
>

I think Seb35 studied the subject a bit a few years ago, with all the
probabilistic things and Markov chains and funny stuff you all like :-)
(It always amazes me how many mathematicians and the like are involved in
Wikisource. My conclusion is that we like to put order in abstract spaces.)

Aubrey

> Thibaut
>
>
> On 07/17/2013 10:13 PM, David Cuenca wrote:
>
> I agree with Alex, the xml is not about getting editors to work with it,
> but about improving the output of the text. If it can be combined with the
> Visual Editor to add some pre-formatting and maybe signal which words
> are unclear, that would already be a big improvement.
>
> If in addition to that, it can be used to compare proofread text with ocr
> text for remapping purposes, even better.
>
> Micru
>
> On Wed, Jul 17, 2013 at 3:26 PM, Alex Brollo <[email protected]> wrote:
>
>> Perhaps there's a misinterpretation - I mentioned abbyy.xml but with no
>> project to import it as-it-is; abbyy.xml is only a surprising data
>> container from which to extract anything useful to speed up proofreading
>> (and formatting) - nothing more than this.
>>
>> Just an example: vertical djvu coordinates of lines can be used to get
>> font-size; horizontal coordinates of lines can be used to recognize
>> centered text; paragraph splitting is valuable; columns can be
>> recognized; margins too; with some effort poems can probably pop up.
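To make Alex's example about horizontal coordinates concrete, here is a minimal sketch of how a centered line might be recognized. It is only an illustration: the Line class, its field names, and the tolerance values are assumptions made up for this example, not part of abbyy.xml or of any existing Wikisource tool.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One OCR line with its horizontal extent (hypothetical structure)."""
    text: str
    left: int   # x coordinate of the left edge of the line
    right: int  # x coordinate of the right edge of the line


def is_centered(line, page_left, page_right, tolerance=0.05, min_indent=0.10):
    """Guess whether a line is centered between the text-block margins.

    A line counts as centered when the gaps on both sides are nearly equal
    (within `tolerance` of the block width) and both are clearly larger
    than zero (at least `min_indent` of the block width), so that ordinary
    full-width lines are not reported as centered.
    """
    width = page_right - page_left
    left_gap = line.left - page_left
    right_gap = page_right - line.right
    balanced = abs(left_gap - right_gap) <= tolerance * width
    indented = min(left_gap, right_gap) >= min_indent * width
    return balanced and indented


# Example with made-up coordinates: a short heading and a full-width line.
heading = Line("CHAPTER I", left=400, right=600)
body = Line("It was the best of times, it was the worst of times,", 100, 900)
print(is_centered(heading, 100, 900), is_centered(body, 100, 900))
```

The same idea would extend to font size by comparing the vertical extent of a line with the typical line height on the page, but the thresholds would have to be tuned on real scans.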
>>
>> Far from simply importing coordinates, it's a matter of using them as
>> best we can; no data, no information content.
>>
>> Alex
>>
>>
>> 2013/7/17 Lars Aronsson <[email protected]>
>>
>>> On 07/17/2013 12:57 PM, Alex Brollo wrote:
>>>
>>>> FineReader OCR stores incredibly detailed information in [...]
>>>> abbyy.xml
>>>>
>>>
>>> At the other end, Wikisource is a wiki that edits wiki text.
>>> Sure, you could insert the XML there and let users
>>> edit the XML, but that would scare more users away
>>> and allow for more mistakes.
>>>
>>> For example, if proofreading Hamlet,
>>>
>>> To be or not to bc, that is the question,
>>>
>>> anybody can easily spot "bc" and correct that.
>>> In the XML version,
>>>
>>> <word x=1 y=1>To</word>
>>> <word x=5 y=1>be</word>
>>> <word x=8 y=1>or</word>
>>>
>>> someone might think that "or" should be a little more
>>> to the right, so one user inserts a space between the
>>> tag "<word x=8 y=1>" and "or", while another user
>>> adjusts the tag to "<word x=9 y=1>". All the tags
>>> make it harder to spot the OCR error "bc".
>>>
>>> Even if you replace manual XML editing with some
>>> graphic tool, you get the same ambiguity between
>>> adding whitespace and adjusting coordinates.
>>>
>>> This is a nightmare that we avoid by throwing away
>>> all the coordinates and just proofreading the plain text.
>>> It is not the perfect system, it's a compromise, in
>>> order to get some useful work done.
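As a rough illustration of the remapping idea discussed in this thread (proofread plain text, then recover word positions afterwards), here is a minimal sketch that aligns proofread words with OCR words. It uses Python's standard difflib.SequenceMatcher as a simple stand-in, not the probabilistic alignment models Thibaut mentions. The function name align_words and the (text, x, y) tuples are assumptions for this example; the coordinates for "To", "be" and "or" follow Lars's example and the remaining ones are made up.

```python
from difflib import SequenceMatcher


def align_words(ocr_words, proofread_words):
    """Map each proofread word to the (x, y) position of an OCR word.

    ocr_words is a list of (text, x, y) tuples, proofread_words a list of
    strings. Returns a list of (proofread_word, (x, y) or None) pairs.
    """
    ocr_texts = [word[0] for word in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            # Identical words, or the same number of corrected words:
            # assume a one-to-one correspondence and reuse the positions.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                aligned.append((proofread_words[j], ocr_words[i][1:]))
        elif tag in ("replace", "insert"):
            # Words merged or split by the OCR: no safe one-to-one mapping,
            # so the position is left unknown in this sketch.
            for j in range(j1, j2):
                aligned.append((proofread_words[j], None))
        # "delete": OCR words without a proofread counterpart are dropped.
    return aligned


ocr = [("To", 1, 1), ("be", 5, 1), ("or", 8, 1),
       ("not", 11, 1), ("to", 15, 1), ("bc", 18, 1)]
proofread = "To be or not to be".split()
for word, position in align_words(ocr, proofread):
    print(word, position)
```

In this sketch the corrected word "be" simply inherits the position of the misread "bc". Merged or split words are the hard part, and that is where a proper alignment model would be needed.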
>>>
>>> --
>>> Lars Aronsson ([email protected])
>>> Project Runeberg - free Nordic literature - http://runeberg.org/
>>>
>>> _______________________________________________
>>> Wikisource-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>>>
>
> --
> Etiamsi omnes, ego non
>
