Thanks Alex! I really hope this is a direction other developers will follow: being able to harness the full potential of structured data from OCR software is absolutely crucial for Wikisource. We could actually automate *a lot* of the formatting work now done by volunteers, and their time could still be spent formatting, proofreading and validating, but with much more power than before. IMO, it changes a lot if a book is formatted ~50% by a machine: we could do many more books in less time. Go Alex!
Aubrey

On Mon, Oct 16, 2017 at 5:42 PM, Asaf Bartov <[email protected]> wrote:

> That's really promising!
>
> Thank you for sharing this.
>
>    A.
>
> On Oct 17, 2017 00:11, "Alex Brollo" <[email protected]> wrote:
>
>> Here:
>> Pagina:D'Ayala_-_Dizionario_militare_francese_italiano.djvu/46
>> <https://it.wikisource.org/wiki/Pagina:D%27Ayala_-_Dizionario_militare_francese_italiano.djvu/46>
>> and immediately previous and following pages both the text and some
>> formatting from Internet Archive file bub_gb_lvzoCyRdzsoC_abbyy.gz
>> <https://archive.org/download/bub_gb_lvzoCyRdzsoC/bub_gb_lvzoCyRdzsoC_abbyy.gz>
>> (in previous pages only some templates have been added and a little bit
>> of regex manipulation has been done)
>>
>> Internet Archive _abbyy.gz files are gzipped, enormous xml files where
>> any detail of FineReader OCR output is exported - but, even if enormous and
>> terribly complex, they can be parsed and any detail (a little bit
>> painfully...) can be used; presently, only bold, italic, smallcaps and
>> paragraphs have been explored, translated into wiki code by a pretty
>> simple python code.
>>
>> Alex
>>
>> _______________________________________________
>> Wikisource-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
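
For anyone curious, the kind of conversion Alex describes (bold, italic, smallcaps and paragraphs turned into wiki code) could look roughly like the sketch below. This is not Alex's script, only an illustration under assumptions: the element and attribute names (`par`, `line`, `formatting`, `charParams`, `bold`, `italic`, `smallcaps`) follow my reading of the FineReader XML export and should be checked against a real _abbyy.gz file, the `{{Sc|...}}` template is assumed for small caps, and the function names, the sample text and its placeholder namespace are all invented here.

```python
import xml.etree.ElementTree as ET

# Assumption: flags may be serialized as "1" or "true"; check a real file.
TRUE_VALUES = {"1", "true"}


def local(tag):
    """Drop the XML namespace so the code does not depend on the schema version."""
    return tag.rsplit("}", 1)[-1]


def wrap(text, attrib):
    """Wrap one run of text in wiki markup according to its formatting flags."""
    if not text.strip():
        return text  # leave whitespace-only runs unformatted
    if attrib.get("smallcaps", "").lower() in TRUE_VALUES:
        text = "{{Sc|" + text + "}}"  # assumed small-caps template
    bold = attrib.get("bold", "").lower() in TRUE_VALUES
    italic = attrib.get("italic", "").lower() in TRUE_VALUES
    quotes = "'" * (3 * bold + 2 * italic)  # ''' bold, '' italic, ''''' both
    return quotes + text + quotes


def paragraphs_to_wikitext(xml_source):
    """Return one wikitext string per <par> element found in the XML."""
    root = ET.fromstring(xml_source)
    result = []
    for par in root.iter():
        if local(par.tag) != "par":
            continue
        lines = []
        for line in par:
            if local(line.tag) != "line":
                continue
            runs = []
            for fmt in line:
                if local(fmt.tag) != "formatting":
                    continue
                cells = [c.text or "" for c in fmt if local(c.tag) == "charParams"]
                # Fall back to the run's own text if there are no per-character cells.
                text = "".join(cells) if cells else (fmt.text or "")
                runs.append(wrap(text, fmt.attrib))
            lines.append("".join(runs))
        # Naive line joining: hyphenated line breaks are not repaired here.
        result.append(" ".join(lines))
    return result


def chars(s):
    """Build per-character cells for the invented sample below."""
    return "".join(f"<charParams>{c}</charParams>" for c in s)


# Invented sample with a placeholder namespace, not taken from the real file.
SAMPLE = (
    '<document xmlns="urn:example:abbyy"><page><block blockType="Text"><text>'
    f'<par><line><formatting bold="1">{chars("Abri")}</formatting>'
    f'<formatting>{chars(", ")}</formatting>'
    f'<formatting italic="1">{chars("s. m.")}</formatting></line>'
    f'<line><formatting smallcaps="1">{chars("Ricovero")}</formatting></line></par>'
    "</text></block></page></document>"
)

if __name__ == "__main__":
    for paragraph in paragraphs_to_wikitext(SAMPLE):
        print(paragraph)
```

On a real file one would first decompress it with Python's `gzip` module, and since these files are enormous, a streaming parser such as `xml.etree.ElementTree.iterparse` is probably a better fit than loading the whole tree at once as this sketch does.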
