https://bugzilla.wikimedia.org/show_bug.cgi?id=57807
Web browser: ---
Bug ID: 57807
Summary: Merge proofread text back into Djvu files
Product: MediaWiki extensions
Version: unspecified
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: normal
Priority: Unprioritized
Component: Extensions requests
Assignee: [email protected]
Reporter: [email protected]
Classification: Unclassified
Mobile Platform: ---
Wikisource, the free library, has an enormous collection of DjVu files and
proofread texts based on those scans. However, while the DjVu files contain a
text layer, this text is the original computer-generated (OCR) text and not the
volunteer-proofread text. There is some previous work on merging the
proofread text as a blob into pages, and also on finding similar words to be
used as anchors for text re-mapping. The idea is to create an export tool that
gets word positions and confidence levels using Tesseract and then re-maps
the proofread text into the text layer of the DjVu file. If possible, word
coordinates should be kept.
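
The anchor-based re-mapping described above could look roughly like the
following. This is only a sketch under assumptions: it uses Python's standard
difflib as a stand-in for the "similar words as anchors" matching, the function
names remap_words and union_box are hypothetical, and it neither calls
Tesseract nor writes a DjVu text layer. It only shows how proofread words might
inherit the coordinates of the OCR words they align with.

```python
from difflib import SequenceMatcher


def union_box(boxes):
    """Smallest (x0, y0, x1, y1) rectangle covering all the given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def remap_words(ocr_words, proofread_words):
    """Attach OCR coordinates to proofread words (hypothetical helper).

    ocr_words:       list of (text, (x0, y0, x1, y1)) pairs from the OCR step
    proofread_words: list of strings from the proofread page
    Returns a list of (proofread_text, box_or_None) pairs.
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # OCR words with no proofread counterpart are dropped.
            continue
        if tag == "insert":
            # Proofread words with no OCR counterpart: position unknown.
            result.extend((word, None) for word in proofread_words[j1:j2])
        elif i2 - i1 == j2 - j1:
            # Identical words (the anchors) or a same-length correction:
            # pair the words one-to-one and keep each OCR box.
            for k in range(i2 - i1):
                result.append((proofread_words[j1 + k], ocr_words[i1 + k][1]))
        else:
            # Different word counts: share the box covering the OCR span.
            box = union_box([b for _, b in ocr_words[i1:i2]])
            result.extend((word, box) for word in proofread_words[j1:j2])
    return result
```

Here the exactly matching words act as anchors, and the corrected words
between two anchors reuse the coordinates of the OCR words they replace. A
real tool would probably need fuzzy word matching and a policy for words
whose position is unknown (None above).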
Project proposed by Micru. I have found an external mentor who could give
a hand with Tesseract; now I'm looking for a mentor who would provide
assistance with MediaWiki.
Aubrey can be a mentor, providing assistance regarding Wikisource and some of
the past history of this issue. Not much, but glad to help if needed.
Rtdwivedi is willing to be a mentor.
URL: https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Merge_proofread_text_back_into_Djvu_files