https://bugzilla.wikimedia.org/show_bug.cgi?id=57807

       Web browser: ---
            Bug ID: 57807
           Summary: Merge proofread text back into Djvu files
           Product: MediaWiki extensions
           Version: unspecified
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: normal
          Priority: Unprioritized
         Component: Extensions requests
          Assignee: [email protected]
          Reporter: [email protected]
    Classification: Unclassified
   Mobile Platform: ---

Merge proofread text back into Djvu files

Wikisource, the free library, has an enormous collection of DjVu files and
proofread texts based on those scans. However, while the DjVu files contain a
text layer, that layer holds the original computer-generated (OCR) text, not
the volunteer-proofread text. There is some previous work on merging the
proofread text into pages as a blob, and on finding similar words to use as
anchors for text re-mapping. The idea is to create an export tool that gets
word positions and confidence levels from Tesseract and then re-maps the
proofread text back into the text layer of the DjVu file. If possible, word
coordinates should be kept.
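
As a rough illustration of the anchor-based re-mapping described above, here
is a minimal Python sketch. It assumes the OCR words and their bounding boxes
have already been extracted (for example from Tesseract output) into a plain
list, and it uses the standard library's difflib to align them with the
proofread words. The function name remap_words and the (text, box) tuple
layout are hypothetical choices made for this example; they do not come from
any existing Wikisource or DjVu tool.

```python
from difflib import SequenceMatcher


def remap_words(ocr_words, proofread_words):
    """Attach OCR bounding boxes to proofread words.

    ocr_words:       list of (text, (x0, y0, x1, y1)) tuples (assumed layout)
    proofread_words: list of str
    Returns a list of (proofread_text, box_or_None).
    """
    ocr_texts = [text for text, _box in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)

    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical words act as anchors: reuse the OCR box directly.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result.append((proofread_words[j], ocr_words[i][1]))
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same word count between two anchors: assume each proofread
            # word is a correction of the OCR word in the same position.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result.append((proofread_words[j], ocr_words[i][1]))
        elif tag in ("replace", "insert"):
            # Word counts differ, so no box can be assigned reliably.
            for j in range(j1, j2):
                result.append((proofread_words[j], None))
        # tag == "delete": OCR words removed by proofreaders; nothing to emit.
    return result


if __name__ == "__main__":
    ocr = [
        ("Tbe", (0, 0, 10, 5)),
        ("quick", (12, 0, 30, 5)),
        ("brovvn", (32, 0, 50, 5)),
        ("fox", (52, 0, 60, 5)),
    ]
    for word, box in remap_words(ocr, ["The", "quick", "brown", "fox"]):
        print(word, box)
```

Words that end up without a box would still need coordinates from somewhere,
for instance by interpolating between neighbouring anchors; that step, and
writing the result into the DjVu text layer, are left out of this sketch.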

    Project proposed by Micru. I have found an external mentor who could help
with Tesseract; now I'm looking for a mentor who would provide assistance
with MediaWiki.
    Aubrey can be a mentor providing assistance regarding Wikisource and some
past history of this issue. Not much, but glad to help if needed.
    Rtdwivedi is willing to be a mentor.


URL:https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Merge_proofread_text_back_into_Djvu_files
