[Bug 35122] New: Extracted OCR text layer is mutilated

bugzilla-daemon Sat, 10 Mar 2012 03:07:59 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=35122


       Web browser: ---
             Bug #: 35122
           Summary: Extracted OCR text layer is mutilated
           Product: MediaWiki extensions
           Version: any
          Platform: All
        OS/Version: All
            Status: UNCONFIRMED
          Severity: normal
          Priority: Unprioritized
         Component: PdfHandler
        AssignedTo: [email protected]
        ReportedBy: [email protected]
    Classification: Unclassified
   Mobile Platform: ---


[[Commons:File:Иннокентий Анненский - Царь Иксион, 1902.pdf]]
or
http://commons.wikimedia.org/wiki/File:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
in the new version uploaded March 10, 2012,
is a PDF/A file with page images and an OCR text layer, generated
by the ABBYY FineReader OCR software.

The program pdftotext extracts the OCR text layer, which for the
first page begins: "Дннѳнскій.\n\nТ Р А Г Е Д І Я\nВЪ пяти ДѢЙСТВІЯХЪ\n".
(This text contains a few OCR errors, such as the initial "Д", which
is a misinterpreted "А", but this is entirely normal.)

The pdftotext output, piped through "od -c", begins:
 0000000 320 224 320 275 320 275 321 263 320 275 321 201 320 272 321 226
 0000020 320 271   .  \n  \n 320 242     320 240     320 220     320 223
 0000040     320 225     320 224     320 206     320 257  \n 320 222 320
 0000060 252     320 277 321 217 321 202 320 270     320 224 321 242 320
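
To confirm that these bytes are well-formed UTF-8, the octal values
can be decoded by hand. The following is only a sketch for checking
the dump: the byte values are copied from the first "od -c" row plus
the first two bytes of the second row, and the helper name
decode_octal_dump is made up for this illustration.

```python
# Octal byte values copied from the "od -c" dump above: all 16 bytes
# of the first row, then the first two bytes of the second row.
OCTAL_DUMP = (
    "320 224 320 275 320 275 321 263 "
    "320 275 321 201 320 272 321 226 "
    "320 271"
)


def decode_octal_dump(dump):
    """Turn space-separated octal byte values into text.

    The decode is strict, so malformed UTF-8 raises UnicodeDecodeError
    instead of being silently replaced.
    """
    raw = bytes(int(token, 8) for token in dump.split())
    return raw.decode("utf-8")


# The dump continues with a literal "." right after these bytes.
first_word = decode_octal_dump(OCTAL_DUMP) + "."
print(first_word)  # prints: Дннѳнскій.
```

The strict decode succeeds and gives back the first word quoted above,
so the pdftotext output itself looks like valid UTF-8.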

However, when the ProofreadPage extension extracts the text through
PdfHandler, the text passes through UtfNormal::cleanUp()
(line 140 of source file extensions/PdfHandler/PdfHandler.image.php),
and only the period, newlines, some hyphens and digits survive.
Try this at the Russian Wikisource, by clicking the red-linked page numbers,
http://ru.wikisource.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
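
For comparison, here is a rough check of what normalization alone
should do to this text. It assumes that UtfNormal::cleanUp() is meant
to drop invalid UTF-8 and convert to normalization form C, and it uses
Python's unicodedata module as a stand-in, so it is not the PHP code
path itself. The helper count_cyrillic is made up for this sketch.

```python
import unicodedata

# Start of the first page's text layer, as quoted in this report.
sample = "Дннѳнскій.\n\nТ Р А Г Е Д І Я\nВЪ пяти ДѢЙСТВІЯХЪ\n"


def count_cyrillic(text):
    """Count characters in the basic Cyrillic block (U+0400..U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")


# NFC is used here as a stand-in for the normalization step.
normalized = unicodedata.normalize("NFC", sample)

before = count_cyrillic(sample)
after = count_cyrillic(normalized)
print(before, after)
```

If both counts are equal, normalization to form C by itself does not
remove the Cyrillic letters, which would suggest the loss happens
elsewhere in the handling of the text.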

Pages are correctly split on \f (form feed).
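
The page split can be pictured as follows. This is a sketch of the
idea only, not the PdfHandler code; split_pages is a made-up name, and
it assumes the extractor writes one form feed after every page,
including the last one.

```python
def split_pages(text_layer):
    """Split an extracted text layer into pages on form feed."""
    pages = text_layer.split("\f")
    # A form feed after the last page leaves one empty trailing
    # element, which is not a real page.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


pages = split_pages("Дннѳнскій.\n\fsecond page\n\f")
print(len(pages))  # prints: 2
```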

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
