https://bugzilla.wikimedia.org/show_bug.cgi?id=35122
Web browser: ---
Bug #: 35122
Summary: Extracted OCR text layer is mutilated
Product: MediaWiki extensions
Version: any
Platform: All
OS/Version: All
Status: UNCONFIRMED
Severity: normal
Priority: Unprioritized
Component: PdfHandler
AssignedTo: [email protected]
ReportedBy: [email protected]
Classification: Unclassified
Mobile Platform: ---
[[Commons:File:Иннокентий Анненский - Царь Иксион, 1902.pdf]]
or
http://commons.wikimedia.org/wiki/File:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
in the new version uploaded March 10, 2012,
is a PDF/A file with page images and an OCR text layer, generated
by the ABBYY FineReader OCR software.
The program pdftotext extracts the OCR text layer, which for the
first page begins: "Дннѳнскій.\n\nТ Р А Г Е Д І Я\nВЪ пяти ДѢЙСТВІЯХЪ\n".
(This text contains a few OCR errors, such as the initial "Д", which
is a misinterpreted "А", but this is entirely normal.)
The pdftotext output, piped through "od -c", begins:
0000000 320 224 320 275 320 275 321 263 320 275 321 201 320 272 321 226
0000020 320 271   .  \n  \n 320 242     320 240     320 220     320 223
0000040     320 225     320 224     320 206     320 257  \n 320 222 320
0000060 252     320 277 321 217 321 202 320 270     320 224 321 242 320
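As a cross-check that the dump above is well-formed UTF-8, the bytes can be decoded strictly. This is only a sketch: the space bytes (040) are an assumption, filled in from the quoted text because "od -c" prints a space as a blank column, and the final 320 is left out because it is the first half of a two-byte sequence cut off by the dump. The names OCTAL_DUMP and decode_octal_dump are made up for this illustration.

```python
# Bytes from the "od -c" dump, written in octal. Period, newline and
# space are spelled out as 056, 012 and 040; the positions of the
# spaces are assumed from the quoted text. The dangling final 320
# (an incomplete two-byte sequence) is omitted.
OCTAL_DUMP = """
320 224 320 275 320 275 321 263 320 275 321 201 320 272 321 226
320 271 056 012 012 320 242 040 320 240 040 320 220 040 320 223
040 320 225 040 320 224 040 320 206 040 320 257 012 320 222 320
252 040 320 277 321 217 321 202 320 270 040 320 224 321 242
"""


def decode_octal_dump(dump: str) -> str:
    """Turn whitespace-separated octal byte values into a str."""
    raw = bytes(int(token, 8) for token in dump.split())
    # Strict decoding raises UnicodeDecodeError on malformed input,
    # so success here means the byte sequence is valid UTF-8.
    return raw.decode("utf-8")


text = decode_octal_dump(OCTAL_DUMP)
print(text)
```

The decode succeeds and yields the Cyrillic text quoted above, so the pdftotext output itself looks like valid UTF-8 before it reaches PdfHandler.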
However, when the ProofreadPage extension tries to extract the text,
using the PdfHandler, the text passes through UtfNormal::cleanUp()
(line 140 of source file extensions/PdfHandler/PdfHandler.image.php),
and only the period, newline, some hyphens and digits come through.
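To pin down exactly what is being dropped, the text before and after the cleanup step can be compared character by character. The sketch below does not reproduce UtfNormal::cleanUp(); the "after" string is a hypothetical example of the symptom described above, and lost_characters is a made-up helper name.

```python
from collections import Counter


def lost_characters(before: str, after: str) -> Counter:
    """Count characters present in `before` but missing from `after`."""
    # Counter subtraction keeps only positive counts.
    return Counter(before) - Counter(after)


before = "Дннѳнскій.\n\n"   # start of the pdftotext output
after = ".\n\n"              # hypothetical: what survives on the wiki

lost = lost_characters(before, after)
# True when every lost character lies outside the ASCII range.
only_non_ascii_lost = all(ord(ch) > 127 for ch in lost)
print(sum(lost.values()), only_non_ascii_lost)
```

If every lost character turns out to be non-ASCII, that would point at the multi-byte sequences being rejected as a whole rather than at damage to individual letters.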
Try this at the Russian Wikisource by clicking the red-linked page numbers:
http://ru.wikisource.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
Pages are correctly split on \f (form feed).
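For reference, the page split amounts to something like the following sketch. It assumes pdftotext's default behaviour of writing a form feed after each page; split_pages is a made-up name, not the function PdfHandler uses.

```python
def split_pages(text: str) -> list:
    """Split pdftotext output into per-page strings on form feed."""
    pages = text.split("\f")
    # A form feed after the last page leaves one empty trailing
    # element; drop it so the count matches the number of pages.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


sample = "Дннѳнскій.\n\fвторая страница\n\f"
pages = split_pages(sample)
print(len(pages))
```

Since the split only looks for the single ASCII byte \f, it is not affected by whatever is mangling the Cyrillic text.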