https://bugzilla.wikimedia.org/show_bug.cgi?id=21526
Summary: Bug in Djvu text layer extraction
Product: MediaWiki
Version: 1.16-svn
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: Normal
Component: DjVu
AssignedTo: [email protected]
ReportedBy: [email protected]
The bug was encountered on fr.wikisource:
MediaWiki 1.16alpha-wmf (r58524)
PHP 5.2.4-2ubuntu5.7wm1 (apache2handler)
MySQL 4.0.40-wikimedia-log
When the text layer of the DjVu file contains « ") », the MediaWiki parser
produces an empty page, and from then on the text layer is shifted by one page
relative to the image. An example of a problematic DjVu file can be found here:
http://commons.wikimedia.org/w/index.php?title=File:Sima_qian_chavannes_memoires_historiques_v4.djvu&oldid=31865251
In particular, page 80 contains the following text (the scan is of poor
quality): « La quatrième année (.\"),*)()) ». The problem can be seen in the
proofread version of this scan:
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/80&action=edit
: the end of the text is missing
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/81&action=edit
: no text layer
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/82&action=edit
: text layer and image no longer match
I was able to track down and fix the bug in my local MediaWiki installation
(same branch and same revision as fr.wikisource). The problem is located in
DjvuImage::retrieveMetadata (includes/DjvuImage.php:257): the regular
expression treats any ") as the end-of-page marker, but a \ before the
double quote should prevent this interpretation.
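The failure mode can be reproduced in isolation. The sketch below is only an illustration: the one-page dump is a hypothetical sample, and $naive is a simplified stand-in pattern (an assumption, not the actual expression in DjvuImage.php) that, like the behaviour described above, stops at the first ") it meets.

```php
<?php
// Hypothetical one-page dump; the coordinates are made up for illustration.
$txt = <<<'TXT'
(page 0 0 100 100 "La quatrième année (.\"),*)())")
TXT;

// Simplified stand-in pattern (an assumption, not the MediaWiki source):
// it lazily captures everything up to the first ") in the input.
$naive = '/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"(.*?)"\s*\)/s';

preg_match( $naive, $txt, $m );

// The capture ends right after the backslash of the escaped quote,
// so the remainder of the page text is cut off.
echo $m[1], "\n";
```

With this sample the captured text ends at « (.\ », which matches the symptom reported for page 80: the end of the text is missing.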
I replaced the current regular expression with the following one, and the
problem is now fixed:
$reg = "/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"((?>\\\\.|(?:(?!\\\\|\").)++)*?)\"\s*\)/s";
$txt = preg_replace( $reg, "<PAGE value=\"$1\" />", $txt );
A note on the regular expression: it is an adaptation of the usual
expression for matching text between double quotes with backslash as the
escape character, which in Perl would be:
"((?>\\.|[^"\\]++)*?)". The rather ugly (but working) (?:(?!\\\\|\").)
corresponds to the trivial [^"\\]; the difficulty is that [^\"] and [^"]
are not really the same thing.
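As a quick check, the proposed expression can be run against a single page. This is a minimal sketch under stated assumptions: the input line is a hypothetical sample rather than real output for the file above, and the pattern is written in a single-quoted PHP string, which yields the same regular expression as the double-quoted form in the patch.

```php
<?php
// Hypothetical one-page dump; the coordinates are made up for illustration.
$txt = <<<'TXT'
(page 0 0 100 100 "La quatrième année (.\"),*)())")
TXT;

// Pattern from this report. Inside the quotes it accepts either a backslash
// followed by any character, or a run of characters that are neither a
// backslash nor a double quote.
$reg = '/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"((?>\\\\.|(?:(?!\\\\|").)++)*?)"\s*\)/s';

$out = preg_replace( $reg, '<PAGE value="$1" />', $txt );

// The escaped quote is kept inside the captured text instead of ending the page.
echo $out, "\n";
```

On this sample the whole string between the outer quotes is captured, including the escaped \" in the middle, so the page is no longer cut short.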
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l