https://bugzilla.wikimedia.org/show_bug.cgi?id=21526

           Summary: Bug in Djvu text layer extraction
           Product: MediaWiki
           Version: 1.16-svn
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: Normal
         Component: DjVu
        AssignedTo: [email protected]
        ReportedBy: [email protected]


This bug was encountered on fr.wikisource:

MediaWiki       1.16alpha-wmf (r58524)
PHP     5.2.4-2ubuntu5.7wm1 (apache2handler)
MySQL   4.0.40-wikimedia-log

When the text layer of a DjVu file contains « ") », MediaWiki produces an
empty page, and from that point on the text layer is shifted by one page
relative to the image. An example of a problematic DjVu file can be found here:

http://commons.wikimedia.org/w/index.php?title=File:Sima_qian_chavannes_memoires_historiques_v4.djvu&oldid=31865251

In particular, page 80 contains the following text (the scan quality is
poor): « La quatrième année (.\"),*)()) ». The problem can be seen in the
proofread version of this scan:

http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/80&action=edit
: the end of the text is missing
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/81&action=edit
: no text layer
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/82&action=edit
: text layer and image no longer match

I was able to track down and fix the bug in my local MediaWiki installation
(same branch, same revision as fr.wikisource). The problem is located in
DjvuImage::retrieveMetadata (includes/DjvuImage.php:257): the regular
expression treats any ") as the end-of-page marker, but a \ before the
double quote should prevent this interpretation.
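
The truncation can be reproduced outside MediaWiki with a small sketch. The
pattern below is an assumed simplification that stops at the first ") it
sees; it is not the actual expression from DjvuImage.php, and the two-page
input is a hypothetical sample modelled on the page 80 text quoted above.

```php
<?php
// Hypothetical sample of the extracted text: two pages, the first one
// containing an escaped double quote followed by a closing parenthesis.
$txt = <<<'TXT'
(page 0 0 100 100 "La quatrième année (.\"),*)())")
(page 0 0 100 100 "Next page")
TXT;

// Assumed simplification: the page text ends at the first ") found,
// whether or not the double quote is preceded by a backslash.
$naive = <<<'RE'
/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"(.*?)"\s*\)/s
RE;

preg_match_all( $naive, $txt, $m );

// The first capture is cut off right after the backslash.
echo $m[1][0], "\n"; // prints: La quatrième année (.\
echo $m[1][1], "\n"; // prints: Next page
```

This sketch only shows that the end of the page text is lost; it does not
try to reproduce the empty page or the one-page shift seen on the wiki.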

I replaced the current regular expression with this one, and the problem is
now fixed:

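// Match one (page ...) block; the quoted text may contain backslash
// escapes such as \", which no longer terminate the page.
$reg =
"/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"((?>\\\\.|(?:(?!\\\\|\").)++)*?)\"\s*\)/s";
$txt = preg_replace( $reg, "<PAGE value=\"$1\" />", $txt );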

A note on the regular expression: it is an adaptation of the usual
expression for matching text between double quotes with backslash as the
escape character, which in Perl would be:
"((?>\\.|[^"\\]++)*?)". The rather ugly (but working) (?:(?!\\\\|\").)
corresponds to the trivial [^"\\], but the problem is that [^\"] and [^"]
are not really the same thing…
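
As a rough check, the sketch below applies the pattern from this report to a
hypothetical two-page sample modelled on the page 80 text. It also tries a
character-class variant kept in a nowdoc string, where PHP performs no escape
processing; that variant is an assumption added for illustration and has not
been tested inside MediaWiki.

```php
<?php
// Hypothetical two-page sample modelled on the page 80 text quoted above.
$txt = <<<'TXT'
(page 0 0 100 100 "La quatrième année (.\"),*)())")
(page 0 0 100 100 "Next page")
TXT;

// Pattern from this report, written as a double-quoted PHP string.
$reg = "/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"((?>\\\\.|(?:(?!\\\\|\").)++)*?)\"\s*\)/s";

// Assumed alternative: the same idea with a character class. The nowdoc
// passes every backslash to PCRE unchanged, so no PHP-level escaping applies.
$alt = <<<'RE'
/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"((?>\\.|[^"\\]++)*?)"\s*\)/s
RE;

$fixed    = preg_replace( $reg, "<PAGE value=\"$1\" />", $txt );
$fixedAlt = preg_replace( $alt, "<PAGE value=\"$1\" />", $txt );

// The escaped quote stays inside the first page instead of ending it.
echo $fixed, "\n";

// On this sample both patterns give the same result.
var_dump( $fixed === $fixedAlt );
```

On this input each page becomes its own <PAGE /> element, and the escaped
quote is kept as part of the first page's value.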

