Dan Caprioara created FOP-2880:
----------------------------------

             Summary: [PATCH] Hyphenated words are not searchable in readers
                 Key: FOP-2880
                 URL: https://issues.apache.org/jira/browse/FOP-2880
             Project: FOP
          Issue Type: Bug
          Components: unqualified
    Affects Versions: 2.3
            Reporter: Dan Caprioara


FOP renders hyphenated words using the hard hyphen character. This 
contradicts the PDF specification, which in section

14.8.2.2.3 Incidental Artifacts

clearly states that the soft hyphen (SHY, U+00AD) character should be used.

As a result, hyphenated words are not searchable, and copy/paste includes 
the hard hyphens instead of removing them and joining the word pieces 
together.
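
The difference can be illustrated with a minimal sketch, assuming a reader that drops U+00AD (and the line break after it) when extracting text but keeps a hard hyphen as ordinary content. The class and method names are hypothetical and are not part of FOP or of any PDF reader:

```java
public class SoftHyphenDemo {
    static final char SOFT_HYPHEN = '\u00AD';

    /**
     * Hypothetical model of a reader's text extraction: a soft hyphen and
     * the line break that follows it are dropped, everything else is kept.
     */
    static String extract(String rendered) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rendered.length(); i++) {
            char c = rendered.charAt(i);
            if (c == SOFT_HYPHEN) {
                // Skip the soft hyphen and, if present, the newline after it.
                if (i + 1 < rendered.length() && rendered.charAt(i + 1) == '\n') {
                    i++;
                }
                continue;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hard = "hyphen-\nation";        // hard hyphen at the line break
        String soft = "hyphen" + SOFT_HYPHEN + "\nation"; // soft hyphen instead

        System.out.println(extract(hard).contains("hyphenation")); // false
        System.out.println(extract(soft).contains("hyphenation")); // true
    }
}
```

With the hard hyphen, a search for "hyphenation" finds nothing in the extracted text; with the soft hyphen, the two pieces are joined and the word is found.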

Here is a small patch that can be applied to the FOP core project to fix 
this:

{code}
Index: src/main/java/org/apache/fop/fo/FOPropertyMapping.java
===================================================================
--- src/main/java/org/apache/fop/fo/FOPropertyMapping.java      (revision 190759)
+++ src/main/java/org/apache/fop/fo/FOPropertyMapping.java      (working copy)
@@ -1106,7 +1106,10 @@
         // hyphenation-character
         m  = new CharacterProperty.Maker(PR_HYPHENATION_CHARACTER);
         m.setInherited(true);
-        m.setDefault("-");
+        
+        // m.setDefault("-");
+        m.setDefault("\u00ad");
+        
         addPropertyMaker("hyphenation-character", m);
 
         // hyphenation-push-character-count
Index: src/main/java/org/apache/fop/render/pdf/PDFPainter.java
===================================================================
--- src/main/java/org/apache/fop/render/pdf/PDFPainter.java     (revision 190759)
+++ src/main/java/org/apache/fop/render/pdf/PDFPainter.java     (working copy)
@@ -420,7 +420,8 @@
             PDFStructElem structElem = (PDFStructElem) getContext().getStructureTreeElement();
             languageAvailabilityChecker.checkLanguageAvailability(text);
             MarkedContentInfo mci = logicalStructureHandler.addTextContentItem(structElem);
-            String actualText = getContext().isHyphenated() ? text.substring(0, text.length() - 1) : null;
+//            String actualText = getContext().isHyphenated() ? text.substring(0, text.length() - 1) : null;
+            String actualText = null;
             generator.endTextObject();
             generator.updateColor(state.getTextColor(), true, null);
             generator.beginTextObject(mci.tag, mci.mcid, actualText);
@@ -490,6 +491,14 @@
             float glyphAdjust = 0;
             if (font.hasCodePoint(orgChar)) {
                 ch = font.mapCodePoint(orgChar);
+                if (orgChar == '\u00ad') {
+                    // Map it back to the SHY, the hard hyphen is not correct, causes the hyphenated words not being searchable.
+                    // See the PDF Spec: 14.8.2.2.3 Incidental Artifacts / Hyphenation paragraph.
+
+                    // The ansi encoding CodePointMapping has the hyphenation char with two entries,
+                    // the first is selected, the hard hyphen. Reverting...
+                    ch = orgChar;
+                }
                 ch = selectAndMapSingleByteFont(tf, fontName, fontSize, textutil, ch);
                 if ((wordSpacing != 0) && CharUtilities.isAdjustableSpace(orgChar)) {
                     glyphAdjust += wordSpacing;

{code}
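
The comment in the last hunk describes an encoding table in which the hyphen has two entries, with the first one winning. A minimal sketch of that effect follows, assuming a simplified table in which U+002D and U+00AD share one glyph code. The class, table, and methods are hypothetical and do not reproduce FOP's CodePointMapping:

```java
public class FirstEntryWinsDemo {
    // Hypothetical table of {glyph code, Unicode code point} pairs.
    // Both the hard hyphen and the soft hyphen share glyph code 0x2D here.
    static final int[][] TABLE = {
        {0x2D, 0x002D}, // HYPHEN-MINUS (hard hyphen), listed first
        {0x2D, 0x00AD}, // SOFT HYPHEN, listed second
    };

    /** Unicode code point to glyph code, or -1 if the table has no entry. */
    static int encode(int codePoint) {
        for (int[] row : TABLE) {
            if (row[1] == codePoint) {
                return row[0];
            }
        }
        return -1;
    }

    /** Glyph code back to Unicode; the first matching entry wins. */
    static int decode(int glyphCode) {
        for (int[] row : TABLE) {
            if (row[0] == glyphCode) {
                return row[1];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int code = encode(0x00AD);  // soft hyphen goes in
        int back = decode(code);    // hard hyphen comes out
        System.out.println(String.format("%04X", back)); // 002D
    }
}
```

Under this assumption the soft hyphen does not survive the round trip, which is why the patch sets ch back to orgChar for U+00AD.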





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
