[Bug 53146] Whitespace and Hyphenation with Zero Width Space
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

Glenn Adams gad...@apache.org changed:

What  |Removed |Added
Status|RESOLVED|CLOSED

--- Comment #5 from Glenn Adams gad...@apache.org ---
batch transition resolved+invalid to closed+invalid

--
You are receiving this mail because:
You are the assignee for the bug.
DO NOT REPLY [Bug 53146] Whitespace and Hyphenation with Zero Width Space
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

Vincent Hennebert vhenneb...@gmail.com changed:

What      |Removed |Added
Status    |NEW     |RESOLVED
Resolution|        |INVALID

--- Comment #4 from Vincent Hennebert vhenneb...@gmail.com 2012-04-26 14:13:00 UTC ---
Hi Thomas,

you didn't actually specify text-align=justify on the second block. If you do, you get the same behaviour as in the first block, which is expected since the hyphen is explicit: a break after it is always allowed, even if hyphenation has been set to false.

As to the fact that, when text-align is not set to justify, a break is made after the first /, I believe that is compliant with the line-breaking rules defined in Unicode UAX #14: http://www.unicode.org/reports/tr14/#SY

To prevent that, you would have to define a rule that says something like "Do not break between a solidus and a letter if that solidus is preceded by white space". I believe this is outside the scope of UAX #14. In that case, I think the solution is to put a Word Joiner (U+2060) between the first / and the rest of the URL.

Closing this bug as invalid; feel free to re-open it if you don't agree with the analysis.

Vincent
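For illustration, here is a minimal Python sketch of the workaround Vincent suggests: a Word Joiner (U+2060) after the leading / and a Zero Width Space (U+200B) after every later /. The function name mark_path_breaks and the sample path are assumptions made up for this example, not taken from the attached FO file, and whether a given formatter honours both characters would need to be checked.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: marks a permitted break
WJ = "\u2060"    # WORD JOINER: prohibits a break at this position


def mark_path_breaks(path: str) -> str:
    """Allow a break after every '/' except one at the very start.

    Hypothetical helper for illustration only.
    """
    out = []
    for index, ch in enumerate(path):
        out.append(ch)
        if ch == "/":
            # Glue a leading '/' to what follows; allow breaks elsewhere.
            out.append(WJ if index == 0 else ZWSP)
    return "".join(out)


# Sample path chosen for the example (an assumption, not from the bug report).
marked = mark_path_breaks("/usr/share/doc")
```

The marked string can then be placed in the fo:block text in place of the plain path.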
DO NOT REPLY [Bug 53146] New: Whitespace and Hyphenation with Zero Width Space
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

Bug #: 53146
Summary: Whitespace and Hyphenation with Zero Width Space
Product: Fop
Version: 1.0
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: general
AssignedTo: fop-dev@xmlgraphics.apache.org
ReportedBy: tom_s...@web.de
Classification: Unclassified

Created attachment 28677
  -- https://issues.apache.org/bugzilla/attachment.cgi?id=28677
FO file with 2 fo:blocks, text and long URL

See the attached FO file. It contains two fo:blocks, each with text and a long Unix file path. The only differences between them are a Zero Width Space (ZWS, U+200B) at appropriate locations (here: after the /) and the hyphenate attribute. The ZWS is used to indicate possible break opportunities in case hyphenate=false.

However, in the 2nd block, the beginning of the path /usr/... is broken into /, a line break, and usr/... although hyphenate=false. I'm not sure if this is the correct behavior. Also compare the word spacing in the 1st block.

I would have expected no line break between / and usr and a more balanced space between words (although it might be technically problematic). :)

I'll also upload two PDFs, one built with FOP, the other built with XEP.
DO NOT REPLY [Bug 53146] Whitespace and Hyphenation with Zero Width Space
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

--- Comment #1 from Thomas Schraitle tom_s...@web.de 2012-04-25 11:55:22 UTC ---
Created attachment 28678
  -- https://issues.apache.org/bugzilla/attachment.cgi?id=28678
FO file built with FOP 1.0
DO NOT REPLY [Bug 53146] Whitespace and Hyphenation with Zero Width Space
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

--- Comment #2 from Thomas Schraitle tom_s...@web.de 2012-04-25 11:56:08 UTC ---
Created attachment 28679
  -- https://issues.apache.org/bugzilla/attachment.cgi?id=28679
FO file built with XEP 4.19 build 20110414
DO NOT REPLY [Bug 53146] Whitespace and Hyphenation with Zero Width Space
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

Glenn Adams gad...@apache.org changed:

What    |Removed |Added
Priority|P2      |P3

--- Comment #3 from Glenn Adams gad...@apache.org 2012-04-25 15:26:57 UTC ---
lower priority since no patch is available
Re: zero width space
Manuel Mall wrote:
>> What about character composition/decomposition?
> Good question? Where is the answer?

Let's clarify the problem first. Let's say the input contains the sequence U+0061 U+0308 (Latin small letter a, combining diaeresis), and the font has a glyph for U+00E4 but not for U+0308. Obviously, putting the precomposed character U+00E4 into the output is a smart move. Where should this transformation occur: output generation, renderer, or layout stage? A slight problem is that the width of U+00E4 may be different from that of U+0061.

J.Pietschmann
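To make the composition case concrete: U+0061 U+0308 composes to U+00E4 under Unicode normalization form NFC. The sketch below uses Python's standard unicodedata module; the helper prefer_precomposed and the glyph sets are hypothetical stand-ins for a real font lookup, and it does not answer the question of which stage should do this.

```python
import unicodedata

decomposed = "a\u0308"  # U+0061 U+0308
composed = unicodedata.normalize("NFC", decomposed)  # U+00E4


def prefer_precomposed(text, font_has_glyph):
    """Return the NFC form if the font covers every composed character.

    font_has_glyph is a hypothetical callback standing in for a font query.
    """
    candidate = unicodedata.normalize("NFC", text)
    if all(font_has_glyph(ch) for ch in candidate):
        return candidate
    return text


# Hypothetical font coverage: precomposed glyph present, combining mark absent.
font_a = {"a", "\u00e4"}
# Hypothetical font coverage: only the base letter and the combining mark.
font_b = {"a", "\u0308"}

result_a = prefer_precomposed(decomposed, lambda ch: ch in font_a)
result_b = prefer_precomposed(decomposed, lambda ch: ch in font_b)
```

Since the precomposed glyph can have a different advance width than the base letter, any such substitution would have to happen before widths are measured.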
Re: zero width space
Manuel Mall wrote:
> With respect to U+200B it says in [snip] It therefore surprises me that you imply U+200B may expand in justification.

The Unicode 3.0 book explicitly mentions that ZWS may be expanded for justification, to my great surprise. The 2.0 book doesn't have any remarks in this direction. I don't have access to a book more recent than 3.0. Maybe they changed their mind (again...).

> Thanks for that list. With respect to the issue at hand, that is, which codepoints should be given to the renderers, it seems there are 3 types: ... 2. Those we never give to the renderers, e.g. Soft Hyphen (it's either suppressed or replaced by the proper hyphen), zero-width joiners, ...

In case of the hypothetical HTML renderer, you *want* to pass all these characters to the renderer.

> Is that a sensible grouping?

Dunno. What about character composition/decomposition?

J.Pietschmann
Re: zero width space
On Fri, 4 Nov 2005 04:46 am, J.Pietschmann wrote:
> Manuel Mall wrote:
>> With respect to U+200B it says in [snip] It therefore surprises me that you imply U+200B may expand in justification.
> The Unicode 3.0 book explicitly mentions that ZWS may be expanded for justification, to my great surprise. The 2.0 book doesn't have any remarks in this direction. I don't have access to a book more recent than 3.0. Maybe they changed their mind (again...).

Anyone out there who has the 4.0 book and can shed some light on this?

>> Thanks for that list. With respect to the issue at hand, that is, which codepoints should be given to the renderers, it seems there are 3 types: ... 2. Those we never give to the renderers, e.g. Soft Hyphen (it's either suppressed or replaced by the proper hyphen), zero-width joiners, ...
> In case of the hypothetical HTML renderer, you *want* to pass all these characters to the renderer.

I would see an HTML renderer more like the RTF renderer, which bypasses all the LayoutManager logic and is really only a 'simple' conversion. That is, XSL-FO formatting instructions are translated into HTML/CSS (or RTF) formatting instructions, but no actual layout is performed (no page breaking, line breaking and the like). And yes, for those types of renderers all text would need to be preserved.

>> Is that a sensible grouping?
> Dunno. What about character composition/decomposition?

Good question? Where is the answer?

> J.Pietschmann

Manuel
Re: zero width space
On Fri, 2005-11-04 at 11:55 +0800, Manuel Mall wrote:
> On Fri, 4 Nov 2005 04:46 am, J.Pietschmann wrote:
>> Manuel Mall wrote:
>>> With respect to U+200B it says in [snip] It therefore surprises me that you imply U+200B may expand in justification.
>> The Unicode 3.0 book explicitly mentions that ZWS may be expanded for justification, to my great surprise. The 2.0 book doesn't have any remarks in this direction. I don't have access to a book more recent than 3.0. Maybe they changed their mind (again...).
> Anyone out there who has the 4.0 book and can shed some light on this?

It says that U+200B normally has no effect on letter spacing in most scripts, but only indicates a word boundary (and therefore a possible line break). It also mentions that when letter-spacing Thai it may grow to have a non-zero width, but that is the exception. (Thai apparently doesn't put spaces between words, and uses U+200B as a word separator.)

--
Peter S. Housel [EMAIL PROTECTED]
Re: zero width space
Manuel Mall wrote:
> That seems to be the consensus, that is, consider ZWS for line breaking but then discard it and don't give it to the renderers.

Renderers could deal with ZWS if the font had a glyph for this character; unfortunately, that's not the case for the PDF standard fonts :-) Some fonts *do* have glyphs for various Unicode space characters, notably the fixed width spaces.

This leads to the question: Is a space a character? What *is* a character? The Unicode people had endless discussions about this. Spaces are exactly in the gray area between real characters which leave marks and layout control.

Handling space characters in layout and discarding them before rendering has the distinct advantage that they work for any font in any renderer (which can handle variable space areas properly, of course). OTOH, renderers which output a format that can handle the spaces itself, like a hypothetical HTML renderer, would better get the original character.

> Are there any other (unusual Unicode) characters which fall in the same category, that is, they influence layout decisions but should not be seen by the renderers?

* Unicode spaces
  + variable width spaces
    - ordinary space U+0020
    - ordinary non-breaking space U+00A0
  + fixed width spaces; potentially available in fonts and *may* be passed to renderers, *except* for U+200B
    - zero width space U+200B, may expand in justification (not implemented this way in FOP 0.20.5, which will haunt us)
    - zero width non breaking space, aka byte order mark U+FEFF, should now only be used as BOM (as the BOM is eaten by the XML parser, FOP could emit a deprecation warning)
    - en quad U+2000, according to my Unicode book *identical* to U+2002, *not* a 4en space (strange)
    - em quad U+2001, similar to U+2000
    - en space aka nut U+2002
    - em space aka mutton U+2003
    - three-per-em space aka thick space (1/3 em width) U+2004
    - four-per-em space aka mid space (1/4 em width) U+2005
    - six-per-em space (generally 1/6 em width) U+2006
    - figure space (font dependent) U+2007
    - punctuation space (as wide as a dot or comma) U+2008
    - thin space (1/5..1/8 em width) U+2009
    - hair space (1/10..1/16 em width) U+200A
    - narrow no-break space (probably 1/6 em width) U+202F
    - mathematical space U+205F
    - non breaking word joiner U+2060, replaces U+FEFF in text
    - ideographic space U+3000
    - OGHAM SPACE MARK U+1680 (odd stuff)
    - Note: ETHIOPIC WORDSPACE U+1361 leaves marks and is therefore not a space. At least I hope so.
  + see also
    http://en.wikipedia.org/wiki/Space_character
    http://www.alistapart.com/stories/emen/
* Other characters
  + Character shaping hints; they do not cause line breaks.
    - zero width joiner U+200D
    - zero width non-joiner U+200C (may probably also hint at preventing ligatures)
    - see http://en.wikipedia.org/wiki/Zero-width_joiner et al.
  + Soft hyphen U+00AD. Must be hidden if no line break follows.
  + Formatting characters. I'd say these characters should not occur in XSLFO source, because there are FOs which represent the same functionality.
    - line separator U+2028, FOP 0.20.5 creates an unconditional line break regardless of any FO properties
    - paragraph separator U+2029
    - bidi control characters 200E-200F, 202A-202E
    - deprecated controls 206A-206F

J.Pietschmann
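One way to read the grouping above is as a lookup from code point to handling class. The following Python sketch encodes only part of the list; the function name classify and the category labels are made up for this example and do not describe how FOP actually treats these characters.

```python
def classify(cp: int) -> str:
    """Map a code point to a handling class from the list above.

    Hypothetical labels; only a subset of the listed characters is covered.
    """
    if cp == 0x200B:
        return "break-opportunity"   # zero width space
    if cp == 0x00AD:
        return "soft-hyphen"         # hidden unless a line break follows
    if cp in (0x200C, 0x200D):
        return "shaping-hint"        # zero width non-joiner / joiner
    if (
        cp in (0x2028, 0x2029)       # line / paragraph separator
        or 0x200E <= cp <= 0x200F    # bidi marks
        or 0x202A <= cp <= 0x202E    # bidi embedding and override controls
        or 0x206A <= cp <= 0x206F    # deprecated controls
    ):
        return "formatting-control"
    if 0x2000 <= cp <= 0x200A:
        return "fixed-width-space"   # en quad through hair space
    return "ordinary"


classes = [classify(ord(ch)) for ch in "a\u2003\u200b\u00ad\u200d\u2028"]
```

A layout stage could use such a table to decide which characters are consumed during line building and which are handed on to a renderer.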
zero width space
Currently if one puts a zero-width-space (U+200B) into an XSL-FO file (or specifies linefeed-treatment=treat-as-zero-width-space) it is rendered as a missing character in PDF. Is that correct, i.e. does this character have to exist in the font used or should the formatter or renderer simply remove this character? It is the second approach that both AntennaHouse and RenderX appear to have chosen. Manuel
Re: zero width space
Manuel Mall wrote:
> Currently if one puts a zero-width-space (U+200B) into an XSL-FO file (or specifies linefeed-treatment=treat-as-zero-width-space) it is rendered as a missing character in PDF. Is that correct, i.e. does this character have to exist in the font used or should the formatter or renderer simply remove this character? It is the second approach that both AntennaHouse and RenderX appear to have chosen.

I recommend that no character is output for a ZWS. The whole purpose of placing a ZWS into the input XSL-FO is to give layout an extra break opportunity, without changing the appearance of the generated document.

Chris
Re: zero width space
On Nov 1, 2005, at 15:52, Manuel Mall wrote:
> Currently if one puts a zero-width-space (U+200B) into an XSL-FO file (or specifies linefeed-treatment=treat-as-zero-width-space) it is rendered as a missing character in PDF. Is that correct, i.e. does this character have to exist in the font used or should the formatter or renderer simply remove this character? It is the second approach that both AntennaHouse and RenderX appear to have chosen.

It's certainly not correct to render a missing glyph character, but it would also be wrong to remove it too early. The character doesn't take part in white-space treatment/collapsing, since it's not XML whitespace. It's somewhere in layout that the decision has to be made not to allocate space for this character, but it could play a part in line-building...

My two cents.

Cheers,
Andreas
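As a small check of the point that U+200B is not XML white space: XML defines white space as only U+0020, U+0009, U+000D and U+000A. The Python sketch below collapses runs of those four characters and leaves U+200B alone. The function collapse_xml_whitespace is a simplified, hypothetical stand-in for this example and not FOP's white-space handling.

```python
import re

ZWSP = "\u200b"

# The four characters XML treats as white space: space, tab, CR, LF.
XML_WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def collapse_xml_whitespace(text: str) -> str:
    """Collapse each run of XML white space to a single space (simplified)."""
    return XML_WHITESPACE_RUN.sub(" ", text)


sample = "a  " + ZWSP + "  b"
collapsed = collapse_xml_whitespace(sample)
```

The zero width space survives collapsing and splits the surrounding spaces into two separate runs, so it is still available later as a break opportunity during line building.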
Re: zero width space
On Wed, 2 Nov 2005 01:03 am, Chris Bowditch wrote:
> Manuel Mall wrote:
>> Currently if one puts a zero-width-space (U+200B) into an XSL-FO file (or specifies linefeed-treatment=treat-as-zero-width-space) it is rendered as a missing character in PDF. Is that correct, i.e. does this character have to exist in the font used or should the formatter or renderer simply remove this character? It is the second approach that both AntennaHouse and RenderX appear to have chosen.
> I recommend that no character is output for a ZWS. The whole purpose of placing a ZWS into the input XSL-FO is to give layout an extra break opportunity, without changing the appearance of the generated document.

That seems to be the consensus, that is, consider ZWS for line breaking but then discard it and don't give it to the renderers.

Are there any other (unusual Unicode) characters which fall in the same category, that is, they influence layout decisions but should not be seen by the renderers?

> Chris

Manuel
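The "use it for line breaking, then discard it" idea can be sketched as a toy greedy line breaker in Python. It is an illustration under loud assumptions: width is counted in characters rather than taken from font metrics, only spaces and U+200B count as break opportunities, and the function name break_lines and the sample text are hypothetical. It is not how FOP's line layout works.

```python
ZWSP = "\u200b"


def break_lines(text: str, width: int) -> list:
    """Greedy line breaking; U+200B allows a break but is never emitted."""
    # Split into pieces, each ending at a break opportunity.
    pieces = []
    current = ""
    for ch in text:
        if ch == ZWSP:
            pieces.append(current)        # break allowed, character dropped
            current = ""
        elif ch == " ":
            pieces.append(current + " ")  # break allowed after the space
            current = ""
        else:
            current += ch
    pieces.append(current)

    lines = []
    line = ""
    for piece in pieces:
        # Trailing spaces do not count towards the line width.
        if line and len((line + piece).rstrip()) > width:
            lines.append(line.rstrip())
            line = ""
        line += piece
    if line:
        lines.append(line.rstrip())
    return lines


# Sample text made up for the example.
lines = break_lines("see /usr/" + ZWSP + "share/" + ZWSP + "doc", 10)
```

The renderer-facing output contains no U+200B at all, so no glyph for it is ever requested from the font.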