[Bug 53146] Whitespace and Hyphenation with Zero Width Space

2012-04-29 Thread bugzilla
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

Glenn Adams gad...@apache.org changed:

    What   | Removed  | Added
    -------+----------+-------
    Status | RESOLVED | CLOSED

--- Comment #5 from Glenn Adams gad...@apache.org ---
batch transition resolved+invalid to closed+invalid

-- 
You are receiving this mail because:
You are the assignee for the bug.


DO NOT REPLY [Bug 53146] Whitespace and Hyphenation with Zero Width Space

2012-04-26 Thread bugzilla
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

Vincent Hennebert vhenneb...@gmail.com changed:

    What       | Removed | Added
    -----------+---------+----------
    Status     | NEW     | RESOLVED
    Resolution |         | INVALID

--- Comment #4 from Vincent Hennebert vhenneb...@gmail.com 2012-04-26 14:13:00 UTC ---
Hi Thomas,

You didn't actually specify text-align=justify on the second block. If you do,
you get the same behaviour as in the first block, which is expected: the
hyphen is explicit, so a break after it is always allowed, even when
hyphenation has been set to false.

As for the break made after the first / when text-align is not set to
justify, I believe it is compliant with the line-breaking rules defined in
Unicode UAX #14:
http://www.unicode.org/reports/tr14/#SY

In order to prevent that, you would have to define a rule that says something
like "Do not break between a solidus and a letter if that solidus is preceded
by white space". I believe this is outside the scope of UAX #14.

In that case, I think the solution is to put a Word Joiner (U+2060) between the
first / and the rest of the URL.
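
A minimal sketch of that workaround in Python, assuming the path is
preprocessed before it is written into the FO file. The helper name
protect_path is hypothetical and not part of FOP. It puts a Word Joiner
(U+2060) after the leading / and a Zero Width Space (U+200B) after every
later /. Whether a given formatter honours the Word Joiner at that position
still has to be checked against its output.

```python
WORD_JOINER = "\u2060"       # U+2060, prohibits a line break at its position
ZERO_WIDTH_SPACE = "\u200b"  # U+200B, allows a line break at its position


def protect_path(path: str) -> str:
    """Hypothetical helper: glue a leading '/' to the first path segment
    and offer a break opportunity after every later '/'."""
    if not path.startswith("/"):
        # Relative path: there is no leading solidus to protect.
        return path.replace("/", "/" + ZERO_WIDTH_SPACE)
    rest = path[1:].replace("/", "/" + ZERO_WIDTH_SPACE)
    return "/" + WORD_JOINER + rest


if __name__ == "__main__":
    protected = protect_path("/usr/share/doc")
    # List the invisible code points that were inserted.
    print([f"U+{ord(c):04X}" for c in protected if not c.isascii()])
```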

Closing this bug as invalid, feel free to re-open it if you don't agree with
the analysis.

Vincent



DO NOT REPLY [Bug 53146] New: Whitespace and Hyphenation with Zero Width Space

2012-04-25 Thread bugzilla
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

 Bug #: 53146
   Summary: Whitespace and Hyphenation with Zero Width Space
   Product: Fop
   Version: 1.0
  Platform: PC
OS/Version: Linux
Status: NEW
  Severity: normal
  Priority: P2
 Component: general
AssignedTo: fop-dev@xmlgraphics.apache.org
ReportedBy: tom_s...@web.de
Classification: Unclassified


Created attachment 28677
  -- https://issues.apache.org/bugzilla/attachment.cgi?id=28677
FO file with 2 fo:blocks, text and long URL

See the attached FO file. It contains two fo:blocks, each holding text with a
long Unix file path.

The only differences are a Zero Width Space (ZWS, U+200B) at appropriate
locations (here: after the /) and the hyphenate attribute. The ZWS is used
to indicate possible break opportunities in case hyphenate=false.

However, in the 2nd block, the beginning of the path /usr/... is broken into
/, a line break, and usr/... although hyphenate=false. I'm not sure if this
is the correct behavior. Also compare the word spacing in the 1st block.

I would have expected no line break between / and usr and a more balanced
space between words (although it might be technically problematic). :)

I'll also upload two PDFs, one built with FOP, the other built with XEP.



DO NOT REPLY [Bug 53146] Whitespace and Hyphenation with Zero Width Space

2012-04-25 Thread bugzilla
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

--- Comment #1 from Thomas Schraitle tom_s...@web.de 2012-04-25 11:55:22 UTC ---
Created attachment 28678
  -- https://issues.apache.org/bugzilla/attachment.cgi?id=28678
FO file built with FOP 1.0



DO NOT REPLY [Bug 53146] Whitespace and Hyphenation with Zero Width Space

2012-04-25 Thread bugzilla
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

--- Comment #2 from Thomas Schraitle tom_s...@web.de 2012-04-25 11:56:08 UTC ---
Created attachment 28679
  -- https://issues.apache.org/bugzilla/attachment.cgi?id=28679
FO file built with XEP 4.19 build 20110414



DO NOT REPLY [Bug 53146] Whitespace and Hyphenation with Zero Width Space

2012-04-25 Thread bugzilla
https://issues.apache.org/bugzilla/show_bug.cgi?id=53146

Glenn Adams gad...@apache.org changed:

    What     | Removed | Added
    ---------+---------+------
    Priority | P2      | P3

--- Comment #3 from Glenn Adams gad...@apache.org 2012-04-25 15:26:57 UTC ---
lower priority since no patch is available



Re: zero width space

2005-11-04 Thread J.Pietschmann

Manuel Mall wrote:

  What about character composition/decomposition?

 Good question? Where is the answer?


Let's clarify the problem first. Let's say the input contains
the sequence U+0061 U+0308 (Latin small letter a, combining diaeresis),
and the font has a glyph for U+00E4 but not for U+0308. Obviously,
putting the precomposed character U+00E4 into the output is
a smart move. Where should this transformation occur: output
generation, renderer, layout stage? A slight problem is that
the width of U+00E4 may be different from U+0061.
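
The transformation itself can be sketched in a few lines of Python using the
standard unicodedata module. The helper name compose_if_supported and the
idea of describing a font as a plain set of code points are assumptions made
for illustration only; the sketch says nothing about which FOP stage should
do this, and it ignores the width question raised above.

```python
import unicodedata
from typing import Set


def compose_if_supported(text: str, font_code_points: Set[int]) -> str:
    """Hypothetical helper: use the precomposed (NFC) form only when the
    font covers every code point of that form; otherwise keep the input."""
    composed = unicodedata.normalize("NFC", text)
    if all(ord(ch) in font_code_points for ch in composed):
        return composed
    return text


if __name__ == "__main__":
    decomposed = "\u0061\u0308"      # a + combining diaeresis
    font = {0x0061, 0x00E4}          # assumed coverage: a and a-diaeresis
    result = compose_if_supported(decomposed, font)
    print([f"U+{ord(c):04X}" for c in result])
```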

J.Pietschmann


Re: zero width space

2005-11-03 Thread J.Pietschmann

Manuel Mall wrote:
 With respect to U+200B it says in

 [snip]
 It therefore surprises me that you imply U+200B may expand in
 justification.


The Unicode 3.0 book explicitly mentions that ZWS may be expanded
for justification, to my great surprise. The 2.0 book doesn't have
any remarks in this direction. I don't have access to a book more
recent than 3.0. Maybe they changed their mind (again...).

 Thanks for that list. With respect to the issue at hand, that is, which
 codepoints should be given to the renderers, it seems there are 3 types:

 ...
 2. Those we never give to the renderers, e.g. Soft Hyphen (it's either
 suppressed or replaced by the proper hyphen), zero-width joiners, ...


In case of the hypothetical HTML renderer, you *want* to pass all these
characters to the renderer.


 Is that a sensible grouping?


Dunno.
What about character composition/decomposition?


J.Pietschmann


Re: zero width space

2005-11-03 Thread Manuel Mall
On Fri, 4 Nov 2005 04:46 am, J.Pietschmann wrote:
 Manuel Mall wrote:
  With respect to U+200B it says in

 [snip]

  It therefore surprises me that you imply U+200B may expand in
  justification.

 The Unicode 3.0 book explicitly mentions that ZWS may be expanded
 for justification, to my great surprise. The 2.0 book doesn't have
 any remarks in this direction. I don't have access to a book more
 recent than 3.0. Maybe they changed their mind (again...).

Anyone out there who has the 4.0 book and can shed some light on this?

  Thanks for that list. With respect to the issue at hand, that is,
  which codepoints should be given to the renderers, it seems there
  are 3 types:

 ...

  2. Those we never give to the renderers, e.g. Soft Hyphen (it's
  either suppressed or replaced by the proper hyphen), zero-width
  joiners, ...

 In case of the hypothetical HTML renderer, you *want* to pass all
 these characters to the renderer.

I would see an HTML renderer as more like the RTF renderer, which bypasses
all the LayoutManager logic and is really only a 'simple' conversion.
That is, XSL-FO formatting instructions are translated into HTML/CSS
(or RTF) formatting instructions, but no actual layout is performed
(no page breaking, line breaking and the like). And yes, for those
types of renderers all text would need to be preserved.


  Is that a sensible grouping?

 Dunno.
 What about character composition/decomposition?

Good question? Where is the answer?


 J.Pietschmann
Manuel


Re: zero width space

2005-11-03 Thread Peter S. Housel
On Fri, 2005-11-04 at 11:55 +0800, Manuel Mall wrote:
 On Fri, 4 Nov 2005 04:46 am, J.Pietschmann wrote:
  Manuel Mall wrote:
   With respect to U+200B it says in
 
  [snip]
 
   It therefore surprises me that you imply U+200B may expand in
   justification.
 
  The Unicode 3.0 book explicitly mentions that ZWS may be expanded
  for justification, to my great surprise. The 2.0 book doesn't have
  any remarks in this direction. I don't have access to a book more
  recent than 3.0. Maybe they changed their mind (again...).
 
 Anyone out there who has the 4.0 book and can shed some light on this?

It says that U+200B normally has no effect on letter spacing in most
scripts, but only indicates a word boundary (and therefore a possible
line break).  It also mentions that when letter-spacing Thai it may grow
to have a non-zero width, but that is the exception. (Thai apparently
doesn't put spaces between words, and uses U+200B as a word separator.)

-- 
Peter S. Housel


Re: zero width space

2005-11-02 Thread J.Pietschmann

Manuel Mall wrote:
 That seems to be the consensus: consider ZWS for line breaking,
 but then discard it and don't give it to the renderers.



Renderers could deal with ZWS if the font had a glyph for
this character; unfortunately, that's not the case for the PDF
standard fonts :-) Some fonts *do* have glyphs for various Unicode
space characters, notably the fixed width spaces.

This leads to the question: Is a space a character? What *is* a
character? The Unicode people had endless discussions about this.
Spaces sit exactly in the gray area between real characters,
which leave marks, and layout control.

Handling space characters in layout and discarding them before
rendering has the distinctive advantage that they work for
any font in any renderer (which can handle variable space areas
properly, of course). OTOH, renderers which output a format which
can handle the spaces itself, like a hypothetical HTML renderer,
had better get the original character.
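
To make the "handle in layout, discard before rendering" idea concrete, here
is a rough Python sketch. It is not how FOP's layout managers work; the
function name split_for_layout and the return shape (renderer text plus a
list of break offsets) are assumptions chosen for illustration.

```python
from typing import List, Tuple

ZERO_WIDTH_SPACE = "\u200b"  # U+200B


def split_for_layout(text: str) -> Tuple[str, List[int]]:
    """Hypothetical helper: strip every ZWS from the text and remember,
    as offsets into the stripped text, where a line break is allowed."""
    kept: List[str] = []
    break_offsets: List[int] = []
    for ch in text:
        if ch == ZERO_WIDTH_SPACE:
            # A break may occur before the next kept character.
            break_offsets.append(len(kept))
        else:
            kept.append(ch)
    return "".join(kept), break_offsets


if __name__ == "__main__":
    renderer_text, breaks = split_for_layout("/usr/\u200bshare/\u200bdoc")
    print(renderer_text, breaks)
```

Layout would consult the offsets when building lines, while the renderer only
ever sees the stripped text, so no glyph for U+200B is needed in the font.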

 Are there any other (unusual Unicode) characters which fall in the same
 category, that is, they influence layout decisions but should not be seen
 by the renderers?


* Unicode spaces
 + variable width spaces
   - ordinary space U+0020
   - ordinary non-breaking space U+00A0
 + fixed width spaces; potentially available in fonts and *may*
   be passed to renderers, *except* for U+200B
   - zero width space U+200B, may expand in justification (not
     implemented this way in FOP 0.20.5, which will haunt us)
   - zero width non breaking space, aka byte order mark U+FEFF,
     should now only be used as BOM (as the BOM is eaten by the
     XML parser, FOP could emit a deprecation warning)
   - en quad U+2000, according to my Unicode book *identical* to
     U+2002, *not* a 4en space (strange)
   - em quad U+2001, similar to U+2000
   - en space aka nut U+2002
   - em space aka mutton U+2003
   - three-per-em space aka thick space (1/3 em width) U+2004
   - four-per-em space aka mid space (1/4 em width) U+2005
   - six-per-em space (generally 1/6 em width) U+2006
   - figure space (font dependent) U+2007
   - punctuation space (as wide as a dot or comma) U+2008
   - thin space (1/5..1/8 em width) U+2009
   - hair space (1/10..1/16 em width) U+200A
   - narrow no-break space (probably 1/6 em width) U+202F
   - mathematical space U+205F
   - non breaking word joiner U+2060, replaces U+FEFF in text
   - ideographic space U+3000
   - OGHAM SPACE MARK U+1680 (odd stuff)
   - Note: ETHIOPIC WORDSPACE U+1361 leaves marks and is therefore
     not a space. At least I hope so.
 + see also
   http://en.wikipedia.org/wiki/Space_character
   http://www.alistapart.com/stories/emen/

* Other characters
 + Character shaping hints; they do not cause line breaks.
   - zero width joiner U+200D
   - zero width non-joiner U+200C (may probably also hint at
     preventing ligatures)
   - see http://en.wikipedia.org/wiki/Zero-width_joiner et al.
 + Soft hyphen U+00AD. Must be hidden if no line break follows.
 + Formatting characters. I'd say these characters should not occur
   in XSL-FO source, because there are FOs which represent the same
   functionality.
   - line separator U+2028, FOP 0.20.5 creates an unconditional line
     break regardless of any FO properties
   - paragraph separator U+2029
   - bidi control characters U+200E-U+200F, U+202A-U+202E
   - deprecated controls U+206A-U+206F
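
The lists above could be turned into a lookup along the following lines. This
is only one reading of the grouping, written in Python for brevity; the
category names, the function classify and the choice to put the joiners, the
soft hyphen and the word joiner into a single "layout-only" bucket are
assumptions, not FOP behaviour.

```python
VARIABLE_WIDTH_SPACES = {0x0020, 0x00A0}

# U+2000..U+200A plus the narrow no-break, mathematical and ideographic spaces.
FIXED_WIDTH_SPACES = set(range(0x2000, 0x200B)) | {0x202F, 0x205F, 0x3000}

# Characters that steer layout or shaping but should leave no mark themselves.
LAYOUT_ONLY = {0x00AD, 0x200B, 0x200C, 0x200D, 0x2060}

# Line/paragraph separators, bidi controls, deprecated controls.
FORMATTING_CONTROLS = (
    {0x2028, 0x2029}
    | set(range(0x200E, 0x2010))
    | set(range(0x202A, 0x202F))
    | set(range(0x206A, 0x2070))
)


def classify(ch: str) -> str:
    """Hypothetical helper: map one character to a coarse category."""
    cp = ord(ch)
    if cp in VARIABLE_WIDTH_SPACES:
        return "variable-width space"
    if cp in FIXED_WIDTH_SPACES:
        return "fixed-width space"
    if cp in LAYOUT_ONLY:
        return "layout-only"
    if cp in FORMATTING_CONTROLS:
        return "formatting control"
    return "ordinary"


if __name__ == "__main__":
    for sample in (" ", "\u2003", "\u200b", "\u202e", "a"):
        print(f"U+{ord(sample):04X}", classify(sample))
```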


J.Pietschmann


zero width space

2005-11-01 Thread Manuel Mall
Currently if one puts a zero-width-space (U+200B) into an XSL-FO file 
(or specifies linefeed-treatment=treat-as-zero-width-space) it is 
rendered as a missing character in PDF. Is that correct, i.e. does 
this character have to exist in the font used or should the formatter 
or renderer simply remove this character? It is the second approach 
that both AntennaHouse and RenderX appear to have chosen.

Manuel


Re: zero width space

2005-11-01 Thread Chris Bowditch

Manuel Mall wrote:

Currently if one puts a zero-width-space (U+200B) into an XSL-FO file 
(or specifies linefeed-treatment=treat-as-zero-width-space) it is 
rendered as a missing character in PDF. Is that correct, i.e. does 
this character have to exist in the font used or should the formatter 
or renderer simply remove this character? It is the second approach 
that both AntennaHouse and RenderX appear to have chosen.


I recommend that no character is output for a ZWS. The whole purpose of 
placing a ZWS into the input XSL-FO is to give layout an extra break 
opportunity, without changing the appearance of the generated document.


Chris




Re: zero width space

2005-11-01 Thread Andreas L Delmelle

On Nov 1, 2005, at 15:52, Manuel Mall wrote:


Currently if one puts a zero-width-space (U+200B) into an XSL-FO file
(or specifies linefeed-treatment=treat-as-zero-width-space) it is
rendered as a missing character in PDF. Is that correct, i.e. does
this character have to exist in the font used or should the formatter
or renderer simply remove this character? It is the second approach
that both AntennaHouse and RenderX appear to have chosen.


It's certainly not correct to render a missing glyph character, but  
it would also be wrong to remove it too early. The character doesn't  
take part in white-space treatment/collapsing, since it's not XML  
whitespace. It's somewhere in layout that the decision has to be made  
not to allocate space for this character, but it could play a part in  
line-building...


My two cents.

Cheers,

Andreas



Re: zero width space

2005-11-01 Thread Manuel Mall
On Wed, 2 Nov 2005 01:03 am, Chris Bowditch wrote:
 Manuel Mall wrote:
  Currently if one puts a zero-width-space (U+200B) into an XSL-FO
  file (or specifies linefeed-treatment=treat-as-zero-width-space)
  it is rendered as a missing character in PDF. Is that correct,
  i.e. does this character have to exist in the font used or should
  the formatter or renderer simply remove this character? It is the
  second approach that both AntennaHouse and RenderX appear to have
  chosen.

 I recommend that no character is output for a ZWS. The whole purpose
 of placing a ZWS into the input XSL-FO is to give layout an extra
 break opportunity, without changing the appearance of the generated
 document.

That seems to be the consensus: consider ZWS for line breaking,
but then discard it and don't give it to the renderers.

Are there any other (unusual Unicode) characters which fall in the same
category, that is, they influence layout decisions but should not be seen
by the renderers?

 Chris

Manuel