Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-07 Thread Luca Furini


Manuel Mall wrote:

1. The suppress-at-line-break property can be applied to all characters. 
I would take the position at the moment that explicit specification of 
the suppress-at-line-break property is not supported and we worry about 
it at a later stage. I would certainly argue against just supporting it 
in the context of nbsp.


Ok, it's better to take a step at a time!

2. When we discussed UAX#14 line breaking on this list last year, Joerg 
pointed out that he had a table-driven implementation for it. At the 
time I took a look, liked it, updated it for compliance with the 
latest UAX#14 spec, and then shelved it for integration into FOP. That 
is, when we move determining line break opportunities to the LineLM 
level (which we discussed extensively before), we get UAX#14 
line breaking as part of it by integrating Joerg's implementation. As a 
consequence, I recommend against putting any UAX#14-specific stuff at 
the lower levels (e.g. TextLM) now in the context of fixing the nbsp 
problem. It will disappear anyway and IMO is therefore not worth the 
effort.


Ok, so for the moment I'll avoid considering interaction between spaces, 
and just fix the character-by-character element creation, which is ready 
and should be enough to handle the most common situations.


This also solves another bug concerning a nbsp being removed when starting 
a line.


I'll make the commit in a few minutes.

Regards
Luca


Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-06 Thread Manuel Mall
On Monday 06 February 2006 18:44, Luca Furini wrote:
 Manuel Mall wrote:
<snip/>
 
  1. Justified text: pen INF + elastic glue
  2. All other justification modes: either just a box of the width of
  the space or pen INF + fixed width glue.

 I think in both cases (justified / unjustified text) we could use
 either a sequence with only glues and penalties, or a sequence with
 boxes too.

 For the justified text, it could be:
    box w=0 + pen INF + elastic glue

 The choice of the sequence (completely suppressible / with boxes too)
 depends on the suppress-at-line-break property, whose default value
 is auto, meaning that only the normal U+0020 space is suppressed at
 a break.

 However, things are not so simple, and maybe we cannot just check the
 local value of the property. I see a couple of
 potentially-problematic situations.

<snip/>

Luca,

IMO nbsp (and any other Unicode special spaces) are outside the scope of 
XSL-FO whitespace handling. XSL-FO refers to whitespace as defined in 
XML. In XML only #x20, #x9, #xA, and #xD are considered whitespace. 
Therefore nbsp does not need to be considered when looking at 
white-space-treatment and white-space-collapse. Would that approach 
remove the complications you mentioned?


 If nbsps must be suppressed, should an empty line be created or not?

 WDYT?

 Regards
  Luca

Cheers

Manuel


Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-06 Thread Luca Furini

Manuel Mall wrote:

IMO nbsp (and any other Unicode special spaces) are outside the scope of 
XSL-FO whitespace handling. XSL-FO refers to whitespace as defined in 
XML. In XML only #x20, #x9, #xA, and #xD are considered whitespace. 
Therefore nbsp does not need to be considered when looking at 
white-space-treatment and white-space-collapse. Would that approach 
remove the complications you mentioned?


Thanks for the clarification, Manuel!

This solves the first supposed problem (interaction between nbsp and 
pretty-printing spaces), but the second one is still open: what happens if 
we have

  someContent<nbsp><space>otherContent ?

*IF* (and I'm not at all sure about this) there can be a break, then both 
spaces should be discarded: in order to implement the correct behaviour 
for this almost hypothetical situation, we would need to create elements 
for both spaces as a whole (and they could belong to different LMs), 
otherwise the algorithm would not be able to ignore the nbsp during the 
line breaking.


Anyway I think this is quite an unlikely combination of entities and 
properties :-) ; as I see you are already working on something else, for 
the moment I will prepare a patch for the most common situations.


Regards
Luca


Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-06 Thread Manuel Mall
On Monday 06 February 2006 22:35, Luca Furini wrote:
 Manuel Mall wrote:
  IMO nbsp (and any other Unicode special spaces) are outside the
  scope of XSL-FO whitespace handling. XSL-FO refers to whitespace as
  defined in XML. In XML only #x20, #x9, #xA, and #xD are considered
  whitespace. Therefore nbsp does not need to be considered when
  looking at white-space-treatment and white-space-collapse. Would
  that approach remove the complications you mentioned?

 Thanks for the clarification, Manuel!

 This solves the first supposed problem (interaction between nbsp and
 pretty-printing spaces), but the second one is still open: what
 happens if we have
    someContent<nbsp><space>otherContent ?
 *IF* (and I'm not at all sure about this) there can be a break, then
 both spaces should be discarded: 

IMO yes, there can be a break, and no, only the space needs to be removed. 
Again, the argument is that nbsp is not whitespace as per the XSL-FO 
definition and need not be removed.

What makes you think that both the nbsp and the space need to be 
removed around a FOP-generated line break?

 in order to implement the correct 
 behaviour for this almost hypothetical situation, we would need to
 create elements for both spaces as a whole (and they could belong to
 different LMs) otherwise the algorithm would not be able to ignore
 the nbsp during the line breaking.

 Anyway I think this is quite an unlikely combination of entities and
 properties :-) ; as I see you are already working on something else,
 for the moment I will prepare a patch for the most common situations.

 Regards
  Luca

Manuel


Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-06 Thread Luca Furini

Manuel Mall wrote:


This solves the first supposed problem (interaction between nbsp and
pretty-printing spaces), but the second one is still open: what
happens if we have
   someContent<nbsp><space>otherContent ?
*IF* (and I'm not at all sure about this) there can be a break, then
both spaces should be discarded: 


IMO yes, there can be a break, and no, only the space needs to be removed. 
Again, the argument is that nbsp is not whitespace as per the XSL-FO 
definition and need not be removed.


What makes you think that both the nbsp and the space need to be 
removed around a FOP-generated line break?


Oops, I forgot to add an important condition: if the user explicitly 
states that the nbsp must be discarded around a line break:

  <fo:inline suppress-at-line-break="suppress">&nbsp;</fo:inline>

Well, the more I look at this, the more it seems unlikely to ever happen 
... we are probably having a highly theoretical disquisition! :-)


Anyway, I was still not sure whether there could be a break so I looked 
back at the Unicode Annex #14.



GL  Non-breaking (Glue) (XB/XA)  (normative)

Non-breaking characters prohibit breaks on either side, but that 
prohibition can be overridden by SP or ZW. In particular, when NBSP 
follows SPACE, there is a break opportunity after the SPACE and NBSP will 
go as visible space onto the next line. See also WJ. The following lists 
the characters of line break class GL with additional description.


00A0 NO-BREAK SPACE (NBSP)
202F NARROW NO-BREAK SPACE (NNBSP)
180E MONGOLIAN VOWEL SEPARATOR (MVS)

NO-BREAK SPACE is the preferred character to use where two words should be 
visually separated but kept on the same line, as in the case of a title 
and a name Dr.<NBSP>Joseph Becker. When SPACE follows NBSP, there is no 
break, because there never is a break in front of SPACE.  NARROW NO-BREAK 
SPACE is used in Mongolian. The mongolian vowel separator acts like a 
NNBSP in its line breaking behavior. It additionally affects the shaping 
of certain vowel characters as described in [Unicode] Section 12.3, 
Mongolian.



So, it seems there could be a break between SPACE and NBSP (with NBSP 
starting the next line), but not between NBSP and SPACE. Can we say this 
is settled?
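
As a rough illustration of the rule quoted above, here is a minimal Python sketch (not FOP code, and not a full UAX#14 implementation). It models only the SPACE/NBSP interaction; the function names are made up for this example, and every other pair of characters is simply treated as unbreakable.

```python
from typing import List

SPACE = "\u0020"
NBSP = "\u00A0"


def break_allowed_between(before: str, after: str) -> bool:
    """Return True if a line break may occur between two adjacent characters.

    Simplified model: only SPACE and NBSP get special treatment.
    """
    if after == SPACE:
        # There is never a break in front of SPACE.
        return False
    if before == SPACE:
        # SP overrides the glue prohibition: break after the SPACE,
        # so a following NBSP goes onto the next line.
        return True
    # NBSP glues to its neighbours; two ordinary characters are also
    # kept together in this simplified model.
    return False


def break_opportunities(text: str) -> List[int]:
    """Indices i such that a break is allowed between text[i-1] and text[i]."""
    return [i for i in range(1, len(text))
            if break_allowed_between(text[i - 1], text[i])]
```

With this model, "a" + SPACE + NBSP + "b" has one opportunity (after the SPACE, with the NBSP starting the next line), and "a" + NBSP + SPACE + "b" has one only after the trailing SPACE, never between NBSP and SPACE.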


Regards
Luca


Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-06 Thread Andreas L Delmelle

On Feb 6, 2006, at 08:17, Manuel Mall wrote:


[ME:]

<snip/>

A preserved carriage return can be treated the same way as a
linefeed, under the very exceptional condition that it survives
white-space handling:
  * white-space-treatment=ignore-if-*
  * the CR does not follow/precede a linefeed
  * it is the first character in a sequence of whitespace, so
    it survives white-space-collapse



Shouldn't a CR always survive whitespace handling?


Not always:
If white-space-treatment=preserve then any XML whitespace other 
than a linefeed is converted into a normal space. IMO, the editors 
put it this way because of the possibility of Windows-specific 
line endings, where a CR is followed by a linefeed.



For starters, it is fairly difficult to get a CR out of an XML parser.


Difficult? It's simply a characters event, just like any other...


Only if the CR is hidden in an entity reference can it survive.
Also, as Simon pointed out in some other contribution, whitespace 
handling is designed to deal with whitespace introduced by pretty 
printing and readable XML layout. A CR preserved by the XML parser 
certainly does not fall into that category.


Oh yes it does... Remember that not all our users are unix/linux-based, 
which means that for Windows users, you're likely to get the 
sequence '&#x0D;&#x0A;' as line terminator, while Mac users saving a 
source file with native line endings will simply get a '&#x0D;'. 
(UTF-8 encoding is recommended, but not enforced... An XML file can 
be in any encoding the parser supports on top of the UTF-8 minimum.)


A carriage-return can survive white-space-handling, for instance, in  
the following case (suppose Mac-encoding):


<fo:block>
  First line, then a CR&#x0D; some spaces, and more text
</fo:block>

The CR (which isn't necessarily a Numeric Character Reference, but 
could be just the byte '0D') is not converted into a space 
(white-space-treatment=ignore-if-surrounding-linefeed).

It does not precede or follow a linefeed.
It is the first character in a sequence of whitespace, so no matter 
what the value of white-space-collapse, it will survive...


I am also not aware that the XSL-FO spec mentions CR as falling 
under whitespace. IMO for whitespace handling CR is just a 
non-whitespace character.


Nope, it does fall into the category of XML whitespace. There are 
exactly four of those: &#x09; (tab), &#x0A; (linefeed), &#x0D; 
(carriage-return) and &#x20; (space). If you don't believe me, it's 
indeed not in the XSL-FO Rec, but you might want to check the XML 
Recommendation...
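
For illustration only, the four-character definition can be written down as a tiny Python check (the names are invented for this sketch and are not FOP identifiers). It also shows why NBSP falls outside XML whitespace handling, even though general-purpose "is this a space?" helpers usually say yes.

```python
# The four characters the XML Recommendation classifies as white space:
# tab, linefeed, carriage return and space.
XML_WHITESPACE = frozenset("\u0009\u000A\u000D\u0020")

NBSP = "\u00A0"


def is_xml_whitespace(ch: str) -> bool:
    """Return True only for the four XML white space characters."""
    return ch in XML_WHITESPACE
```

Note that Python's own `str.isspace()` returns True for NBSP, so a generic Unicode test is not a substitute for the XML definition here.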


So, we only need to consider what fop layout should do if it 
encounters a CR. I would say, keep it simple, throw it away and log 
a warning.


Now, what about a tab character under the same circumstances? Do we
use an elastic width of X spaces optimum, where X is purely
conventional?



Similar considerations as for CR apply to TAB.


...

Cheers,

Andreas


Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-06 Thread Andreas L Delmelle

On Feb 6, 2006, at 17:04, Luca Furini wrote:

Hi Manuel / Luca,


Manuel Mall wrote:


IMO yes, there can be a break, and no, only the space needs to be 
removed. Again, the argument is that nbsp is not whitespace as per the 
XSL-FO definition and need not be removed.


What makes you think that both the nbsp and the space need to be 
removed around a FOP-generated line break?


Oops, I forgot to add an important condition: if the user 
explicitly states that the nbsp must be discarded around a line break:

  <fo:inline suppress-at-line-break="suppress">&nbsp;</fo:inline>


Oops, typo? suppress-at-line-break is a non-inherited property, only  
applicable to fo:character :-)


Well, the more I look at this, the more it seems unlikely to ever  
happen ... we are probably having a highly theoretical  
disquisition! :-)


<fo:character character="&#xA0;" suppress-at-line-break="suppress" />

followed by a space is indeed very theoretical.

So is (another alternative):

<fo:inline suppress-at-line-break="suppress">
  <fo:character character="&#xA0;"
    suppress-at-line-break="inherit" /> </fo:inline>

OTOH, if we can make the algorithm work in these exotic cases, then  
the commonly used scenarios will be a cake-walk. :-)


This does, in any case, shed some different light on the notion of 
'pretty printing whitespace', since currently --at least that was my 
understanding of the discussions, and that's what I worked towards-- 
a fo:character is considered the same as a regular character, in that 
fo:characters representing XML whitespace are subject to 
whitespace-removal... Yet, one can arguably defend the idea that any 
*fo:*character is inserted for *XML* pretty printing purposes, no? 
Should this change be reverted then?

[Maybe partly, because suppose:

<fo:block>
  <fo:character character="&#x20;" suppress-at-line-break="retain" />
...

Currently, the fact that it is a fo:character is not known when 
running this through the algorithm. The CharIterators deal with the 
characters. The XMLWhiteSpaceHandler makes a decision based purely on 
the value of the character property. It is agnostic to the 
suppress-at-line-break property's value... I myself would tend to use a 
non-breaking space in this case, since it escapes the whitespace 
handling, but it is a theoretical possibility. :-)


Another alternative would be to introduce a member to the  
CharIterators...

Something like isSuppressible(), which would return true if:
   ( the current element is a regular character
     and it has codepoint U+0020 )
or ( the current element is a fo:character
     and
     (    ( the value of its character property is codepoint U+0020
            and suppress-at-line-break=auto )
       or ( suppress-at-line-break=suppress ) ) )

As such, refinement (white-space)-character-removal could operate on  
this basis, and already resolve such issues at that stage.


The current approach is still not 100% correct anyway...]
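
The isSuppressible() condition above can be sketched as follows. This is a language-neutral illustration written in Python, not the CharIterator API: the `CharItem` record and its field names are hypothetical and exist only to carry the three pieces of information the condition needs.

```python
from dataclasses import dataclass

SPACE = "\u0020"
NBSP = "\u00A0"


@dataclass(frozen=True)
class CharItem:
    """Hypothetical view of the iterator's current element."""
    char: str
    from_fo_character: bool = False
    # Only meaningful when from_fo_character is True.
    suppress_at_line_break: str = "auto"


def is_suppressible(item: CharItem) -> bool:
    """Mirror the proposed condition, one branch per clause."""
    if not item.from_fo_character:
        # Regular character: only U+0020 is suppressible.
        return item.char == SPACE
    if item.suppress_at_line_break == "suppress":
        # An explicit "suppress" wins regardless of the character.
        return True
    # fo:character with "auto": suppressible only if it is U+0020.
    return item.char == SPACE and item.suppress_at_line_break == "auto"
```

Under this reading, a plain nbsp is never suppressible, a fo:character holding U+0020 with suppress-at-line-break=retain is kept, and a fo:character holding U+00A0 with suppress-at-line-break=suppress is the exotic case discussed earlier.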

Anyway, I was still not sure whether there could be a break so I  
looked back at the Unicode Annex #14.

<snip />
So, it seems there could be a break between SPACE and NBSP (with  
NBSP starting the next line), but not between NBSP and SPACE. Can  
we say this is settled?


Yes! Definitely. We're looking for UAX#14 'compliance' as well here.

My 2 cents.

Cheers,

Andreas



Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-06 Thread Andreas L Delmelle

On Feb 6, 2006, at 19:40, Andreas L Delmelle wrote:

Currently, the fact that it is a fo:character is not known when  
running this through the algorithm. The CharIterators deal with the  
characters.


Say... I was just wondering: why does the TextLayoutManager create  
its own copy of the FOText's character array? Could the LMs be made  
to re-use the CharIterators' functionality to get to the characters,  
or would that mean a draw on performance somehow?


Anyone?

Cheers,

Andreas



Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-05 Thread Andreas L Delmelle

On Feb 5, 2006, at 14:13, [EMAIL PROTECTED] wrote:

Hi Manuel,

--- Additional Comments From [EMAIL PROTECTED]  2006-02-05  
14:13 ---
Jeremias, no that is not it IMO. Knuth doesn't break between 
elements as such. The glue or penalty element itself is the break 
opportunity and is discarded when used as a break. Therefore, IMO 
we are not breaking before or after a space or NBSP but at the 
space/NBSP.
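
To make the point about breaking "at" an element concrete, here is a small sketch of the standard Knuth-Plass rule for legal breakpoints: a penalty is a legal break if its value is below infinity, and a glue is a legal break only if it is immediately preceded by a box. The classes are stand-ins written for this example, not FOP's Knuth element classes, and the value chosen for INFINITE is an assumption.

```python
from dataclasses import dataclass
from typing import List, Union

INFINITE = 1000  # assumed stand-in for an "infinite" penalty value


@dataclass
class Box:
    width: int


@dataclass
class Glue:
    width: int
    stretch: int
    shrink: int


@dataclass
class Penalty:
    width: int
    penalty: int


Element = Union[Box, Glue, Penalty]


def legal_breakpoints(elements: List[Element]) -> List[int]:
    """Return the indices of elements at which a line may be broken."""
    result = []
    for i, el in enumerate(elements):
        if isinstance(el, Penalty) and el.penalty < INFINITE:
            result.append(i)
        elif isinstance(el, Glue) and i > 0 and isinstance(elements[i - 1], Box):
            result.append(i)
    return result
```

For word + space + word, i.e. `[Box, Glue, Box]`, the glue at index 1 is the break. For word + NBSP + word modelled as `[Box, Penalty(INFINITE), Glue, Box]`, there is no legal break at all, because the glue is no longer preceded by a box.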


OK, IIC you're directing this at the wrong person... The last  
question was mine. :-)


The problem is the coding model used for Knuth element 
generation for spaces is flawed. What is done is that the only 
difference between normal space and NBSP is an infinite penalty at 
the beginning of the sequence.


Yep. There are a few other gaps in that coding model that I'm currently 
looking at. See my most recent commit, and the change to the white-space 
Wiki. It created some nasty side-effects in exotic situations... 
currently under investigation.

A preserved carriage return can be treated the same way as a 
linefeed, under the very exceptional condition that it survives 
white-space handling:

 * white-space-treatment=ignore-if-*
 * the CR does not follow/precede a linefeed
 * it is the first character in a sequence of whitespace, so
   it survives white-space-collapse

Now, what about a tab character under the same circumstances? Do we  
use an elastic width of X spaces optimum, where X is purely  
conventional?


However, some sequences are pretty long and involve multiple 
pen-glue combinations and therefore break opportunities further into 
the sequence. We probably need to separate this more cleanly. Have 
one function for non-breaking elastic elements (e.g. NBSP) and one 
function for breaking elastic elements (e.g. SPACE). The 
non-breaking sequences are probably very simple:

1. Justified text: pen INF + elastic glue
2. All other justification modes: either just a box of the width of 
the space or pen INF + fixed width glue.

Curious what Luca and others think. Are the above two cases OK for 
NBSP or have I oversimplified and missed something, that is, for the 
text-align values other than justify, that is start, center, end, 
is it enough to just reserve a fixed width for the NBSP?


Still depends on text-align-last, no?
BTW, is this not one of those situations where it's possible that the  
used font contains a glyph for the NBSP character, so we should check  
that as well?


Cheers,

Andreas



Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output

2006-02-05 Thread Manuel Mall
 On Feb 5, 2006, at 14:13, [EMAIL PROTECTED] wrote:

 Hi Manuel,

 --- Additional Comments From [EMAIL PROTECTED]  2006-02-05
 14:13 ---
<snip/>
 A preserved carriage return can be treated the same way as a
 linefeed, under the very exceptional condition that it survives
 white-space handling:
   * white-space-treatment=ignore-if-*
   * the CR does not follow/precede a linefeed
   * it is the first character in a sequence of whitespace, so
     it survives white-space-collapse


Shouldn't a CR always survive whitespace handling? For starters, it is
fairly difficult to get a CR out of an XML parser. Only if the CR is hidden
in an entity reference can it survive. Also, as Simon pointed out in some
other contribution, whitespace handling is designed to deal with whitespace
introduced by pretty printing and readable XML layout. A CR preserved by
the XML parser certainly does not fall into that category. I am also not
aware that the XSL-FO spec mentions CR as falling under whitespace. IMO
for whitespace handling CR is just a non-whitespace character.

So, we only need to consider what fop layout should do if it encounters a
CR. I would say, keep it simple, throw it away and log a warning.

 Now, what about a tab character under the same circumstances? Do we
 use an elastic width of X spaces optimum, where X is purely
 conventional?


Similar considerations as for CR apply to TAB.

Anyway, both CR and TAB do not have much to do with the problem at hand:
NBSP not being handled correctly.

<snip/>
 The non breaking sequences are probably very simple:

 1. Justified text: pen INF + elastic glue
 2. All other justification modes: either just a box of the width of
 the space
 or pen INF + fixed width glue.

 Curious what Luca and others think. Are the above two cases OK for
 NBSP or have I oversimplified and missed something, that is, for the
 text-align values other than justify, that is start, center, end,
 is it enough to just reserve a fixed width for the NBSP?

 Still depends on text-align-last, no?

Yes, correct, but even then, do the two rules above suffice, i.e.
justification possibly required: Rule 1; no justification required: Rule 2?

 BTW, is this not one of those situations where it's possible that the
 used font contains a glyph for the NBSP character, so we should check
 that as well?

Yes, but again it has very little to do with the problem. If the font has a
glyph for NBSP we should use that glyph's width and not the SP width in the
glue elements generated. That's all.
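
Putting the two rules and the glyph-width remark together, a minimal sketch could look like this. It is an illustration in Python under stated assumptions, not the TextLM code: elements are plain tuples, `nbsp_elements` and its parameters are invented names, INFINITE is an assumed value, and for Rule 2 the "pen INF + fixed width glue" variant is chosen over the single box.

```python
from typing import List, Optional, Tuple

INFINITE = 1000  # assumed stand-in for an "infinite" penalty value

# ("penalty", width, value) or ("glue", width, stretch, shrink)
Element = Tuple


def nbsp_elements(may_justify: bool,
                  space_width: int,
                  stretch: int,
                  shrink: int,
                  nbsp_glyph_width: Optional[int] = None) -> List[Element]:
    """Build the non-breaking sequence for one NBSP.

    may_justify should be True whenever justification may be required,
    which also covers the text-align-last case raised above.
    """
    # Prefer the font's own NBSP glyph width when it has one.
    width = space_width if nbsp_glyph_width is None else nbsp_glyph_width
    no_break = ("penalty", 0, INFINITE)
    if may_justify:
        # Rule 1: pen INF + elastic glue
        return [no_break, ("glue", width, stretch, shrink)]
    # Rule 2: pen INF + fixed width glue
    return [no_break, ("glue", width, 0, 0)]
```

For example, `nbsp_elements(False, 300, 150, 100, nbsp_glyph_width=280)` yields an infinite penalty followed by a rigid glue of width 280.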


 Cheers,

 Andreas


Manuel