Re: script property
Manuel Mall wrote:
> On Wed, 5 Oct 2005 04:17 pm, Jeremias Maerki wrote:
>> On 05.10.2005 09:46:18 Manuel Mall wrote:
>>> While I am at it (this whole alignment stuff I mean) we may as well do it properly. This would include support for the script property. The allowed values for script are defined for example here: http://www.unicode.org/iso15924/iso15924-codes.html. I assume we don't bother to validate if a correct code has been provided as we don't do that for the country and language properties either (should we? If we do we need more external config files or expand fop.xconf to hold those values as they tend to change over time).
>>
>> We don't have to but we could. Since this is not something that changes often I wouldn't put it into the config file, but in resource files instead.
>
> OK - makes sense.

Validation issues considered in alt-design circa 2002. See CountryLanguageScript.java in the alt-design code for an attempt at this. Generated from xml-lang.xml and xml-lang.xsl. No baselines.

Peter
--
Peter B. West http://cv.pbw.id.au/
Folio http://defoe.sourceforge.net/folio/
Re: script property
Manuel Mall wrote:
> What we also need for proper script support is a mapping from Unicode code point to script.

On second thought: isn't this what class Character.UnicodeBlock does?

J.Pietschmann
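For readers following the thread, here is a minimal sketch of what Character.UnicodeBlock offers, using only the standard java.lang API. The class name BlockLookupSketch is made up for illustration. Note that a block is a contiguous code point range, which is not the same thing as a script: one script can span several blocks, and a block can contain characters of more than one script.

```java
// Sketch only: BlockLookupSketch is a hypothetical name, not an existing FOP class.
final class BlockLookupSketch {

    /**
     * Returns the Unicode block containing the given code point,
     * or null if the code point is not a member of a defined block.
     */
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // U+0041 LATIN CAPITAL LETTER A
        System.out.println(blockOf(0x0041));
        // U+0A05 GURMUKHI LETTER A
        System.out.println(blockOf(0x0A05));
    }
}
```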
Re: script property
On Fri, 7 Oct 2005 03:30 am, J.Pietschmann wrote:
> Manuel Mall wrote:
>> What we also need for proper script support is a mapping from Unicode code point to script.
>
> On a second thought: isn't this what Class Character.UnicodeBlock does?

Joerg,

Thank you - I didn't even know that this class existed. I don't think it quite solves all issues, though:

a) We need a mapping from the ISO 4-letter codes to the Character.UnicodeBlock classes.
b) We need a mapping from the Character.UnicodeBlock to script properties (actually, at this point in time the only property I am aware of is the default baseline for the script).

Maybe a wrapper around this class to provide that functionality?

> J.Pietschmann

Manuel
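Points (a) and (b) could be sketched roughly as follows. This is not the UnicodeBlock wrapper proposed above: it swaps in Character.UnicodeScript, which the JDK only added in Java 7, well after this thread. The names ScriptBaselines and Baseline are hypothetical, and the single Gurmukhi entry is taken from the example in the original post; any further entries would have to be researched.

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch only: ScriptBaselines and Baseline are hypothetical names, not FOP classes.
final class ScriptBaselines {

    enum Baseline { ALPHABETIC, HANGING }

    private static final Map<Character.UnicodeScript, Baseline> DEFAULTS =
            new EnumMap<>(Character.UnicodeScript.class);

    static {
        // The only mapping mentioned in the thread; treat it as an example, not as verified data.
        DEFAULTS.put(Character.UnicodeScript.GURMUKHI, Baseline.HANGING);
    }

    /** Point (a) and (b) combined: ISO 15924 code such as "Guru" to a default baseline. */
    static Baseline forIsoCode(String isoCode) {
        try {
            // forName accepts script names and their ISO 15924 aliases, ignoring case.
            Character.UnicodeScript script = Character.UnicodeScript.forName(isoCode);
            return DEFAULTS.getOrDefault(script, Baseline.ALPHABETIC);
        } catch (IllegalArgumentException unknownCode) {
            // Undefined scripts fall back to the alphabetic baseline.
            return Baseline.ALPHABETIC;
        }
    }

    /** Code point to default baseline, going through the code point's script. */
    static Baseline forCodePoint(int codePoint) {
        return DEFAULTS.getOrDefault(Character.UnicodeScript.of(codePoint), Baseline.ALPHABETIC);
    }
}
```

With this sketch, ScriptBaselines.forIsoCode("Guru") yields HANGING, while an unlisted or unrecognized code yields ALPHABETIC. A real implementation might also want to log a warning in the fallback branch.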
script property
While I am at it (this whole alignment stuff I mean) we may as well do it properly. This would include support for the script property.

The allowed values for script are defined, for example, here: http://www.unicode.org/iso15924/iso15924-codes.html. I assume we don't bother to validate whether a correct code has been provided, as we don't do that for the country and language properties either (should we? If we do, we need more external config files or need to expand fop.xconf to hold those values, as they tend to change over time).

But what we do need is a mapping from scripts to default baselines for these scripts. I haven't found a mapping list on the net. Has anyone come across something like that? Otherwise we may have to make that up. That means entries somewhere similar to: <script code="Guru" baseline="hanging"/>. Is the fop config file the right place for this stuff? Any undefined scripts encountered in an fo file would map to baseline=alphabetic (maybe with a warning to the user?).

What we also need for proper script support is a mapping from Unicode code point to script. The mappings are, for example, defined here: http://www.unicode.org/Public/UNIDATA/Scripts.txt. How would one best process this (has this been done in FOP before?)? Is there other Unicode stuff FOP needs which should be considered at the same time? Are we better off working with the raw Unicode data (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)?

Manuel
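One way to process Scripts.txt is sketched below. It assumes the file's data lines follow the layout "XXXX..YYYY ; Script # comment" (or a single code point instead of a range) and keeps the ranges in a sorted map for lookup. ScriptsTxtParser and its methods are hypothetical names; nothing here is taken from existing FOP code.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: ScriptsTxtParser is a hypothetical name, not an existing FOP class.
final class ScriptsTxtParser {

    /** End point and script name of one contiguous run of code points. */
    private static final class Range {
        final int last;
        final String script;

        Range(int last, String script) {
            this.last = last;
            this.script = script;
        }
    }

    // Keyed by the first code point of each range so floorEntry() finds the candidate range.
    private final TreeMap<Integer, Range> ranges = new TreeMap<>();

    /** Parses data lines; comment-only and blank lines are skipped. */
    void parse(Iterable<String> lines) {
        for (String line : lines) {
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue;
            }
            String[] fields = data.split(";");
            if (fields.length < 2) {
                continue; // not a "range ; script" line
            }
            String[] bounds = fields[0].trim().split("\\.\\.");
            int first = Integer.parseInt(bounds[0], 16);
            int last = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : first;
            ranges.put(first, new Range(last, fields[1].trim()));
        }
    }

    /** Returns the script name for a code point, or the fallback if no range covers it. */
    String scriptOf(int codePoint, String fallback) {
        Map.Entry<Integer, Range> entry = ranges.floorEntry(codePoint);
        if (entry != null && codePoint <= entry.getValue().last) {
            return entry.getValue().script;
        }
        return fallback;
    }
}
```

Each lookup is a single floorEntry call, so the cost grows only logarithmically with the number of ranges. The names returned are the long script names used in the file (for example "Latin"), not ISO 15924 codes, so a separate name-to-code table would still be needed.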
Re: script property
On 05.10.2005 09:46:18 Manuel Mall wrote:
> While I am at it (this whole alignment stuff I mean) we may as well do it properly. This would include support for the script property. The allowed values for script are defined for example here: http://www.unicode.org/iso15924/iso15924-codes.html. I assume we don't bother to validate if a correct code has been provided as we don't do that for the country and language properties either (should we? If we do we need more external config files or expand fop.xconf to hold those values as they tend to change over time).

We don't have to but we could. Since this is not something that changes often I wouldn't put it into the config file, but in resource files instead.

> But what we do need is a mapping from scripts to default baselines for these scripts. I haven't found a mapping list on the net. Any one come across something like that?

Nope.

> Otherwise we may have to make that up. That means entries somewhere similar to: <script code="Guru" baseline="hanging"/>. Is the fop config file the right place for this stuff?

Again, I'd put it in separate resource files as this is not going to change often and a rebuild of FOP is not the end of the world in this case.

> Any not defined scripts encountered in an fo file would map to baseline=alphabetic (may be with a warning to the user?).

Sure.

> What we also need for proper script support is a mapping from Unicode code point to script. The mappings are for example defined here: http://www.unicode.org/Public/UNIDATA/Scripts.txt. How would one best process this?

<shrug/>

> (has this been done in FOP before?)

I don't think so.

> Is there other Unicode stuff FOP needs which should be considered at the same time? Are we better off working with the raw Unicode data (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)?

<shrug/>

We should simply make sure that this doesn't influence performance too much for the big majority of users happy to use Latin scripts. After all, this looks like many lookups are necessary and all these maps have to be loaded at one point.

Jeremias Maerki
Re: script property
Jeremias Maerki wrote:
>> What we also need for proper script support is a mapping from Unicode code point to script.
>> ...
>> (has this been done in FOP before?)
>
> I don't think so.

Have a look at http://people.apache.org/~pietsch/linebreak.tar.gz

Occasionally I've thought about some sort of Jakarta Commons Unicode file component, but the guys there weren't all that enthusiastic about this, and I don't have enough time to get the ball rolling all on my own.

J.Pietschmann
Re: script property
On Thu, 6 Oct 2005 04:23 am, J.Pietschmann wrote:
> Jeremias Maerki wrote:
>>> What we also need for proper script support is a mapping from Unicode code point to script.
>>> ...
>>> (has this been done in FOP before?)
>>
>> I don't think so.
>
> Have a look at http://people.apache.org/~pietsch/linebreak.tar.gz
>
> Occasionally I've thought about some sort of Jakarta commons Unicode file component, but the guys there weren't all that enthusiastic about this, and I've not enough time to get the ball rolling all of my own.

Joerg, thanks for that. Do I understand correctly that you use a Java code generation approach here? That is, you generate Java source code from the Unicode text files, which is then compiled as part of the line breaking code? I'm not so sure I like that, but then again, if it works.

For me this type of stuff feels more like pure data, but of course we don't want to parse these text files each time FOP loads. What about the hyphenation pattern approach? Store it as a serialized object and treat it more like a resource? Accessing that should be comparable in time to class loading (I think; I have never tested that empirically).

I haven't studied your code in detail, but could we / should we integrate this into the FOP trunk to support 'Unicode compliant' line breaking? My main goal is still to make FOP happen, so I wouldn't like to dilute my effort / time by arguing for / establishing another commons subproject at the moment. What about creating an org.apache.fop.unicode package for the time being, where we keep Unicode-specific support stuff? That can then be refactored into a commons subproject at a later stage if the time/will/energy is there.

> J.Pietschmann

Manuel
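The serialized-object idea can be illustrated with plain Java serialization. This is a minimal sketch, not FOP's actual hyphenation loader: SerializedTableSketch is a hypothetical name, the table is simplified to a map from range start to script name, and a byte array stands in for the resource file that a build step would write and that FOP would read at startup (for instance through Class.getResourceAsStream).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.TreeMap;

// Sketch only: SerializedTableSketch is a hypothetical name, not an existing FOP class.
final class SerializedTableSketch {

    /** Build step: turn an already parsed table into bytes that could be stored as a resource. */
    static byte[] write(TreeMap<Integer, String> table) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(table);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Load step: read the table back without parsing any Unicode text file. */
    @SuppressWarnings("unchecked")
    static TreeMap<Integer, String> read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (TreeMap<Integer, String>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not load serialized table", e);
        }
    }
}
```

Whether loading such a resource is really comparable to class loading would still have to be measured, as noted above.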
Re: script property
On Wed, 5 Oct 2005 04:17 pm, Jeremias Maerki wrote:
> On 05.10.2005 09:46:18 Manuel Mall wrote:
>> While I am at it (this whole alignment stuff I mean) we may as well do it properly. This would include support for the script property. The allowed values for script are defined for example here: http://www.unicode.org/iso15924/iso15924-codes.html. I assume we don't bother to validate if a correct code has been provided as we don't do that for the country and language properties either (should we? If we do we need more external config files or expand fop.xconf to hold those values as they tend to change over time).
>
> We don't have to but we could. Since this is not something that changes often I wouldn't put it into the config file, but in resource files instead.

OK - makes sense.

>> But what we do need is a mapping from scripts to default baselines for these scripts. I haven't found a mapping list on the net. Any one come across something like that?
>
> Nope.
>
>> Otherwise we may have to make that up. That means entries somewhere similar to: <script code="Guru" baseline="hanging"/>. Is the fop config file the right place for this stuff?
>
> Again, I'd put it in separate resource files as this is not going to change often and a rebuild of FOP is not the end of the world in this case.

My suggestion was based on the assumption that if we have to make up the mappings from script to baseline ourselves, we may get it wrong. Therefore, leave it up to the user to add the mappings for his/her language/script environment to the config file. Most users will deal with only a very few scripts, so it's not a big deal.

>> Any not defined scripts encountered in an fo file would map to baseline=alphabetic (may be with a warning to the user?).
>
> Sure.
>
>> What we also need for proper script support is a mapping from Unicode code point to script. The mappings are for example defined here: http://www.unicode.org/Public/UNIDATA/Scripts.txt. How would one best process this?
>
> <shrug/>
>
>> (has this been done in FOP before?)
>
> I don't think so.

See Joerg's response.

>> Is there other Unicode stuff FOP needs which should be considered at the same time? Are we better off working with the raw Unicode data (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)?
>
> <shrug/>

Seems like line breaking (and hyphenation, e.g. script-specific hyphenation character) may also need Unicode stuff (not necessarily from the raw data file though).

> We should simply make sure that this doesn't influence performance too much for the big majority of users happy to use latin scripts. After all, this looks like many lookups are necessary and all these maps have to be loaded at one point.

Yes, that is a valid consideration. Maybe it needs to be designed in a way that these lookups can be disabled and replaced by defaults from the config file.

> Jeremias Maerki

Manuel
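The "lookups can be disabled" idea might look like the sketch below. All names are hypothetical, including the notion of a boolean switch and a default script read from the config file; no such FOP configuration keys are implied. When lookups are off, the configured default is returned and the tables are never loaded; when they are on, the tables are loaded on first use only.

```java
import java.util.function.IntFunction;
import java.util.function.Supplier;

// Sketch only: all names are hypothetical, not existing FOP classes or config keys.
// Not thread-safe; a real implementation would need to guard the lazy load.
final class ConfigurableScriptResolver {

    private final boolean lookupsEnabled;
    private final String defaultScript;
    private final Supplier<IntFunction<String>> tableLoader;
    private IntFunction<String> table; // stays null until the first real lookup

    ConfigurableScriptResolver(boolean lookupsEnabled, String defaultScript,
                               Supplier<IntFunction<String>> tableLoader) {
        this.lookupsEnabled = lookupsEnabled;
        this.defaultScript = defaultScript;
        this.tableLoader = tableLoader;
    }

    /** Returns the script for a code point, or the configured default. */
    String scriptOf(int codePoint) {
        if (!lookupsEnabled) {
            return defaultScript; // fast path: no table is ever loaded
        }
        if (table == null) {
            table = tableLoader.get(); // load the (possibly large) table once
        }
        String script = table.apply(codePoint);
        return script != null ? script : defaultScript;
    }

    /** True once the table has actually been loaded. */
    boolean tableLoaded() {
        return table != null;
    }
}
```

This keeps the cost at zero for users who only need the default, which is the performance concern raised earlier in the thread.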