-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 21.05.2014 20:05, Greg Hellings wrote: >> The RTFHTML filter code appears to incorrectly parse the >> following strings: >> >> "\u-999999" -> getUTF8FromUniChar(48577) "\u-99999" -> >> getUTF8FromUniChar(31073) "\u-0001" -> getUTF8FromUniChar(65535) >> "\u-00" -> getUTF8FromUniChar(0) "\u-0" -> getUTF8FromUniChar(0) >> "\u00" -> getUTF8FromUniChar(0) "\u001" -> getUTF8FromUniChar(1) >> "\u99999" -> getUTF8FromUniChar(34463) "\u-" -> >> getUTF8FromUniChar(0) "\u--" -> getUTF8FromUniChar(0) "\u--2" -> >> getUTF8FromUniChar(0) "\u-a" -> getUTF8FromUniChar(0) >> >> I think all these should instead fail. > > The last three should return -, 2, and a respectively if I read the > wiki page correctly that allows a final character to use when the > conversion otherwise won't work.
Ok, I missed the final character thing in both the wiki and the RTF spec. With respect to our context, the RTF spec points that \u{num} should immediately followed by an CP1252 character with, optionally, a keyword-delimiting space between the control word and that character. If another control word or control symbol follows, it is considered to be that character. But I think you are incorrect. "\u--", "\u--2" and "\u-a" should fail because there is NO signed number. Just "-" is NOT a number. Neither is "--2", which is an expression at best. I don't see any good reason why we should allow stuff like "\u------2" or "\u++-+-++-0". > Why you think the signed values that are zero prefixed should fail > I don't understand. Because 0-prefixed representations are in most contexts considered to be octal. > Those which fall beyond the range of a sixteen bit integer are the > only ones I might agree should fall. However, since Unicode now > exceeds sixteen bits, think it is our limitation that ought to > change. We can't change our limitation, because then we don't have RTF any more. And as I understand you want backwards compatibility. >> 4) \par incorrectly appends a newline. > > Why is a newline incorrect? Newlines are mostly ignored in HTML. There is good no reason to append one. Its useless extra data. Even for debugging purposes. Or, as a counterexample: why not instead append a pattern of 3 spaces followed a newline repeated 4 times? - it would be ignored anyway. IMHO, it lacks usefulness. >> 5) "a\qc b" is converted to "a<center> b", but should instead be >> "a<center>b</center>" (' ' RTF delimiter output, missing HTML >> </center> tag) >> >> 6) "a\par b" is converted to "a<p/> b", but should probably be >> "<p>a</p><p>b</p>" (' ' RTF delimiter output, missing HTML <p> >> and </p> tags. >> >> 7) Weird combinations of \par, \pard and \qc result in broken >> HTML fragments or HTML fragments with unbalanced start and end >> tags. > > I don't believe the contract of this filter guarantees valid HTML, > and HTML allows unbalanced tags. In fact it is preferred in some > older HTML specs for certain tags, p a prominent example of such > tags. IMHO you're wrong. At least it is not a valid XHTML (XML) or HTML 5 balanced fragment. I'm not completely sure about earlier HTML standards. The HTML 5 draft provides a guide on how to handle such invalid cases, but these are not considered "valid". And as such, one of two things should happen: 1) we should output valid HTML, 2) users of RTFHTML must fix the output or at least ensure the output is placed properly, so it doesn't interfere with any HTML places after the output. >> 8) Unsupported control sequences do not cause the function to >> fail, but are passed to output as plain text (including the >> backslash). >> >> 8) Unescaped '{', '}' and '\' characters are not handled properly >> (to pass these from RTF one would need to use the control symbols >> "\{", "\}" and "\\" respectively). > > > The rest of your objections seem to be based on a different > objective than SWORD filter objectives. The prose is not to force > compliance to a strict spec but instead to give a "best effort" > attempt at conversion. The same way that most browsers will accept > invalid input but make a best effort to display (unescaped & > characters will usually display as is and invalid nesting such as > having a div inside of a p tag still works out somewhat reasonably) > the SWORD engine is lax in what it accepts. > > It follows the general maxim "be strict in what you produce but > lenient in what you accept." Crosswire produced content should not > include such invalid input, but the engine is intentionally written > to make a best attempt to handle innocuous invalid input. This is > because we want to encourage as many people as possible to use the > engine even if they are not strict in what they produce. > >> If there are existing modules with bad content or in places where >> the filters are producing invalid output we should fix it, but we >> don't need to go and get stringent about the conversion throwing >> errors or the like because of an invalid control sequence or an >> unknown Unicode character. I have two big with these arguments. First, we have the current implementation RTFHTML and the wiki page documenting its behaviour, and these don't match. The current implementation does NOT properly parse valid RTF. For example, it fails to take into account RTF delimeters, e.g. "\u8364 ab" should output "€b" if the unicode character can be displayed, and "ab" otherwise. But currently (I think) it outputs getUTF8FromUniChar(8364) followed by " ab". Both ignoring the following character and the delimiter. Secondly, I've not seen a piece of documentation about accepting lax behaviour. Even if we do accept invalid input, we should do it in a safe and secure manner which does not result in invalid output. Additionally, there should be a switch whether "best effort" processing on lax input is allowed. Currently RTFHTML always succeeds without a warning. Blessings, Jaak -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQgcBAEBAgAGBQJTfdJ5AAoJELozJlbjIn791TxAAJeiwVhpWdOIfKqcxm3l0R1P GqDUB05kqsfNtoOMPZmZb8raKuPw6kol0ysN7e9xLMGdcd0Gp8/rxPY3O/WXhyOk RWSKYIGpPXyAE4ZMTz9O96jMQsJILxHTGwxmtogwD9XBotLrp4nlnbv4CNpkKd60 qNU9V/hCim2c/qMOokwwUZ5LskKV/5hENO560xdGOMM2ItfUpnBATb+wQQ0CBIba ZGddbAEb+aVZDWEBjajurVtCH9jlcwf7iyCbh54f+UQ8KmgBhE4rZUVHcoy49s16 sEFCQ8wOAcovJCI6+py5W49TneR8eCIGojLGFyvzFjt5pvXZCZkJV/tchkuxQxpR nswvNpD9DefWul8IsTlYy31zhJZu7FDvB2cB2p9z+pD+ndz/Zox5XrKOaWZRVm9u xmCqWB/K0jl2M4RwCvNgaafjDUTasOW7z+qU5d/n44/ohJA8w+GekYSyJPUUIaXJ 0Ki3OCZVAryJnQZqO1oTiFnpWDionQFsLvQJil8JzGbKLhwiUUXJb5Dx9dGIMBEa QVtZsIBjRn1PjM3PVpJ5pdMYNhPuaygLRskafZ6WfwsAr17mOTE8m/MBb6cqcQQA vE7RYYr7qUFi2aDCmSK62xO7eR1QwgIFJW46QPeeuHNhu15KNtsueOOGgqHepChU qK+XnigHZevCrV05K1cNc/F3M2HiunUL+6RgbY9/kBOOvN2xMo8lAowxwawcn6fY 1jvz3t8npf3s6Ki7BV+ccktdhyVVFKuobuARrBWcE34H8weRfXqdgsUOUmy3V7jk fGGbc9YlWxy8VDu4gbE2hjyw7aGVOf8tMy4Q0SbgkkBd6AmGBkVJOl5MZET4/EsU QQ/2c35WTkjodLjLbV0j2/TaOCdzq4D8A5g2Vxt8lPV1LNjG4MqT/8XmrNsQDoDu Rih1p0DricY9TQrLpvsyZfvIiT2fjTkOiW1jYWdJmtm15Ud3FWkbJw6BMyFV5NTJ tz7531Ku/JRrHKfxg2feSWX37ZfdhIshUs+xD3NPQTDlWrmS527/+dDI5EBd+D+L 7obK0HU/bLV+nXN+U42GsmtJyh5MgEGmmHJDXOPTKoSmawl5v6WxFtbZD641XKzu pG3HzePbK9GDYJvKAWfFlnnRBmQnD7IUKJzh8Rh8A0VdC1qKOYlMpMZ8/MkB+Ak5 Rxgz/NVOhFewe0f+2sLE1OCYUQChfcZEBW9WEd+rUu70CKejZjRBJF/HiwjzyzTD qy0SmfwbK1NTRR0409berIjIxNTCpN5OBQ+1upYtUQe90o4pjFlwTX1B/In+u5ox 5u8iTKFv1vqF7+7oFQIymZQl8IHSEkBoPL8HmZxD+DI++J4cSbRzIkDAoNagQias NU2/MYhI0kMS5Kfpe5twwwZ8kp0dxmY8wK/5bEEHKKfI9oBxmNSYWD8Swp/TiEyM 7lpkUmOL8hKVDUHxaeodDsZo3SmnnHr5F++ltbrGzvT9YnEQ6ai2oBetfm6zD5Zd 1n7/JlHmbqhn4w5U07njP9wN2pCvRG8v5xllE+wEDBrZNgHNWY6I6rKIn3+D0Rpj xg7YmgLrtNoRWfBm0H3RI27CsmvIQx2fzrxfgeZ/LbnjFAwy3w26UwoSYvtGFnr8 VFZ3VMOzn3U7Ttm8yE6xrumHaxN7WOQRJ3/YRN6fgOpV8Hx4bWkxAUi9wK7Qz77V by0YzdrEt4pTxr/sruUbH3HafvKWSPtQGoT03CWp5ozDBsAq1ESqP68RWoqgJv4d OLah93mnr/2rBQGTfR3Yg32rl9azdUBFnlzokpboYpIblT/s2R+jrADHkvejEG7W VLbmxwgOw+xyxalRQgczbZqEFSpH3ZVE1yKVgm61eBoWGgnXq9ppd621phZfVvNF xaCmO3uInc1KSpeuj50kaiMRdIrQv938AkXRjx7K+6QZ3y/GuV6U6Z6MJ+JcF37t B7JHhm/QmituX5FOykul4GvJbepoZTe1meztmEuxiT3S6LjBnKHRSIBvNq9r0Gzt tHZWGc0wPpHddwUE8qqnIHI5LPVVII0GpxLv7DDgyQVe/oDgZb4kvTc3PjNXVD42 6AB2rPEeJWI/yH1JNukhnAh6IgofvgYXR4AYlr8j0395uwbSRdjNxDpgqJKu3nVA KGcGF7MXLdDZzU973XUUTlvPzT4lch2GWtiCNTcjAaLTo4fm3KJVxDQeDHYN4NvO YkYps/pmDmaXcvVQq6Kv2i4spjN5pwky5Jf75XkBMTqzbLvvH/9uzM4A+qZJj8tN sPGBbyYOrvdksNccXVgdQC1joOKRfbXetnGsLElvkDCCekZZlF5eDWFt+TdlMSMQ 2QRq6TgV4nC+1Jfx6J/uLdWioBU//kJM5PEYOrEDoKRX4+ZebGC4k1fTwLd4mu/6 gWJQ0oIO4duesOv0Rtb0/JETuevgUzDcIUTBRAww/0+wnmD3dyDAZE3WAIlMcVyC k+8IWq7LS3OO3/LXcFyQHbUs8nLnrQH/9cTPCDZJkltSOuHY9ZO1mkVL6Z6GILJj zIPGcarNijagw1WagcAh1mDfK9SIi+AgYgcXc8Lcv+QNDogY1sBKc+9godHoXFKL 47ssnce4gnLFo8I13Epin8aUKucQ++OyPpv5LpxcqVKqCyeddXLH3gp927l5/sBT RZNich38Eg520YKpN8FKisguwi4IW7u+NylJITVDeQ93eqx9m90d+7OMVieN147u bxb2y04zhJz8pqKM73eS =GRPO -----END PGP SIGNATURE----- _______________________________________________ sword-devel mailing list: sword-devel@crosswire.org http://www.crosswire.org/mailman/listinfo/sword-devel Instructions to unsubscribe/change your settings at above page