Re: [rt-users] RT saves data in quoted-printable, why???
Hi,

On Thu, Mar 05, 2015 at 06:37:21PM -0500, Alex Vandiver wrote:
> On Fri, 6 Mar 2015 00:06:32 +0100 Václav Ovsík wrote:
> > https://issues.bestpractical.com/Ticket/Display.html?id=29735
>
> Aha -- thanks for digging that out!  I thought I vaguely recalled
> something in this area previously.
> https://issues.bestpractical.com/Ticket/Attachment/286095/157750/utf8-encoding.patch
> looks to be functionally fairly similar to the branch.

Thanks for your attention to this...

> There are a few other, orthogonal fixes in there that may still be
> interesting to tease out into their own commits.  It looks like I see
> changes to:
>
>  * Fix the computed max size of base64'd attachments; I'd need to
>    squint at it harder, but seems eminently reasonable.
>
>  * Attempt to gracefully deal with TruncateLongAttachments truncating
>    mid-byte of UTF-8 data.  As above; the decode/encode is an
>    interesting trick to attempt to ensure that the byte stream is
>    consistent.  I'd like to test it a bit, but seems not unreasonable.

It is perhaps not the most efficient approach, but it is simple, and safety first. :)

>  * Choose base64 vs QP based on which is shorter; I'm less convinced by
>    this, since it means that for large data, it gets QP'd, base64'd, and
>    then one of those _again_ -- which isn't terribly efficient.  I'm less
>    convinced by the tradeoff of computation time to stored in-database
>    size.

You are right. My intention was to keep as much readable text as possible. A text may contain some invalid characters while the rest remains readable; in that case QP is more appropriate, because it leaves most of the text readable. Comparing the lengths of the Base64 and QP encodings effectively measures how many ASCII characters are present:

    len(Base64) < len(QP)  -- mostly binary data, probably an octet stream
    len(QP) < len(Base64)  -- mostly ASCII characters, probably text

But this is probably a corner case and not very interesting. Most text should be valid UTF-8 these days, and the rest is not interesting.
> If you're interested in reworking the patch into a 2-3 commit series,
> I'm happy to apply for 4.2-trunk.
>  - Alex

https://github.com/bestpractical/rt/compare/stable...zito:4.2-zito-encodelob-utf8-fix

This is a slightly newer version of the patch that I am using on a production rt-4.2.9 instance. I will be happy if some part of it proves usable for the RT mainline.

Thanks for the fine software!
Cheers
--
Zito
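The length-comparison heuristic Zito describes above can be sketched as follows. This is a Python illustration of the idea, not RT's actual Perl code: encode the body both ways and keep whichever transfer encoding comes out shorter. Mostly-ASCII text yields a shorter quoted-printable form; mostly-binary data yields shorter base64.

```python
import base64
import quopri

def choose_encoding(body: bytes) -> str:
    """Pick the shorter of quoted-printable and base64 for the given bytes."""
    qp = quopri.encodestring(body)      # non-ASCII bytes become =XX triplets
    b64 = base64.encodebytes(body)      # every 3 input bytes become 4 output chars
    return "quoted-printable" if len(qp) < len(b64) else "base64"

# Mostly readable text with one stray byte: QP keeps the text legible and is shorter.
print(choose_encoding(b"plain ASCII text with one stray byte: \xff"))
# Dense binary data: almost every byte needs a 3-char QP escape, so base64 wins.
print(choose_encoding(bytes(range(256))))
```

As Alex notes, the cost is that large bodies get encoded twice (and one encoding is then discarded), which trades CPU time for stored size.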
Re: [rt-users] RT saves data in quoted-printable, why???
On Fri, 6 Mar 2015 00:06:32 +0100 Václav Ovsík wrote:
> https://issues.bestpractical.com/Ticket/Display.html?id=29735

Aha -- thanks for digging that out!  I thought I vaguely recalled something in this area previously.
https://issues.bestpractical.com/Ticket/Attachment/286095/157750/utf8-encoding.patch
looks to be functionally fairly similar to the branch.

There are a few other, orthogonal fixes in there that may still be interesting to tease out into their own commits.  It looks like I see changes to:

 * Fix the computed max size of base64'd attachments; I'd need to
   squint at it harder, but seems eminently reasonable.

 * Attempt to gracefully deal with TruncateLongAttachments truncating
   mid-byte of UTF-8 data.  As above; the decode/encode is an interesting
   trick to attempt to ensure that the byte stream is consistent.  I'd
   like to test it a bit, but seems not unreasonable.

 * Choose base64 vs QP based on which is shorter; I'm less convinced by
   this, since it means that for large data, it gets QP'd, base64'd, and
   then one of those _again_ -- which isn't terribly efficient.  I'm less
   convinced by the tradeoff of computation time to stored in-database
   size.

If you're interested in reworking the patch into a 2-3 commit series, I'm happy to apply for 4.2-trunk.
 - Alex
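The decode/encode trick for TruncateLongAttachments mentioned above can be illustrated in Python (this is a sketch of the idea, not RT's Perl implementation): cutting a UTF-8 byte stream at an arbitrary byte offset can split a multi-byte character, and a lenient decode/encode round-trip drops the dangling partial sequence so the stored bytes remain valid UTF-8.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Truncate to at most max_bytes without leaving a partial UTF-8 sequence."""
    truncated = data[:max_bytes]
    # errors="ignore" silently discards the incomplete trailing sequence
    # (it would also drop invalid bytes elsewhere in the stream).
    return truncated.decode("utf-8", errors="ignore").encode("utf-8")

body = "naïve".encode("utf-8")   # b'na\xc3\xafve' -- 'ï' is the 2-byte pair \xc3\xaf
print(truncate_utf8(body, 3))    # b'na' -- the orphaned \xc3 byte is dropped
print(truncate_utf8(body, 4))    # b'na\xc3\xaf' -- a clean cut keeps the full 'ï'
```

As the thread notes, round-tripping the whole body is not the cheapest possible approach (one could instead scan backwards for a non-continuation byte), but it is simple and safe.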
Re: [rt-users] RT saves data in quoted-printable, why???
https://issues.bestpractical.com/Ticket/Display.html?id=29735
--
Zito
Re: [rt-users] RT saves data in quoted-printable, why???
On Thu, 5 Mar 2015 14:31:56 -0500 Alex Vandiver wrote:
> I _believe_ that this code may be replaced by (untested):
>
>     } elsif (!$RT::Handle->BinarySafeBLOBs
>              && !eval { Encode::decode( "UTF-8", $Body, Encode::FB_CROAK ); 1 } ) {
>         $ContentEncoding = 'quoted-printable';
>     }
>
> I may push a branch later that makes this change, and see what breaks.

https://github.com/bestpractical/rt/compare/4.2-trunk...4.2%2Fless-qp
 - Alex
Re: [rt-users] RT saves data in quoted-printable, why???
On Thu, 5 Mar 2015 18:50:57 +0100 Palle Girgensohn wrote:
> * Ensure that all data containing non-ASCII is quoted-printable
>   encoded for PostgreSQL, instead of merely all data not claiming to be
>   "text/plain"
>
> Why is this? It seems it is doing more harm than good -- it makes it
> harder to use indexes and makes it harder to use it from the tables
> directly? PostgreSQL is very capable of storing any kind of character
> set? We use UTF-8 for everything, this fix breaks a lot of things for
> us.

The commit in question is 3a9c38ed, which changes:

    } elsif (!$RT::Handle->BinarySafeBLOBs
             && $MIMEType !~ /text\/plain/gi
             && !Encode::is_utf8( $Body, 1 ) ) {
        $ContentEncoding = 'quoted-printable';
    }

...to:

    } elsif (!$RT::Handle->BinarySafeBLOBs
             && $Body =~ /\P{ASCII}/
             && !Encode::is_utf8( $Body, 1 ) ) {
        $ContentEncoding = 'quoted-printable';
    }

That is, before the change, any data which claimed to be "text/plain" was blindly shoved into the database; this included content containing _invalid_ UTF-8 byte sequences. The commit was in RT 4.0; RT 4.2 is much better about transcoding to real UTF-8, or marking the part as "application/octet-stream" if it cannot be decoded. In that sense, this logic is now less necessary. However, the commit was absolutely necessary at the time to not _lose_ data, and erring on the side of data preservation, at the expense of possibly-unnecessary encoding, seems like not an unreasonable choice.

That being said, the Encode::is_utf8() check attempts to verify that we only quoted-printable encode data whose bytes are not already a correctly-encoded UTF-8 byte stream. The bug now lies there: after the "utf8-reckoning" branch (2620658..af9fe7c), the $Body argument is guaranteed to contain bytes, not characters. Per the Encode::is_utf8() documentation:

    [INTERNAL] Tests whether the UTF8 flag is turned on in the STRING.
    If CHECK is true, also checks whether STRING contains well-formed
    UTF-8.  Returns true if successful, false otherwise.
The UTF8 flag is almost certainly off on $Body (it _may_ be on, and still contain only codepoints < 128, but this is unlikely), so the well-formedness of the string (which is the relevant thing we wish to confirm) is never checked.

I _believe_ that this code may be replaced by (untested):

    } elsif (!$RT::Handle->BinarySafeBLOBs
             && !eval { Encode::decode( "UTF-8", $Body, Encode::FB_CROAK ); 1 } ) {
        $ContentEncoding = 'quoted-printable';
    }

I may push a branch later that makes this change, and see what breaks.

All of that aside, note that "it makes it harder to use indexes" makes little sense -- there are no indexes on Content. The full-text index uses its own tsvector, which it constructs after decoding the Content, so the transfer encoding of the Content column is irrelevant to that.
 - Alex
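The proposed Perl check (a strict Encode::decode wrapped in an eval) amounts to asking "are these bytes well-formed UTF-8?" -- exactly the signal that Encode::is_utf8($Body, 1) fails to provide once $Body holds raw bytes. A Python analogue, for illustration only:

```python
def is_well_formed_utf8(body: bytes) -> bool:
    """Strict decoding raises on malformed input, like Encode::FB_CROAK."""
    try:
        body.decode("utf-8")   # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

# Well-formed UTF-8 can be stored as-is; malformed bytes get QP-encoded.
content_encoding = "none" if is_well_formed_utf8(b"ok \xc3\xa9") else "quoted-printable"
print(content_encoding)                   # none
print(is_well_formed_utf8(b"bad \xff"))   # False
```

The key difference from a flag check is that the decode attempt inspects the byte stream itself, so it gives the right answer regardless of how the string was produced upstream.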
[rt-users] RT saves data in quoted-printable, why???
http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/net/QuotedPrintableCodec.html

In the release notes for RT 4.2.2 [https://www.bestpractical.com/release-notes/rt/4.2.2]:

* Ensure that all data containing non-ASCII is quoted-printable
  encoded for PostgreSQL, instead of merely all data not claiming to be
  "text/plain"

Why is this? It seems it is doing more harm than good -- it makes it harder to use indexes and makes it harder to use it from the tables directly? PostgreSQL is very capable of storing any kind of character set? We use UTF-8 for everything, this fix breaks a lot of things for us.

I think this was a bad design decision, but maybe I can't see the entire picture?

Best regards,
Palle
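The behavioral difference the release note describes can be sketched in Python (hypothetical helper names for illustration; the real Perl condition also consults BinarySafeBLOBs and an Encode::is_utf8 guard, omitted here for brevity): the old test keyed on the part's claimed MIME type, so invalid bytes in a "text/plain" part were stored raw, while the new test keys on the bytes themselves.

```python
def old_needs_qp(mime_type: str, body: bytes) -> bool:
    # Old rule: only parts NOT claiming to be text/plain were considered for QP.
    return "text/plain" not in mime_type.lower()

def new_needs_qp(mime_type: str, body: bytes) -> bool:
    # New rule: any body containing a non-ASCII byte is considered,
    # mirroring the Perl \P{ASCII} match.
    return any(b > 127 for b in body)

body = b"claims to be text, but holds an invalid UTF-8 byte: \xff"
print(old_needs_qp("text/plain", body))  # False -- stored raw, data at risk
print(new_needs_qp("text/plain", body))  # True  -- quoted-printable encoded
```

Pure-ASCII bodies are unaffected by the change either way; only parts carrying bytes above 127 start getting the extra encoding.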