Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-06 Thread Václav Ovsík
Hi,

On Thu, Mar 05, 2015 at 06:37:21PM -0500, Alex Vandiver wrote:
 On Fri, 6 Mar 2015 00:06:32 +0100 Václav Ovsík vaclav.ov...@i.cz
 wrote:
  https://issues.bestpractical.com/Ticket/Display.html?id=29735
 
 Aha -- thanks for digging that out!  I thought I vaguely recalled
 something in this area previously.
 https://issues.bestpractical.com/Ticket/Attachment/286095/157750/utf8-encoding.patch
 looks to be functionally fairly similar to the branch.

Thanks for attention to this...

 There are a few other, orthogonal fixes in there that may still be
 interesting to tease out into their own commits.  It looks like I see
 changes to:
 
  * Fix the computed max size of base64'd attachments; I'd need to
 squint at it harder, but seems eminently reasonable.
 
  * Attempt to gracefully deal with TruncateLongAttachments truncating
 mid-byte of UTF-8 data.  As above; the decode/encode is an interesting
 trick to attempt to ensure that the byte stream is consistent.  I'd
 like to test it a bit, but seems not unreasonable.

It is not too efficient maybe, but easy and safety first :)

  * Choose base64 vs QP based on which is shorter; I'm less convinced by
 this, since it means that for large data, it gets QP'd, base64'd, and
 then one of those _again_ -- which isn't terribly efficient.  I'm less
 convinced by the tradeoff of computation time to stored in-database
 size.

You are right. My intention was to gather as much readable text as
possible. Maybe a text contains some invalid characters, but the rest
of the text is readable, so QP is more appropriate, because it leaves
the most of a text readable.
So the measuring of length of an encoded data Base64/QP gives a result of
how much ASCII chars are there.
 len Base64  len QP - many binary data - maybe some octet stream
 len QP  len Base64 - many ASCII chars - maybe the text

But this is corner case probably and it is not very interesting.
The most of the text should be UTF-8 valid and the rest is not
interesting these days.

 If you're interested in reworking the patch into a 2-3 commit series,
 I'm happy to apply for 4.2-trunk.
  - Alex

https://github.com/bestpractical/rt/compare/stable...zito:4.2-zito-encodelob-utf8-fix
This is a bit newer version I'm using within production instance rt-4.2.9.
I will be happy if some part will be usable for RT mainline.

Thanks for fine software!
Cheers
-- 
Zito


Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Alex Vandiver
On Thu, 5 Mar 2015 18:50:57 +0100 Palle Girgensohn
 * Ensure that all data containing non-ASCII is quoted-printable
 encoded for PostgreSQL, instead of merely all data not claiming to be
text/plain
 
 Why is this? It seems it is doing more harm than good -- it makes it
 harder to use indexes and makes it harder to use it from the tables
 directly? PostgreSQL is very capable of storing any kind of character
 set? We use UTF-8 for everything, this fix breaks a lot of things for
 us.

The commit in question is 3a9c38ed, which changes:

} elsif (!$RT::Handle-BinarySafeBLOBs
   $MIMEType !~ /text\/plain/gi
   !Encode::is_utf8( $Body, 1 ) ) {
  $ContentEncoding = 'quoted-printable';
}

...to:

} elsif (!$RT::Handle-BinarySafeBLOBs
   $Body =~ /\P{ASCII}/
   !Encode::is_utf8( $Body, 1 ) ) {
  $ContentEncoding = 'quoted-printable';
}

That is, any data which claimed to be text/plain would blindly be
attempted to be shoved into the database; this includes content which
contained _invalid_ UTF-8 byte sequences.  The commit was in RT 4.0; RT
4.2 is much better about transcoding to real UTF-8, or marking the part
as application/octet-stream if it cannot be decoded.  In that sense,
this logic is now less necessary.  However, the commit was absolutely
necessary at the time to not _lose_ data.  Erring on the side of data
preservation, at the expense of possibly-unnecessary encoding, seems
like not an unreasonable choice.

That being said, the Encode::is_utf8() check attempts to verify that we
only quoted-printable encode data whose bytes are not currently a
correctly-encoded UTF-8 byte stream.  The bug now lies there, as after
the utf8-reckoning branch (2620658..af9fe7c), the $Body argument is
guaranteed to contain bytes, not characters.  Per the Encode::is_utf8()
documentation:

[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If
CHECK is true, also checks whether STRING contains well-formed UTF-8.
Returns true if successful, false otherwise.

The UTF8 flag is almost certainly off on $Body (it _may_ be on, and
still contain only codepoints  128, but this is unlikely), so the
well-formed-ness of the string (which is the relevant thing we wish to
confirm) is never checked.

I _believe_ that this code may be replaced by (untested):

} elsif (!$RT::Handle-BinarySafeBLOBs
   !eval { Encode::decode( UTF-8, $Body, Encode::FB_CROAK); 1 } 
) {
  $ContentEncoding = 'quoted-printable';
}

I may push a branch later that makes this change, and see what breaks.



All of that aside, note that it makes it harder to use indexes makes
little sense -- there are no indexes on Content.  The full-text index
uses its own tsvector, which it constructs after decoding the Content,
so the transfer encoding of the Content column is irrelevant to that.

 - Alex


Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Alex Vandiver
On Thu, 5 Mar 2015 14:31:56 -0500 Alex Vandiver
ale...@bestpractical.com wrote:
 I _believe_ that this code may be replaced by (untested):
 
 } elsif (!$RT::Handle-BinarySafeBLOBs
!eval { Encode::decode( UTF-8, $Body,
 Encode::FB_CROAK); 1 } ) {
 $ContentEncoding = 'quoted-printable';
 }
 
 I may push a branch later that makes this change, and see what breaks.

https://github.com/bestpractical/rt/compare/4.2-trunk...4.2%2Fless-qp

 - Alex


Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Václav Ovsík
https://issues.bestpractical.com/Ticket/Display.html?id=29735
-- 
Zito


Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Alex Vandiver
On Fri, 6 Mar 2015 00:06:32 +0100 Václav Ovsík vaclav.ov...@i.cz
wrote:
 https://issues.bestpractical.com/Ticket/Display.html?id=29735

Aha -- thanks for digging that out!  I thought I vaguely recalled
something in this area previously.
https://issues.bestpractical.com/Ticket/Attachment/286095/157750/utf8-encoding.patch
looks to be functionally fairly similar to the branch.

There are a few other, orthogonal fixes in there that may still be
interesting to tease out into their own commits.  It looks like I see
changes to:

 * Fix the computed max size of base64'd attachments; I'd need to
squint at it harder, but seems eminently reasonable.

 * Attempt to gracefully deal with TruncateLongAttachments truncating
mid-byte of UTF-8 data.  As above; the decode/encode is an interesting
trick to attempt to ensure that the byte stream is consistent.  I'd
like to test it a bit, but seems not unreasonable.

 * Choose base64 vs QP based on which is shorter; I'm less convinced by
this, since it means that for large data, it gets QP'd, base64'd, and
then one of those _again_ -- which isn't terribly efficient.  I'm less
convinced by the tradeoff of computation time to stored in-database
size.

If you're interested in reworking the patch into a 2-3 commit series,
I'm happy to apply for 4.2-trunk.
 - Alex


[rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Palle Girgensohn
http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/net/QuotedPrintableCodec.html


in the releasen note for RT-4.2.2 
[https://www.bestpractical.com/release-notes/rt/4.2.2]

* Ensure that all data containing non-ASCII is quoted-printable encoded
   for PostgreSQL, instead of merely all data not claiming to be
   text/plain


Why is this? It seems it is doing more harm than good -- it makes it harder to 
use indexes and makes it harder to use it from the tables directly? PostgreSQL 
is very capable of storing any kind of character set? We use UTF-8 for 
everything, this fix breaks a lot of things for us.

I think this was a bad design decision, but maybe I can't see the entire 
picture?

Best regards,
Palle



signature.asc
Description: Message signed with OpenPGP using GPGMail