Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-06 Thread Václav Ovsík
Hi,

On Thu, Mar 05, 2015 at 06:37:21PM -0500, Alex Vandiver wrote:
> On Fri, 6 Mar 2015 00:06:32 +0100 Václav Ovsík 
> wrote:
> > https://issues.bestpractical.com/Ticket/Display.html?id=29735
> 
> Aha -- thanks for digging that out!  I thought I vaguely recalled
> something in this area previously.
> https://issues.bestpractical.com/Ticket/Attachment/286095/157750/utf8-encoding.patch
> looks to be functionally fairly similar to the branch.

Thanks for attention to this...

> There are a few other, orthogonal fixes in there that may still be
> interesting to tease out into their own commits.  It looks like I see
> changes to:
> 
>  * Fix the computed max size of base64'd attachments; I'd need to
> squint at it harder, but seems eminently reasonable.
> 
>  * Attempt to gracefully deal with TruncateLongAttachments truncating
> mid-byte of UTF-8 data.  As above; the decode/encode is an interesting
> trick to attempt to ensure that the byte stream is consistent.  I'd
> like to test it a bit, but seems not unreasonable.

It is not too efficient maybe, but easy and safety first :)

>  * Choose base64 vs QP based on which is shorter; I'm less convinced by
> this, since it means that for large data, it gets QP'd, base64'd, and
> then one of those _again_ -- which isn't terribly efficient.  I'm less
> convinced by the tradeoff of computation time to stored in-database
> size.

You are right. My intention was to gather as much readable text as
possible. Maybe a text contains some invalid characters, but the rest
of the text is readable, so QP is more appropriate, because it leaves
the most of a text readable.
So the measuring of length of an encoded data Base64/QP gives a result of
how much ASCII chars are there.
 len Base64 < len QP - many binary data - maybe some octet stream
 len QP < len Base64 - many ASCII chars - maybe the text

But this is corner case probably and it is not very interesting.
The most of the text should be UTF-8 valid and the rest is not
interesting these days.

> If you're interested in reworking the patch into a 2-3 commit series,
> I'm happy to apply for 4.2-trunk.
>  - Alex

https://github.com/bestpractical/rt/compare/stable...zito:4.2-zito-encodelob-utf8-fix
This is a bit newer version I'm using within production instance rt-4.2.9.
I will be happy if some part will be usable for RT mainline.

Thanks for fine software!
Cheers
-- 
Zito


Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Alex Vandiver
On Fri, 6 Mar 2015 00:06:32 +0100 Václav Ovsík 
wrote:
> https://issues.bestpractical.com/Ticket/Display.html?id=29735

Aha -- thanks for digging that out!  I thought I vaguely recalled
something in this area previously.
https://issues.bestpractical.com/Ticket/Attachment/286095/157750/utf8-encoding.patch
looks to be functionally fairly similar to the branch.

There are a few other, orthogonal fixes in there that may still be
interesting to tease out into their own commits.  It looks like I see
changes to:

 * Fix the computed max size of base64'd attachments; I'd need to
squint at it harder, but seems eminently reasonable.

 * Attempt to gracefully deal with TruncateLongAttachments truncating
mid-byte of UTF-8 data.  As above; the decode/encode is an interesting
trick to attempt to ensure that the byte stream is consistent.  I'd
like to test it a bit, but seems not unreasonable.

 * Choose base64 vs QP based on which is shorter; I'm less convinced by
this, since it means that for large data, it gets QP'd, base64'd, and
then one of those _again_ -- which isn't terribly efficient.  I'm less
convinced by the tradeoff of computation time to stored in-database
size.

If you're interested in reworking the patch into a 2-3 commit series,
I'm happy to apply for 4.2-trunk.
 - Alex


Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Václav Ovsík
https://issues.bestpractical.com/Ticket/Display.html?id=29735
-- 
Zito


Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Alex Vandiver
On Thu, 5 Mar 2015 14:31:56 -0500 Alex Vandiver
 wrote:
> I _believe_ that this code may be replaced by (untested):
> 
> } elsif (!$RT::Handle->BinarySafeBLOBs
>   && !eval { Encode::decode( "UTF-8", $Body,
> Encode::FB_CROAK); 1 } ) {
> $ContentEncoding = 'quoted-printable';
> }
> 
> I may push a branch later that makes this change, and see what breaks.

https://github.com/bestpractical/rt/compare/4.2-trunk...4.2%2Fless-qp

 - Alex


Re: [rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Alex Vandiver
On Thu, 5 Mar 2015 18:50:57 +0100 Palle Girgensohn
> * Ensure that all data containing non-ASCII is quoted-printable
> encoded for PostgreSQL, instead of merely all data not claiming to be
>"text/plain"
> 
> Why is this? It seems it is doing more harm than good -- it makes it
> harder to use indexes and makes it harder to use it from the tables
> directly? PostgreSQL is very capable of storing any kind of character
> set? We use UTF-8 for everything, this fix breaks a lot of things for
> us.

The commit in question is 3a9c38ed, which changes:

} elsif (!$RT::Handle->BinarySafeBLOBs
  && $MIMEType !~ /text\/plain/gi
  && !Encode::is_utf8( $Body, 1 ) ) {
  $ContentEncoding = 'quoted-printable';
}

...to:

} elsif (!$RT::Handle->BinarySafeBLOBs
  && $Body =~ /\P{ASCII}/
  && !Encode::is_utf8( $Body, 1 ) ) {
  $ContentEncoding = 'quoted-printable';
}

That is, any data which claimed to be "text/plain" would blindly be
attempted to be shoved into the database; this includes content which
contained _invalid_ UTF-8 byte sequences.  The commit was in RT 4.0; RT
4.2 is much better about transcoding to real UTF-8, or marking the part
as "application/octet-stream" if it cannot be decoded.  In that sense,
this logic is now less necessary.  However, the commit was absolutely
necessary at the time to not _lose_ data.  Erring on the side of data
preservation, at the expense of possibly-unnecessary encoding, seems
like not an unreasonable choice.

That being said, the Encode::is_utf8() check attempts to verify that we
only quoted-printable encode data whose bytes are not currently a
correctly-encoded UTF-8 byte stream.  The bug now lies there, as after
the "utf8-reckoning" branch (2620658..af9fe7c), the $Body argument is
guaranteed to contain bytes, not characters.  Per the Encode::is_utf8()
documentation:

[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If
CHECK is true, also checks whether STRING contains well-formed UTF-8.
Returns true if successful, false otherwise.

The UTF8 flag is almost certainly off on $Body (it _may_ be on, and
still contain only codepoints < 128, but this is unlikely), so the
well-formed-ness of the string (which is the relevant thing we wish to
confirm) is never checked.

I _believe_ that this code may be replaced by (untested):

} elsif (!$RT::Handle->BinarySafeBLOBs
  && !eval { Encode::decode( "UTF-8", $Body, Encode::FB_CROAK); 1 } 
) {
  $ContentEncoding = 'quoted-printable';
}

I may push a branch later that makes this change, and see what breaks.



All of that aside, note that "it makes it harder to use indexes" makes
little sense -- there are no indexes on Content.  The full-text index
uses its own tsvector, which it constructs after decoding the Content,
so the transfer encoding of the Content column is irrelevant to that.

 - Alex


[rt-users] RT saves data in quoted-printable, why???

2015-03-05 Thread Palle Girgensohn
http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/net/QuotedPrintableCodec.html


in the releasen note for RT-4.2.2 
[https://www.bestpractical.com/release-notes/rt/4.2.2]

* Ensure that all data containing non-ASCII is quoted-printable encoded
   for PostgreSQL, instead of merely all data not claiming to be
   "text/plain"


Why is this? It seems it is doing more harm than good -- it makes it harder to 
use indexes and makes it harder to use it from the tables directly? PostgreSQL 
is very capable of storing any kind of character set? We use UTF-8 for 
everything, this fix breaks a lot of things for us.

I think this was a bad design decision, but maybe I can't see the entire 
picture?

Best regards,
Palle



signature.asc
Description: Message signed with OpenPGP using GPGMail