On Thu, Feb 2, 2012 at 6:58 PM, Jeremy Evans <[email protected]> wrote:
> On Feb 2, 7:29 am, Christian MICHON <[email protected]>
> wrote:
>> > So I believe there might be a mismatch between what the logger
>> > reported and what truly was inserted in H2.
>>
>> I did some more experiments just now, testing:
>> - jdbc + H2: FAIL
>> - jdbc + sqlite (sqlitejdbc-v056-pure.jar): FAIL
>> - using MRI, sequel (3.31.0) and sqlite3 (1.3.4 x86-mingw32): PASS
>>
>> So I believe it would be either a jruby issue or a jdbc issue (I mean
>> the common jdbc subpart in sequel).
>>
>> I can revert to older version of jruby to see if this is a regression
>> or not, but for the jdbc subpart, I'm not yet familiar enough with the
>> code).
>
> Sequel can only report what ruby string is sent to the SQL statement.
> Apparently, JDBC (or something below it) monkeys with the string.  I
> set TRACE_LEVEL_FILE=2 when connecting to H2 and when inserting "f
> \224tter", H2 is showing the inserted value as 'f\ufffdtter'.  When
> retrieving the value, it is returned as "f\357\277\275tter".  That's
> UTF-8, as I expected:
>
>  "\u{fffd}".encode('UTF-8').force_encoding('BINARY').split('').map{|c| "%o" % c.ord}
>  => ["357", "277", "275"]
>
> Basically, H2 is storing unicode.  Now, unicode codepoint fffd is the
> "replacement character".  My guess is that H2 doesn't understand the
> \224 character, and instead of raising an exception, silently replaces
> it with garbage (the replacement character).  H2 does have a file
> encoding setting (http://www.h2database.com/javadoc/org/h2/constant/
> SysProperties.html#file.encoding), but supposedly it defaults to
> cp1252, which should map \224 to unicode codepoint 201d (in ISO-8859-1
> and ISO-8859-15 it maps to unicode codepoint 0094).
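The cp1252-vs-Latin-1 mapping described above is easy to check from Ruby itself; a minimal sketch, assuming Ruby 1.9+ with its built-in Encoding support:

```ruby
# Byte \224 (0x94) decodes differently depending on the assumed encoding.
byte = "\x94".dup.force_encoding("Windows-1252")
puts byte.encode("UTF-8").codepoints.map { |c| "%04x" % c }.inspect
# => ["201d"]  (right double quotation mark, as cp1252 maps it)

byte = "\x94".dup.force_encoding("ISO-8859-1")
puts byte.encode("UTF-8").codepoints.map { |c| "%04x" % c }.inspect
# => ["0094"]  (a C1 control character)
```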
>
> This may be a general issue in JRuby, see 
> http://jira.codehaus.org/browse/JRUBY-2688.
> It was closed, but apparently not for a good reason.
>
> To work around your issue, I recommend the following:
>
>  require 'iconv'
>  class String
>    def to_utf8
>      Iconv.iconv("UTF8", "ISO-8859-15", self).first
>    end
>    def from_utf8
>      Iconv.iconv("ISO-8859-15", "UTF8", self).first
>    end
>  end
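On Ruby 1.9 and later, where Iconv is deprecated (and removed entirely from the stdlib in 2.0), the same pair of helpers can be written with String#encode; a sketch, keeping the same ISO-8859-15 assumption:

```ruby
class String
  # Re-encode an ISO-8859-15 string as UTF-8 for the database.
  def to_utf8
    encode("UTF-8", "ISO-8859-15")
  end

  # Re-encode a UTF-8 string back to ISO-8859-15 for the application.
  def from_utf8
    encode("ISO-8859-15", "UTF-8")
  end
end

s = "f\x94tter".dup.force_encoding("ISO-8859-15")
u = s.to_utf8
puts u.encoding        # => UTF-8
puts u.from_utf8 == s  # => true
```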
>
> Then you can do:
>
>  o = Word.new
>  o.text = "f\224tter".to_utf8
>  o.save
>  o.first.text.from_utf8
>
> In terms of making the integration transparent, I would add an
> after_initialize hook for existing records ("unless new?") that
> calls from_utf8 on all text values, making sure values are
> converted from UTF-8 to ISO-8859-15 when they are retrieved from
> the database.  For fixing values going into the database, you could
> use a before_save hook that calls to_utf8 on all text values, and an
> after_save hook that calls from_utf8.  That should work in most
> cases, though it probably won't handle error conditions correctly.
> For that you'd need around_save, or you could handle things at a
> lower level, such as by overriding literal_string:
>
>  Word.dataset_module do
>    def literal_string(s)
>      super(s.to_utf8)
>    end
>  end
>
> Overriding literal_string might have other consequences, but it may be
> the best way to handle things.
>
> Anyway, hope this gives you some useful information.
>

It does, and it's appreciated.

Yet the database now contains something that can only be read by my
jruby app, because the content is not truly encoded as expected. If I
use the H2 console inside a web browser, I do not get the right
content.

I'll investigate more. In the end, I may replace the offending
characters, if any, with proper unicode characters expected by H2.
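If the offending bytes really are cp1252, one way to do that replacement before the data reaches Sequel is a small helper like the one below. This is a sketch: the sanitize_to_utf8 name is made up, and the Windows-1252 assumption comes from Jeremy's analysis above.

```ruby
# Decode suspect input as Windows-1252 and re-encode it as valid UTF-8,
# replacing anything that can't be mapped instead of letting H2 do it.
def sanitize_to_utf8(s)
  s.dup.force_encoding("Windows-1252")
   .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

puts sanitize_to_utf8("f\x94tter")  # \224 becomes the curly quote U+201D
```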

Thanks

-- 
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"sequel-talk" group.
