On Feb 2, 7:29 am, Christian MICHON <[email protected]>
wrote:
> > So I believe there might be a mismatch between what the logger
> > reported and what truly was inserted in H2.
>
> I did some more experiments just now, testing:
> - jdbc + H2: FAIL
> - jdbc + sqlite (sqlitejdbc-v056-pure.jar): FAIL
> - using MRI, sequel (3.31.0) and sqlite3 (1.3.4 x86-mingw32): PASS
>
> So I believe it would be either a jruby issue or a jdbc issue (I mean
> the common jdbc subpart in sequel).
>
> I can revert to older version of jruby to see if this is a regression
> or not, but for the jdbc subpart, I'm not yet familiar enough with the
> code).

Sequel can only report what ruby string is sent to the SQL statement.
Apparently, JDBC (or something below it) monkeys with the string.  I
set TRACE_LEVEL_FILE=2 when connecting to H2 and when inserting "f
\224tter", H2 is showing the inserted value as 'f\ufffdtter'.  When
retrieving the value, it is returned as "f\357\277\275tter".  That's
UTF-8, as I expected:

  "\u{fffd}".encode('UTF-8').force_encoding('BINARY').split('').map{|c| "%o" % c.ord}
  => ["357", "277", "275"]

Basically, H2 is storing unicode.  Now, unicode codepoint fffd is the
"replacement character".  My guess is that H2 doesn't understand the
\224 character, and instead of raising an exception, silently replaces
it with garbage (the replacement character).  H2 does have a file
encoding setting (http://www.h2database.com/javadoc/org/h2/constant/
SysProperties.html#file.encoding), but supposedly it defaults to
cp1252, which should map \224 to unicode codepoint 201d (in ISO-8859-1
and ISO-8859-15 it maps to unicode codepoint 0094).
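As a quick sanity check of those mappings (my own sketch, not part of the original exchange), Ruby 1.9's String#encode agrees:

```ruby
# \x94 interpreted as Windows-1252 becomes U+201D (right double
# quotation mark), while interpreted as ISO-8859-1 it stays the
# C1 control character U+0094.
cp1252 = "f\x94tter".force_encoding('Windows-1252').encode('UTF-8')
latin1 = "f\x94tter".force_encoding('ISO-8859-1').encode('UTF-8')
puts '%04x' % cp1252.codepoints[1]   # => 201d
puts '%04x' % latin1.codepoints[1]   # => 0094
```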

This may be a general issue in JRuby, see 
http://jira.codehaus.org/browse/JRUBY-2688.
It was closed, but apparently not for a good reason.

To work around your issue, I recommend the following:

  require 'iconv'
  class String
    def to_utf8
      Iconv.iconv("UTF8", "ISO-8859-15", self).first
    end
    def from_utf8
      Iconv.iconv("ISO-8859-15", "UTF8", self).first
    end
  end
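As an aside, Iconv was deprecated in ruby 1.9 and later removed from the standard library, so on 1.9+ you can write equivalent helpers with String#encode (a sketch, same intent as the Iconv version above):

```ruby
class String
  # Reinterpret the raw bytes as ISO-8859-15 and transcode to UTF-8.
  def to_utf8
    dup.force_encoding('ISO-8859-15').encode('UTF-8')
  end

  # Reinterpret as UTF-8 and transcode back to ISO-8859-15.
  def from_utf8
    dup.force_encoding('UTF-8').encode('ISO-8859-15')
  end
end
```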

Then you can do:

  o = Word.new
  o.text = "f\224tter".to_utf8
  o.save
  Word.first.text.from_utf8

In terms of making the integration transparent, I would add an
after_initialize hook for existing records ("unless new?") that
calls from_utf8 on all text values; that will make sure the values
are converted from UTF-8 to ISO-8859-15 when retrieving them from the
database.  For fixing values going into the database, you could use a
before_save hook that calls to_utf8 on all text values, and an
after_save hook that calls from_utf8.  That should work in most
cases, though it probably won't handle error conditions correctly.
You'd need to use around_save to handle error conditions correctly,
or handle things at a lower level, such as overriding literal_string:

  Word.dataset_module do
    def literal_string(s)
      super(s.to_utf8)
    end
  end

Overriding literal_string might have other consequences, but it may be
the best way to handle things.
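To make the hook-based approach concrete, here's a sketch of the lifecycle using the Sequel hook names (after_initialize, before_save, after_save).  The Model base class below is a minimal stand-in for Sequel::Model, not Sequel itself, and String#encode is used in place of Iconv so the example is self-contained:

```ruby
# Minimal stand-in for Sequel::Model, just enough to show when each
# hook fires; a real Sequel model would inherit these hooks instead.
class Model
  attr_accessor :text
  attr_reader :stored

  def initialize(text = nil, new_record = true)
    @text = text
    @new = new_record
    after_initialize
  end

  def new?; @new; end

  def save
    before_save
    @stored = @text   # stand-in for the actual INSERT/UPDATE
    @new = false
    after_save
  end

  def after_initialize; end
  def before_save; end
  def after_save; end
end

class Word < Model
  # Rows loaded from the database come back as UTF-8; convert them
  # to ISO-8859-15 for in-memory use.
  def after_initialize
    self.text = text.force_encoding('UTF-8').encode('ISO-8859-15') unless new? || text.nil?
  end

  # Convert to UTF-8 just before the value goes to the database...
  def before_save
    self.text = text.force_encoding('ISO-8859-15').encode('UTF-8')
  end

  # ...and back afterwards, so the object keeps its original encoding.
  def after_save
    self.text = text.force_encoding('UTF-8').encode('ISO-8859-15')
  end
end
```

The point of the paired before_save/after_save is that the UTF-8 version only exists for the duration of the save; the object the caller holds stays in ISO-8859-15 throughout.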

Anyway, hope this gives you some useful information.

Jeremy

-- 
You received this message because you are subscribed to the Google Groups 
"sequel-talk" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/sequel-talk?hl=en.
