On Thu, Feb 2, 2012 at 6:58 PM, Jeremy Evans wrote:
> On Feb 2, 7:29 am, Christian MICHON wrote:
>> > So I believe there might be a mismatch between what the logger
>> > reported and what truly was inserted in H2.
>>
>> I did some more experiments just now, testing:
>> - jdbc + H2: FAIL
>> - jdbc + sqlite (sqlitejdbc-v056-pure.jar): FAIL
>> - using MRI, sequel (3.31.0) and sqlite3 (1.3.4 x86-mingw32): PASS
>>
>> So I believe it would be either a jruby issue or a jdbc issue (I mean
>> the common jdbc subpart in sequel).
>>
>> I can revert to older version of jruby to see if this is a regression
>> or not, but for the jdbc subpart, I'm not yet familiar enough with the
>> code).
>
> Sequel can only report what ruby string is sent to the SQL statement.
> Apparently, JDBC (or something below it) monkeys with the string. I
> set TRACE_LEVEL_FILE=2 when connecting to H2 and when inserting
> "f\224tter", H2 is showing the inserted value as 'f\ufffdtter'. When
> retrieving the value, it is returned as "f\357\277\275tter". That's
> UTF-8, as I expected:
>
>   "\u{fffd}".encode('UTF-8').force_encoding('BINARY').split('').map{|c| "%o" % c.ord}
>   => ["357", "277", "275"]
>
> Basically, H2 is storing unicode. Now, unicode codepoint fffd is the
> "replacement character". My guess is that H2 doesn't understand the
> \224 character, and instead of raising an exception, silently replaces
> it with garbage (the replacement character). H2 does have a file
> encoding setting
> (http://www.h2database.com/javadoc/org/h2/constant/SysProperties.html#file.encoding),
> but supposedly it defaults to cp1252, which should map \224 to unicode
> codepoint 201d (in ISO-8859-1 and ISO-8859-15 it maps to unicode
> codepoint 0094).
>
> This may be a general issue in JRuby, see
> http://jira.codehaus.org/browse/JRUBY-2688.
> It was closed, but apparently not for a good reason.
>
> To work around your issue, I recommend the following:
>
>   require 'iconv'
>   class String
>     def to_utf8
>       Iconv.iconv("UTF8", "ISO-8859-15", self).first
>     end
>     def from_utf8
>       Iconv.iconv("ISO-8859-15", "UTF8", self).first
>     end
>   end
>
> Then you can do:
>
>   o = Word.new
>   o.text = "f\224tter".to_utf8
>   o.save
>   Word.first.text.from_utf8
>
> In terms of making the integration transparent, I would add an
> after_initialize hook for existing records ("unless new?") that
> calls from_utf8 on all text values. That will make sure the values
> are converted from UTF-8 to ISO-8859-15 when retrieving them from the
> database. For fixing values going into the database, you could use a
> before_save hook that did to_utf8 on all text values, and an
> after_save that did from_utf8. That should work in most cases, though
> it probably won't handle error conditions correctly. You'd need to
> use around_save to handle error conditions correctly, or handle things
> at a lower level such as overriding literal_string:
>
>   Word.dataset_module do
>     def literal_string(s)
>       super(s.to_utf8)
>     end
>   end
>
> Overriding literal_string might have other consequences, but it may be
> the best way to handle things.
>
> Anyway, hope this gives you some useful information.
>
It does, and it's appreciated.

However, what now ends up in the database can only be read by my JRuby
app, because the content is not really encoded as expected. If I use
the H2 console in a web browser, I do not see the right content.

I'll investigate more. In the end, I may replace the offending
characters, if any, with the proper Unicode characters expected by H2.

Thanks

--
Christian
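
A follow-up sketch for anyone reading this thread later: on Ruby 1.9 or newer the same conversions can be done with String#encode instead of Iconv (Iconv was deprecated in 1.9.3 and removed from the standard library in 2.0). The helper names to_utf8 and from_utf8 are borrowed from the workaround quoted above, but here they are plain methods taking the encoding as an argument, which is my own assumption about a convenient shape, not anything Sequel provides. The source encodings tried (Windows-1252, ISO-8859-1) are the ones mentioned in the thread; which one is actually right for the data is not something this snippet can decide. String#scrub needs Ruby 2.1 or newer.

```ruby
# The raw single-byte string from the thread: "f", byte 0x94 (octal 224), "tter".
raw = "f\224tter".b

# Hypothetical helper: interpret raw bytes in a given source encoding
# and transcode them to UTF-8.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Hypothetical helper: transcode UTF-8 text back to raw bytes in a target encoding.
def from_utf8(text, target_encoding)
  text.encode(target_encoding).force_encoding("BINARY")
end

as_cp1252 = to_utf8(raw, "Windows-1252")  # byte 0x94 becomes U+201D
as_latin1 = to_utf8(raw, "ISO-8859-1")    # byte 0x94 becomes U+0094

# Treating the raw bytes as UTF-8 and scrubbing invalid sequences yields
# U+FFFD, the replacement character that showed up in the H2 trace.
lossy = raw.dup.force_encoding("UTF-8").scrub

# Octal bytes of the replacement character, as seen when reading the row back.
octal = lossy[1].bytes.map { |b| "%o" % b }

puts as_cp1252.inspect
puts as_latin1.inspect
puts lossy.inspect
puts octal.inspect
```

Transcoding before the string reaches JDBC keeps the stored value readable by other clients such as the H2 console, provided the source encoding passed in matches how the bytes were really produced. Once a byte has been replaced by U+FFFD, the original character cannot be recovered from the stored value.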
