Re: resource encoding troubles

Sven Meier Sat, 20 Sep 2014 15:52:13 -0700

Hi Garret,

I'm glad you found the culprit. Thanks for keeping us updated, we all learn something new each day.


Have fun
Sven


On 09/20/2014 10:28 PM, Garret Wilson wrote:

Hahahaha! I found the problem!
When I looked at the HomePage.properties file in a hex editor, I was looking at the HomePage.properties file in my source tree. But remember that this file isn't the one that Wicket loads! After a Maven build, Wicket will load the HomePage.properties file that Maven copies to the target directory!! (I should have paid closer attention to the URL used by URLConnection.) And sure enough, when I open that copied version of HomePage.properties, it contains the sequence EF BF BD! In other words, when Maven copied the HomePage.properties file from the source tree to the target directory, it must have opened it up as UTF-8, converting the A9 © character (not valid UTF-8) into EF BF BD, the UTF-8 sequence for U+FFFD, the Unicode replacement character. Thus when Wicket came along to read the file from the target directory, it (correctly) loaded it as ISO-8859-1, interpreting EF BF BD as three characters, ï¿½.
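The byte-level effect described above can be reproduced in a few lines of plain Java. This is only a sketch: the class name ReplacementCharDemo and its helper methods are hypothetical, and the decode/re-encode step merely stands in for whatever Maven does internally when it copies the file as UTF-8.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "EF BF BD".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Assumption: a copy "as UTF-8" amounts to decoding the bytes as UTF-8
    // and encoding the resulting text as UTF-8 again.
    static byte[] copyAsUtf8(byte[] source) {
        // Malformed input (such as a lone A9 byte) is replaced with U+FFFD.
        String decoded = new String(source, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0x3D, (byte) 0xA9}; // "=©" stored as ISO-8859-1
        byte[] copied = copyAsUtf8(source);
        System.out.println(hex(copied)); // 3D EF BF BD

        // Reading the copied bytes as ISO-8859-1 yields one character per byte.
        String seen = new String(copied, StandardCharsets.ISO_8859_1);
        System.out.println(seen.length()); // 4
    }
}
```

The four characters read back are "=" followed by U+00EF, U+00BF and U+00BD, which is the ï¿½ that shows up on the page.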
But why did Maven use UTF-8 when it copied my HomePage.properties source file to the target directory? Ummm... because I told it to, sort of:
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <build>
    <resources>
      <resource>
        <directory>src/main/resources</directory>
        <filtering>true</filtering>
        <includes>
          <include>**/*.properties</include>
        </includes>
      </resource>
    </resources>
  </build>
Apparently when Maven copies resources using filtering, it opens and parses them using the ${project.build.sourceEncoding} setting, which of course I had set to UTF-8. I probably need to set the "encoding" parameter of the maven-resources-plugin <http://maven.apache.org/plugins/maven-resources-plugin/copy-resources-mojo.html#encoding>.
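To see why the choice of encoding matters for a filtering copy, here is a rough simulation in plain Java. It is not Maven's actual code: the class FilteringCopySketch and the method copyAsText are hypothetical, and the assumption is simply that filtering reads the file as text in one charset and writes it back in the same charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FilteringCopySketch {
    // Hypothetical stand-in for a filtering copy: decode to chars, then re-encode.
    static byte[] copyAsText(byte[] source, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(source), charset);
             Writer writer = new OutputStreamWriter(out, charset)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and flushed) by the time we get here.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0x3D, (byte) 0xA9}; // "=©" stored as ISO-8859-1
        byte[] viaLatin1 = copyAsText(source, StandardCharsets.ISO_8859_1);
        byte[] viaUtf8 = copyAsText(source, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(source, viaLatin1)); // true
        System.out.println(Arrays.equals(source, viaUtf8));   // false
        System.out.println(viaUtf8.length);                   // 4
    }
}
```

Under that assumption, copying with ISO-8859-1 leaves the bytes untouched, while copying with UTF-8 turns the single A9 byte into the three-byte sequence for U+FFFD.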
Argg!! So much pain and agony for such a tiny mistake! But I'm glad I found it. I'll fix it... another day. Right now I'm going to grab some tequila and celebrate!!
Have a great rest of the weekend, everybody!

Garret

On 9/20/2014 4:14 PM, Garret Wilson wrote:
I'm finally able to trace the code, and this is getting very odd.
I use a hex editor, and the bytes in the properties file are ... 3D A9 ... (=©), just as I expect.
But when I trace through the Wicket code, the IsoPropertiesFilePropertiesLoader is using a UrlResourceStream which uses a URLConnection, which under the hood uses a BufferedInputStream to a FileInputStream. This in turn is wrapped in another BufferedInputStream. When the Properties class (from IsoPropertiesFilePropertiesLoader) parses the file, the internal Properties.LineReader reads into its inByteBuf variable the sequence ... 3D EF BF BD ...! As mentioned below, EF BF BD is the UTF-8 sequence for U+FFFD, which is the Unicode replacement character.
So it appears that the UrlResourceStream/URLConnection for the properties file is somewhere trying to open the stream as UTF-8. Therefore the A9 © character gets converted into the EF BF BD sequence before it even gets to the parser in IsoPropertiesFilePropertiesLoader/Properties!
But what would be causing the UrlResourceStream/URLConnection to default to UTF-8 when opening my properties file? This seems to be the answer that lies at the heart of this problem. Is there some Wicket or Java setting that is defaulting a URLConnection to use UTF-8 encoding? (As I mentioned above, the underlying input stream seems to be a FileInputStream wrapped in two layers of BufferedInputStream.)
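The parsing step can be checked in isolation, without Wicket. The sketch below relies only on java.util.Properties, whose load(InputStream) method reads its input as ISO-8859-1; the class PropertiesLoadSketch and its helpers are hypothetical names used for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesLoadSketch {
    // Loads the given file content and returns the "copyright" value.
    static String loadCopyright(byte[] fileBytes) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes)); // decodes as ISO-8859-1
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("copyright");
    }

    // Renders each char as U+XXXX so the result does not depend on the console encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // File as seen in the source tree: ... 3D A9
        byte[] good = "copyright=\u00A9".getBytes(StandardCharsets.ISO_8859_1);
        // File as seen by the parser: ... 3D EF BF BD
        byte[] bad = "copyright=\uFFFD".getBytes(StandardCharsets.UTF_8);

        System.out.println(codePoints(loadCopyright(good))); // U+00A9
        System.out.println(codePoints(loadCopyright(bad)));  // U+00EF U+00BF U+00BD
    }
}
```

So the Properties parser behaves consistently in both cases; the difference lies entirely in the bytes it is handed.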
Garret

On 8/29/2014 1:15 PM, Garret Wilson wrote:
Hi, all. Thanks Andrew for that attempt to reproduce this. I have verified this on Wicket 6.16.0 and 7.0.0-M2.
I have checked out the latest code from https://git-wip-us.apache.org/repos/asf/wicket.git . I was going to trace this down in the code, but then I was stopped in my tracks with an Eclipse m2e bug <https://bugs.eclipse.org/bugs/show_bug.cgi?id=371618> that won't even let me clean/compile the project. Argg!! Always something, huh?
But I did start looking in the code. IsoPropertiesFileLoader looks completely OK; it uses Properties.load(InputStream), and the file even indicates that the input encoding must be ISO-8859-1. Not much could go wrong there. I back-referenced the calls up the chain to WicketMessageTagHandler.onComponentTag(Component, ComponentTag), and it looks straightforward there---but that's for message tags, not message body.
I investigated downwards from WicketMessageResolver.resolve(...) (which I presume is what is at play here), which has this code:
 MessageContainer label = new MessageContainer(id, messageKey);
The MessageContainer.onComponentTagBody(...) simply looks up the value and calls renderMessage(), which in turn does some complicated ${var} replacement using MapVariableInterpolator and then writes out the result using getResponse().write(text). Unless MapVariableInterpolator messes up the value during variable replacement (but there are no variables to replace in this situation), then on the surface everything looks OK.
So I decided to do an experiment; I changed the HTML to this:

 This a © copyright. <wicket:message key="copyright">dummy
 text</wicket:message>

And I changed the properties to this:

 copyright=This a © copyright.


Here is what was produced:

 This a © copyright. This a ï¿½ copyright.
So something is going on here in the generation of the included message, because as you can see the content from XML gets produced correctly. It turns out <http://stackoverflow.com/a/6367675/421049> that ï¿½ is what the UTF-8 sequence for U+FFFD (EF BF BD) looks like when read as ISO-8859-1; U+FFFD is the Unicode replacement character produced when an invalid UTF-8 sequence is encountered. And of course, the single byte A9 that represents the copyright symbol U+00A9 in ISO-8859-1 is not a valid UTF-8 sequence on its own, even though it is fine as part of ISO-8859-1.
So here is the problem: something is taking the string generated by the message (which was parsed correctly from the properties file) and writing it to the output stream, not in UTF-8 as it should, but in some other encoding. If I were to guess here, I would say that the embedded message is writing out in Windows cp1252 (more or less ISO-8859-1), which is my default encoding (which would explain why Andrew didn't see this, if his system is Linux and the default encoding happens to be UTF-8 for example). This seems incorrect to me; the embedded message should know that it is writing into a UTF-8 output stream and should use that instead of the system encoding.
Remember that I can't even compile the code because of an m2e bug, so all of this is highly conjectural, just from visually inspecting the code and doing a few experiments. But I have a hunch that if you switch to a machine that has a default system encoding that isn't UTF-8, you'll reproduce this issue. And I further predict that if you trace through the code, the embedded <wicket:message> tag is incorrectly injecting its contents using the system encoding rather than the entire output stream encoding (however that is configured in Wicket). Put another way, whatever is producing the bytes from the main HTML page is using UTF-8 (as it should), but whatever is taking the message tag output is spitting out its bytes using cp1252 or something similar.
As soon as I can get Eclipse to be happier with the Wicket build, I'll give you some more exact details. But I'll have to take a break and get back to my main work for a while---we're nearing a big deadline and I have some actual functionality to implement! :)
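For reference, the difference between the two encodings guessed at above is easy to inspect in plain Java. This is only an illustration of how the same character maps to different bytes: the class CharsetEncodingSketch is a hypothetical name, ISO-8859-1 is used as a stand-in for cp1252 (the two agree on the © character), and nothing here reflects how Wicket actually writes its response.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetEncodingSketch {
    // Formats bytes as space-separated uppercase hex, e.g. "C2 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9"; // the © character

        // The same character encodes to different bytes depending on the charset.
        System.out.println(hex(copyright.getBytes(StandardCharsets.UTF_8)));      // C2 A9
        System.out.println(hex(copyright.getBytes(StandardCharsets.ISO_8859_1))); // A9

        // Code that does not name a charset (e.g. String.getBytes() with no
        // argument) falls back to this platform-dependent default.
        System.out.println(Charset.defaultCharset());
    }
}
```

Passing the charset explicitly, as in the first two calls, removes the dependence on the machine's default.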
Thanks again for investigating, Andrew.

Garret

On 8/28/2014 8:22 PM, Andrew Geery wrote:
I created a Wicket quickstart (from
http://wicket.apache.org/start/quickstart.html) [this is Wicket 6.16.0] and
made two simple changes:

1) I created a HomePage.properties file, encoded as ISO-8859-1, with a
single line as per the example above: copyright=© 2014 Example, Inc.

2) I added a line to the HomePage.html file as per the example
above: <wicket:message key="copyright">©
Example</wicket:message>

The content is served as UTF-8 and the copyright symbol is rendered
correctly on the page.
It doesn't look like the problem is in Wicket (at least not in 6.16). I guess your next steps would be to verify that you get the same results and, assuming that you do, start removing things from your page that has the problem until you find an element that is causing the problem.

Thanks
Andrew
On Thu, Aug 28, 2014 at 5:38 PM, Garret Wilson <gar...@globalmentor.com>
wrote:
On 8/28/2014 12:08 PM, Sven Meier wrote:
...
My configuration, as far as I can tell, is correct.
 From what you've written, I'd agree.

You should create a quickstart. This will easily allow us to find a
possible bug.
Better than that, I'd like to trace down the bug, fix it, and file a
patch. But currently I'm blocked from working with Wicket on Eclipse
<https://issues.apache.org/jira/browse/WICKET-5649>.

Garret
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@wicket.apache.org
For additional commands, e-mail: users-h...@wicket.apache.org


