Hi there
I have been dealing with this problem for the last few days, and I am definitely making progress now. I found that the <%@ page %> tag actually works: the form receives the information correctly. The only problem now is that the characters, when converted to the regular Java character set, go in correctly, but the UTF-8 representation of these characters is incorrect. Obviously this is due to the fact that the characters I am entering do not have the same representation in UTF-8 as they do in ISO-8859-1.

So 亞乽乨亪仮 gets converted to 亞乽乨亪仮 in UTF-8. I understand that the above Chinese characters will not make sense; I just picked them out of the Windows character set.

Now for the question. Are the characters I enter from the Windows Character Set program the same as the ones that would be entered from a native Chinese keyboard? If so, why is the conversion not working? If not, where can I get a program that will emulate the characters that would be entered, so that I can see whether the problem lies in Windows or in my conversions?

My ultimate aim is to create an app in which I can set the language on the pages without having to change the code itself. This app would need to work in Thai and Chinese as well as Western European languages. My database is set to use UTF-8.

Any further help would be greatly appreciated.

Regards

Steve Vanspall

PS: Surely there are applications that have had to deal with this before?????

-----Original Message-----
From: Andrew B. Sudell [mailto:[EMAIL PROTECTED]]
Sent: Monday, 15 April 2002 10:57 AM
To: [EMAIL PROTECTED]
Subject: Internatinalisation Question

Steve:

I noticed your thread on tomcat-user. Something about it seemed wrong to me, but I didn't butt in with an answer, as I can't quite put my finger on it. Mind if I ask a few questions? I've got a hunch there's something really important that's been left unsaid.

Steve Vanspall writes:
> Ok, having tested a bit more, I think I can give a clearer description of my problem.
>
> I am currently in the process of making my application multilingual.
>
> I have successfully altered my database to be such, and it uses the UTF-8 character set now.
>
> I have changed the meta tag to set the charset to UTF-8.
>
> Retrieving information from the database seems to be OK; however, all the pages have forms for entering/altering data. If I enter foreign characters into the form, they are received in the database as a string of HTML-style character codes.
>
> e.g. &#23445;&#34259;&#54301;

That's where I start feeling things are weird. I'm assuming here you are entering data into an HTML form, i.e. into a text input or something like that. Right? What you are typing above, e.g. &#12345;, are HTML entities. I don't expect to see them in data from the form. That's what's bothering me. Data from a form is generally URL-encoded, so any numeric representations of characters are generally of the form %0B, i.e. hex bytes with leading percent signs. Those the servlet API will deal with for you, modulo getting the encoding right so it doesn't mis-transcode the bytes.

Are you posting the data, or is the action a GET? What browser? Do you specify a content type for the form? Can you send me a copy of the form?

> Those aren't the exact integers, but that is the pattern.
>
> Now the character encoding filter in Tomcat 4.0.3 is not doing anything with these characters, because it is reading them one by one ('&' '#' '2' '3' '4' '4' '5' ';') and, finding them to be normal characters, does not try to convert them.
>
> I have then added to the request interception method (doFilter) a method that strips the '&#' and ';' from either end of the number. It then creates an int out of the remaining string ('23445'). When I cast this int to a char, it seems to come up with the correct character when I debug. This is correct right up until I try to convert the string to UTF-8 or just enter it into the database. It then becomes '????'.
>
> My questions are:
>
> 1.
> As I don't have a foreign keyboard, and am entering the characters using the Windows Character Map, am I entering them in a form that is not the same as if someone using a Chinese keyboard would enter them?

Don't know. Don't have access to Windows at the moment [between jobs and run a MS-free home], and haven't used the Character Map. What does it do? One trick I have used in the past was to just cut and paste data from a good page in the proper encoding. There are some nice example pages at http://vancouver-webpages.com/multilingual/index.shtml that I've used in the past. But that's mostly native encodings. There are some native and Unicode samples at http://www.unicode.org/iuc/iuc10/ ; oddly enough, the Unicode conference announcements are multilingual. There is also a decent page at the UN. They have the Universal Declaration of Human Rights in every language known to man (just about literally): http://www.unhchr.ch/udhr/navigate/alpha.htm All the pages are encoded in UTF-8, if I recall, and you can get a PDF with embedded fonts, so you know whether what you are doing looks right.

> I.e., is the encoding different? Given that the Java code seems to be OK with the integers as chars, I am thinking this is not the case.
>
> 2. Is there something I am doing wrong with the conversion? At the moment I am doing new String(origString.getBytes(), "UTF-8");

That's a reasonable way to transcode, assuming the bytes in the string are in your native encoding (i.e. the JVM's default encoding). But if the String contains ...., '#', '1', '2', '3', '4', ';' and not the UCS-2 character whose code is 1234, it won't help. Those characters encode the same in just about every encoding.

My gut take at the moment is that your problem is still back at the browser. What you are seeing is unlike anything I've run into with internationalized Java web apps, including one that supported several multi-byte encodings.
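
To make the point about the transcoding concrete, here is a minimal sketch, not anyone's actual filter code. It assumes a modern JDK (java.nio.charset.StandardCharsets arrived well after Tomcat 4.0.3), and the class name NcrTranscodeSketch and the helpers decodeNcrs and hex are hypothetical names made up for illustration. It shows that a byte round trip leaves the ASCII text &#23445; untouched, that the reference has to be decoded into a real character first, and that encoding that character with a charset that cannot represent it yields '?', which may be one source of the '????' described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Tomcat or the servlet API.
final class NcrTranscodeSketch {
    // Matches decimal numeric character references such as &#23445;
    private static final Pattern NCR = Pattern.compile("&#(\\d+);");

    /** Replaces each decimal reference with the character it names.
     *  Sketch only: out-of-range numbers make parseInt or toChars throw. */
    static String decodeNcrs(String input) {
        Matcher m = NCR.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String ch = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String fromForm = "&#23445;"; // eight plain ASCII characters

        // ASCII-only text survives a bytes round trip unchanged,
        // so transcoding alone cannot turn it into a Chinese character.
        String roundTrip = new String(
                fromForm.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(fromForm)); // true

        // Decode the reference first: one char, U+5B95.
        String decoded = decodeNcrs(fromForm);
        System.out.println(hex(decoded.getBytes(StandardCharsets.UTF_8)));      // E5 AE 95
        // A charset that cannot represent the character substitutes '?'.
        System.out.println(hex(decoded.getBytes(StandardCharsets.ISO_8859_1))); // 3F
    }
}
```

Naming the charset explicitly on every getBytes call keeps the result independent of the JVM's default encoding, which is the variable the quoted one-liner leaves open.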
Maybe if you explain the whole processing sequence (HTML, form, request, what the servlet does to get the data, i.e. is it just getParameter()), I can make some sense of it.

Drew

-- 
Drew Sudell [EMAIL PROTECTED] http://www.op.net/~asudell
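
For reference, the remark earlier in this message about URL-encoded form data and mis-transcoding can be sketched with the JDK's own java.net.URLEncoder and java.net.URLDecoder. This is only an illustration under assumptions: the class name FormEncodingSketch and its decode helper are hypothetical, and a real servlet container does this decoding itself, using whatever request encoding it has been told to use. The same percent-escaped bytes give one Chinese character when read as UTF-8 and three unrelated characters when read as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical illustration class; a servlet container does this internally.
final class FormEncodingSketch {
    /** Decodes a percent-escaped form value using the named charset. */
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String typed = "\u5B95"; // the character a user typed into the form

        // What a browser would put on the wire for a UTF-8 page.
        String onTheWire = URLEncoder.encode(typed, "UTF-8");
        System.out.println(onTheWire); // %E5%AE%95

        // Decoding with the matching charset recovers the character.
        System.out.println(decode(onTheWire, "UTF-8").equals(typed)); // true

        // Decoding with the wrong charset turns three bytes into three chars.
        System.out.println(decode(onTheWire, "ISO-8859-1").length()); // 3
    }
}
```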
