Hi there

I have been dealing with this problem for the last few days.

I am definitely making progress now.

I found that the <%@ page %> directive actually works.

The form receives information correctly. The only problem now is that the characters 
I put in, when converted to the regular Java character set, go in correctly, but the 
UTF-8 representation of these characters is incorrect.

Obviously this is due to the fact that the characters I am entering do not have the 
same representation in UTF-8 as they do in ISO-8859-1.

So

亞乽乨亪仮 gets converted to 亞乽乨亪仮 in UTF-8.

I understand that the above Chinese characters will not make sense; I just picked them 
out of the Windows character set.
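
A minimal sketch of the kind of mismatch described above, assuming Java 7 or later (for StandardCharsets). The class and method names are made up for illustration and are not part of Tomcat or of my application. It shows what happens when UTF-8 bytes are read as ISO-8859-1, how that particular mistake can be undone, and why characters that ISO-8859-1 lacks end up as '?'.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the application discussed here.
class CharsetMismatchDemo {

    // Encode as UTF-8, then decode the bytes with the wrong charset (ISO-8859-1).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the raw bytes and decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Encoding to a charset that lacks the character substitutes '?'.
    static String lossyLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u4E9E\u4EEE"; // two CJK characters (U+4E9E and U+4EEE)
        String garbled = misdecode(original);
        System.out.println(garbled.length());                 // 6
        System.out.println(repair(garbled).equals(original)); // true
        System.out.println(lossyLatin1(original));            // ??
    }
}
```

The repair step only works while the raw bytes are still intact. Once a character has been replaced by '?', the original is gone.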

Now for the question.

Are the characters I enter from the Windows Character Set program the same as the 
ones that would be entered from a native Chinese keyboard?

If so, why is the conversion not working?

If not, where can I get a program that will emulate the characters that would be 
entered, so that I can see whether the problem lies in Windows or with my conversions?

My ultimate aim is to create an application in which I can set the language on the 
pages without the code itself having to be changed. This app would need to work in 
Thai and Chinese as well as Western European languages. My database is set to use UTF-8.

Any further help would be greatly appreciated.


Regards

Steve Vanspall

PS: Surely there are applications that have had to deal with this before?


-----Original Message-----
From: Andrew B. Sudell [mailto:[EMAIL PROTECTED]]
Sent: Monday, 15 April 2002 10:57 AM
To: [EMAIL PROTECTED]
Subject: Internationalisation Question



Steve:

I noticed your thread on tomcat-user.  Something about it seemed
wrong to me.  But I didn't butt in with an answer, as I can't quite put 
my finger on it.  Mind if I ask a few questions?  I've got a hunch
there's something real important that's been left unsaid.  

Steve Vanspall writes:
 > Ok having tested a bit more, I think I can give a clearer description of my
 > problem.
 > 
 > I am currently in the process of making my application multilingual.
 > 
 > I have successfully altered my database to be such, and it uses the UTF-8
 > character set now.
 > 
 > I have changed the meta-inf tag to set the charset to UTF-8.
 > 
 > Retrieving information from the database seems to be ok, however, all the
 > pages have forms for entering/altering data. If I enter foreign characters
 > into the form they are received in the database as a string of HTML-style
 > character codes.
 > 
 > e.g. &#23445;&#34259;&#54301;

That's where I start feeling things are weird.  I'm assuming here you
are entering data into an HTML form, i.e. into a text input or something
like that.  Right?

What you are typing above, e.g. &#12345;, are HTML entities (numeric
character references).  I don't expect to see them in data from the
form.  That's what's bothering me.  Data from a form is generally
url-encoded, so any numeric representations of characters are generally
of the form %0B, i.e. hex bytes with leading percents.  Those the
servlet API will deal with for you, modulo getting the encoding right
so it doesn't mis-transcode the bytes.
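
To make the url-encoding point concrete, here is a rough sketch using only java.net.URLEncoder and java.net.URLDecoder from the standard library. The class name and helper methods are invented for illustration; a real servlet container does the decoding itself, and this only imitates it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; it only mimics what a browser and a servlet container do.
class FormEncodingDemo {

    // Roughly what a browser sends for a form field on a UTF-8 page.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // What the receiving side does; the charset must match the sender's.
    static String decode(String wire, String charset) {
        try {
            return URLDecoder.decode(wire, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String wire = encode("\u4E9E");
        System.out.println(wire);                                   // %E4%BA%9E
        System.out.println(decode(wire, "UTF-8").equals("\u4E9E")); // true
        System.out.println(decode(wire, "ISO-8859-1").length());    // 3
    }
}
```

Decoding the same percent-escaped bytes with the wrong charset gives three Latin-1 characters instead of one CJK character, which is the mis-transcoding I mean. One possible explanation for the &#NNNN; strings, offered as an assumption and not a diagnosis: some browsers substitute numeric character references for characters that the form's submission charset cannot represent.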

Are you posting the data or is the action a get?

What browser?

Do you specify a content type for the form?

Can you send me a copy of the form?

 > 
 > those aren't the exact integers, but that is the pattern.
 > 
 > Now the character encoding filter in Tomcat 4.0.3 is not doing anything with
 > these characters because it is reading them one by one ('&' '#' '2' '3' '4'
 > '4' '5' ';') and, finding them to be normal characters, does not try to
 > convert them.
 > 
 > I have then extended the request interception method (doFilter) with a
 > method that strips the '&#' and ';' from either end of the number. It then
 > creates an int out of the remaining string ('23445'). When I cast this int
 > to a char, it seems to come up with the correct character when I debug. This
 > is correct right up until I try to convert the string to UTF-8 or just enter
 > it into the database. It then becomes '????'.
 > 
 > My questions are:
 > 
 > 1. As I don't have a foreign keyboard and am entering the characters
 > using the Windows Character Map, am I entering them in a form that is not
 > the same as what someone using a Chinese keyboard would enter?

Don't know.  Don't have access to Windows at the moment [between jobs
and run a MS-free home], and haven't used the character map.  What's
it do?

One trick I have used in the past was to just cut and paste data from
a good page in the proper encoding.  There are some nice example pages 
at http://vancouver-webpages.com/multilingual/index.shtml that I've
used before.  But, that's mostly native encodings.  There are
some native and unicode samples at http://www.unicode.org/iuc/iuc10/ ; 
oddly enough, the Unicode conference announcements are multi-lingual.
There is also a decent page at the UN.  They have the Universal
Declaration of Human Rights in every language known to man (just about literally).
http://www.unhchr.ch/udhr/navigate/alpha.htm  All the pages are
encoded utf-8, if I recall, and you can get a PDF with embedded fonts,
so you know if what you are doing looks right.

 > 
 > I.e., is the encoding different? Given that the Java code seems to be OK with
 > the integers as chars, I am thinking this is not the case.
 > 
 > 2. Is there something I am doing wrong with the conversion? At the moment I
 > am doing new String(origString.getBytes(), "UTF-8");

That's a reasonable way to transcode, assuming the bytes in the string 
are in your native encoding (i.e. the JVM's default encoding).  But if
the String contains ...., '#', '1', '2', '3', '4', ';' and not the UCS-2 
character whose code is 1234, it won't help.  Those characters encode
the same in just about every encoding.
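
A small sketch of that last point, assuming Java 7 or later. The class name, method name and regular expression are mine and purely illustrative. It only handles decimal references of the &#NNNN; form, which is the pattern you described. The literal reference is plain ASCII, so it has the same bytes in UTF-8 and ISO-8859-1; only after it is turned into the real character does the choice of encoding matter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; handles decimal references such as &#20126; only.
class NumericReferenceDemo {

    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    // Replace each decimal reference with the character it names.
    static String decodeReferences(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave out-of-range numbers untouched instead of guessing.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String literal = "&#20126;"; // eight ASCII characters
        byte[] asUtf8 = literal.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = literal.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(asUtf8, asLatin1)); // true

        String decoded = decodeReferences(literal);
        System.out.println(decoded.equals("\u4E9E"));                        // true
        System.out.println(decoded.getBytes(StandardCharsets.UTF_8).length); // 3
    }
}
```

This is a workaround for data that already arrived as references, not a fix for whatever is producing them.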

My gut take at the moment, is your problem is still back at the
browser. What you are seeing is unlike anything I've run into with
internationalized java web apps, including one that supported several
multi-byte encodings.

Maybe if you explain the whole processing sequence (HTML, form,
request, what the servlet does to get the data, i.e. is it just
getParameter()), I can make some sense of it.

Drew

-- 
        Drew Sudell     [EMAIL PROTECTED]      http://www.op.net/~asudell

