I'm sure you guys are talking about character sets (= character encoding in
MIME terms) in HTTP. I added my comments below. ;)

Sung-Gu

----- Original Message -----
Subject: Re: [HttpClient]Encoding

> I'll see about changing getResponseBodyAsString() to use the charset from
> the content-type (if it exists). I'm up to my ears with day job work right
> now, so it'll probably be a while before I can get to it.

I think we'll also need to support language tags (within the Accept-Language
and Content-Language fields) and Accept and Content-Type (for internet media
types) at some point.

> People still need to understand (and I'll improve the JavaDoc) that
> getResponseBodyAsString() is never really going to be all that useful in
> the real world. From HttpClient's perspective the response body is simply
> a sequence of bytes, nothing more. It is up to a higher application layer
> to actually *interpret* those bytes based on the mime type specified in
> the content-type header.
>
> Marc Saegesser
>
> > -----Original Message-----
> > From: Rapheal Kaplan [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, March 20, 2002 1:53 PM
> > To: Jakarta Commons Developers List
> > Subject: Re: [HttpClient]Encoding
> >
> > Makes sense to me. Because the encoding is handled in the body itself,
> > it doesn't necessarily help that much to set the encoding in the
> > getResponseBodyAsString method. Also, this kind of means that you can't
> > rely on the getResponseBodyAsString method for all purposes. There needs
> > to be some other layer of a client application that manages encoding.
> >
> > I still see the use of get...AsString, of course. It could be an
> > in-between step that is sent to a parser to determine the actual
> > encoding, but then you would need to return to the original byte stream
> > anyway to re-string the body. Maybe the documentation should reflect
> > this information.
> >
> > Also, if people start using charset info in the future, it would
> > probably be nice to provide support. It might be that body-to-string
> > conversion should be somewhere else in the API. Any ideas?
> >
> > My first guess would be to have a utility class that can do the correct
> > encoding, from both the header and maybe even by parsing the content.
> > However, I don't think I am familiar enough with the API to say
> > decisively.
> >
> > I do know that such features might be very useful for some work that I
> > need to do in the near future. I am working on software that needs to
> > interact with several languages with non-Latin character sets.

In your previous mail:

> For example, if the client is requesting a document written in Chinese,
> it could well use an entirely different encoding.

If you want to solve this problem purely from the perspective of character
encoding, you have to consider the conversion between the local character
set and the transfer character set on both the client and the server side.
It can get more complicated! If you mix non-ASCII characters (Korean and
Chinese...), you also have to handle bi-directional display for these
character sets. Then you need a two-step process to convert between the
local character set and UTF-8: first, convert the local character set to
UCS; second, convert UCS to UTF-8. How complicated, huh? And one more thing:
some old clients or servers don't support an 8-bit transfer encoding like
UTF-8. Then what? We would have to check whether the octets are really valid
UTF-8 or not.
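By the way, in Java the two steps above map almost directly onto the standard
String API, because a Java String is already UCS internally. A rough sketch
only, assuming the JDK in use ships an EUC-KR converter; the sample text and
charset names are just examples, not anything HttpClient does today:

```java
import java.io.UnsupportedEncodingException;

public class TwoStepConversion {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Pretend these bytes arrived in a local/legacy character set,
        // e.g. a Korean document encoded in EUC-KR.
        byte[] localBytes = "\uc548\ub155\ud558\uc138\uc694".getBytes("EUC-KR");

        // Step 1: local character set -> UCS.
        // (A Java String is already Unicode, so this is the whole step.)
        String ucs = new String(localBytes, "EUC-KR");

        // Step 2: UCS -> UTF-8 for transfer.
        byte[] utf8 = ucs.getBytes("UTF-8");

        // And back again on the receiving side: UTF-8 -> UCS -> local.
        String decoded = new String(utf8, "UTF-8");
        byte[] backToLocal = decoded.getBytes("EUC-KR");

        System.out.println(decoded.equals(ucs));            // true
        System.out.println(utf8.length + " UTF-8 bytes");   // 15 (3 per Hangul syllable)
    }
}
```

The only real work is knowing the right name for the local charset; the UCS
step comes for free.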
However, there is an easier way to solve this problem. (I really want to say
this bit! ^^) That's to use "escaped encoding", which uses only the ASCII
character set. It looks like the application/x-www-form-urlencoded media
type in HTML, but it's somewhat different. (There is a small sketch of this
at the bottom of this mail.)

> > - Rapheal Kaplan
> >
> > On Wednesday 20 March 2002 14:27, you wrote:
> > > I've had to deal with this problem myself. Right now the only solution
> > > is to use getResponseBody() and convert the bytes into a string using
> > > the appropriate encoding. I like the idea of having
> > > getResponseBodyAsString() use the encoding specified in the
> > > Content-Type header, but the problem is that it still won't be very
> > > useful.
> > >
> > > The vast majority of web servers out there don't include a
> > > "; charset=" attribute in the content-type header or provide a
> > > reasonable mechanism for content authors to cause the server to set
> > > the attribute correctly on a per-file basis. Most pages with
> > > non-ISO-LATIN-1 charsets use a <META HTTP-EQUIV> tag in the HTML
> > > header to specify the page encoding. That means you still have to read
> > > at least part of the response body (as ISO-LATIN-1) in order to
> > > determine the correct encoding.
> > >
> > > I don't have a problem with changing getResponseBodyAsString() to
> > > check the content-type header, I just doubt that doing that will make
> > > it much more useful in the real world.
> > >
> > > What do others think?
> > >
> > > Marc Saegesser
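For what it's worth, the two-stage lookup Marc describes (charset attribute
first, then the <META HTTP-EQUIV> tag) could live in a small utility like
the one below. This is only a sketch: the class and method names are mine,
not part of the current HttpClient API, and the meta-tag parsing is
deliberately naive.

```java
import java.io.UnsupportedEncodingException;

public class EncodingSniffer {

    /** Pulls the value out of "...; charset=xyz", or returns null if there is none. */
    public static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int i = contentType.toLowerCase().indexOf("charset=");
        if (i < 0) {
            return null;
        }
        String rest = contentType.substring(i + "charset=".length()).trim();
        if (rest.startsWith("\"")) {
            rest = rest.substring(1);
        }
        int end = rest.length();
        for (int j = 0; j < rest.length(); j++) {
            char c = rest.charAt(j);
            if (c == ';' || c == '"' || c == '>' || Character.isWhitespace(c)) {
                end = j;
                break;
            }
        }
        return end == 0 ? null : rest.substring(0, end);
    }

    /** Deliberately naive sniff of a <META HTTP-EQUIV="Content-Type" ...> declaration. */
    public static String charsetFromMetaTag(byte[] body) throws UnsupportedEncodingException {
        // Marc's point: decode only a prefix, as ISO-8859-1, just to find the declaration.
        int len = Math.min(body.length, 2048);
        String head = new String(body, 0, len, "ISO-8859-1").toLowerCase();
        int meta = head.indexOf("http-equiv=\"content-type\"");
        return meta < 0 ? null : charsetFromContentType(head.substring(meta));
    }

    /** Decode the body with the best charset we can find, falling back to the HTTP default. */
    public static String bodyToString(byte[] body, String contentTypeHeader)
            throws UnsupportedEncodingException {
        String charset = charsetFromContentType(contentTypeHeader);
        if (charset == null) {
            charset = charsetFromMetaTag(body);
        }
        if (charset == null) {
            charset = "ISO-8859-1";   // the HTTP/1.1 default
        }
        return new String(body, charset);
    }
}
```

You'd call bodyToString() with the byte[] from getResponseBody() and the raw
value of the Content-Type response header; it still reads part of the body
as ISO-8859-1, exactly as Marc says, but at least that is hidden in one place.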

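And going back to the "escaped encoding" idea I mentioned above: the closest
thing in the core JDK is java.net.URLEncoder/URLDecoder, which I'm only using
here to illustrate the principle. It implements the form-urlencoded rules
(spaces become '+' instead of %20), which is part of the "somewhat different"
I mentioned, and the two-argument methods need JDK 1.4:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EscapedEncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String korean = "\uc548\ub155 world";   // "annyeong world"

        // Escape: each non-ASCII character becomes %HH sequences of its UTF-8
        // bytes, so the result is plain 7-bit ASCII and survives any transport.
        String escaped = URLEncoder.encode(korean, "UTF-8");
        System.out.println(escaped);   // %EC%95%88%EB%85%95+world

        // Unescape on the other end, using the same charset.
        String restored = URLDecoder.decode(escaped, "UTF-8");
        System.out.println(restored.equals(korean));   // true
    }
}
```

The point is just that both sides only ever exchange ASCII; the charset
question is pushed into the escaping rules instead of the transport.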