Re: Character encoding

2005-04-06 Thread Alain Bench
Hello Georg,

 On Friday, April 1, 2005 at 12:01:15 PM +0200, Georg Bauhaus wrote:

 The apostrophe might have been typed as an accent (acute) really

Most probably the RIGHT SINGLE QUOTATION MARK U+2019, ’, encoded
in UTF-8, then wrongly seen as being CP-1252. It would look like â€™
(a circumflex, euro symbol, trademark sign), and once transliterated to
Latin-1 like âEUR(tm).
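
The misreading described above can be reproduced in a few lines of Python. This is only a sketch: the `TRANSLIT` table and the `transliterate` helper are hypothetical stand-ins for whatever transliteration the mail archive applied, covering just the two characters involved here.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the "curly" apostrophe from MS Word.
quote = "\u2019"

# In UTF-8 this single character becomes three bytes: E2 80 99.
utf8_bytes = quote.encode("utf-8")
print(utf8_bytes.hex())

# Reading those three bytes as CP-1252 yields three separate characters:
# a with circumflex, euro sign, trademark sign.
misread = utf8_bytes.decode("cp1252")
print(misread)

# Hypothetical transliteration table (an assumption, not a real library API):
# Latin-1 has no euro or trademark sign, so they get ASCII approximations.
TRANSLIT = {"\u20ac": "EUR", "\u2122": "(tm)"}


def transliterate(text: str) -> str:
    """Replace characters missing from Latin-1 with ASCII approximations."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)


print(transliterate(misread))
```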


Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a
digest. This often builds incorrect references and breaks threads.


RE: Character encoding

2005-04-05 Thread Alan Hunter

The solution is to explicitly set the character encoding to utf-8. I do this
in the aspx file's head section and it works fine. 

This is kinda weird though, as with an aspx file it seems that dotnet will
always insert this charset header for you by default (you can see this by
running wget in debug mode, without setting the charset in the head
section). However, relying on that header alone does not work when using
wget. It does work in normal browsers though, as aspx files with utf-8 chars
obviously display fine.
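
A possible reason the HTTP header is not enough: the charset sent in the Content-Type header is not stored in the file wget saves, so a browser opening the copy from disk only has hints inside the file itself. The sketch below checks a saved page for such an in-file declaration. The `declared_charset` helper and its regular expression are illustrative assumptions, not part of wget or ASP.NET, and the pattern is deliberately simplistic.

```python
import re
from typing import Optional

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def declared_charset(html: bytes) -> Optional[str]:
    """Return the charset named in a <meta> tag, or None if there is none."""
    # Browsers only scan the beginning of the file for this declaration.
    match = _META_CHARSET.search(html[:1024])
    return match.group(1).decode("ascii").lower() if match else None


without_meta = b"<HTML><body><P>Example</P></body></HTML>"
with_meta = (
    b'<HTML><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8"></head>'
    b"<body><P>Example</P></body></HTML>"
)

print(declared_charset(without_meta))
print(declared_charset(with_meta))
```

A file for which this returns None depends entirely on the browser's guess once it is opened from disk.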

Anyway problem solved, just thought I'd let you know.


-----Original Message-----
From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
Sent: March 31, 2005 3:19 PM
To: Alan Hunter
Cc: 'wget@sunsite.dk'
Subject: Re: Character encoding


I'm not sure what causes this problem, but I suspect it does not come
from Wget doing something wrong.  That Notepad opens the file
correctly is indicative enough.

Maybe those browsers don't understand UTF-8 (or other) encoding of
Unicode when the file is opened on-disk?


RE: Character encoding

2005-04-01 Thread Alan Hunter
but I suspect it does not come from Wget doing something wrong.

I'm not so sure about that: it produces different output for the same
infile when only the extension of the infile changes. I tried with the
exact same file spidered three times, changing only the extension between
each spider. 

infile = 69 bytes

.html outfile = 69 bytes
.zzz outfile = 69 bytes
.aspx outfile = 66 bytes
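
One thing worth checking, as a guess that the thread does not confirm: a UTF-8 byte order mark (BOM) is exactly three bytes long, the same as the difference between 69 and 66. The sketch below only shows how to test a file's bytes for a BOM and how much it adds; the `has_utf8_bom` helper and the sample body are made up for illustration.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


# Illustrative content only, not the actual 69-byte test file.
body = "<P>Example\u2019s</P>".encode("utf-8")
with_bom = codecs.BOM_UTF8 + body

print(has_utf8_bom(body))
print(has_utf8_bom(with_bom))

# The BOM accounts for a three-byte size difference.
print(len(with_bom) - len(body))
```

To test the real files, read each one in binary mode, for example `open(path, "rb").read()`, and pass the result to `has_utf8_bom`.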

So either it is wget or something screwy with asp.net. As I said, I don't
know the inner workings of either so I'm not sure. I'll try to set up the
example on a public server in the next few days.



Character encoding

2005-03-31 Thread Alan Hunter



Hi,
I have a webpage that has some html text that has been pasted from MS Word,
and the quote char ’ is a special "type", i.e. not the ASCII one. This char
displays fine in IE/Firefox. However, when I spider the page with Wget
(Windows) it encodes this character in a funny way, e.g. areaâ€™s = area’s.
The spidered html page then does not display properly in IE/Firefox as the
char is not decoded. 
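
For text that has already been garbled this way, the damage can usually be reversed, assuming the cause is UTF-8 bytes read as CP-1252 (an assumption about this page, not something wget reports). A minimal sketch, with `undo_cp1252_misread` as a hypothetical helper name:

```python
def undo_cp1252_misread(garbled: str) -> str:
    """Reverse the damage from UTF-8 bytes having been decoded as CP-1252.

    Raises UnicodeError if the text was not garbled in this particular way.
    """
    # Turn the wrongly decoded characters back into the original bytes,
    # then decode those bytes with the encoding they were written in.
    return garbled.encode("cp1252").decode("utf-8")


# a-circumflex, euro sign, trademark sign in place of the curly quote.
garbled = "area\u00e2\u20ac\u2122s"
print(undo_cp1252_misread(garbled))
```

Plain ASCII text passes through unchanged, so only the garbled characters are affected.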


Is this correct behaviour? Any idea how to fix it? It happens on a number of
chars, not just the quote, but I use that as an example. I am not an expert on 
character encoding so go easy.

Thanks.


Re: Character encoding

2005-03-31 Thread Hrvoje Niksic
Wget shouldn't alter the page contents, except for converted links.
Is the funny character in places which Wget should know about
(e.g. URLs in links) or in the page text?  Could you post a minimal
excerpt from the page, before and after garbling done by Wget?
Alternately, could you post a URL where we could try this?


RE: Character encoding

2005-03-31 Thread Alan Hunter

Hi, 
Thanks for the reply. It is the page text that is the problem.

When I started to investigate it further I found that it actually only
happens when the page being wgot is a .aspx (.net asp) file. 

I made 3 identical files (as below), one with .html ext, one with .aspx ext
and one with .zzz ext (just an unknown filetype under IIS), then wgot each
one and changed the output file to a .html extension. When I open them in my
browser both the .html and .zzz files are fine, but the one that came from
the .aspx file has the funny chars. Why this is so I have no idea.

The output file looks fine when I open it in Notepad (i.e. the quote looks
right), but when I open it in Firefox/IE it shows the funny chars (see
below). If I then just save the file in Notepad, without changing a thing,
the problem is fixed in Firefox/IE.

I am now really really confused :) 

---Input file: note the ’ char; you'll need to spider
under IIS with a .aspx ext to replicate---

<HTML>

<body>

<P>Example’s</P>


</body>
</HTML>

---FIREFOX VIEW SOURCE on output file---

<HTML>

<body>

<P>Exampleâ€™s</P>


</body>
</HTML>


Re: Character encoding

2005-03-31 Thread Hrvoje Niksic
I'm not sure what causes this problem, but I suspect it does not come
from Wget doing something wrong.  That Notepad opens the file
correctly is indicative enough.

Maybe those browsers don't understand UTF-8 (or other) encoding of
Unicode when the file is opened on-disk?