On Fri, Jul 29, 2011 at 01:58:15PM +0100, Matthew Pocock wrote:
> Hi,
> 
> I've been pulling down pages from wiktionary in a Java application. The
> majority of pages seem to work fine (e.g. http://en.wiktionary.org//wiki/-a).
> I can load them in Java, and if I wget them, I end up with a file containing
> what I'd expect.
> 
> However, some pages seem not to work (e.g.
> http://en.wiktionary.org/wiki/absolute_instrument). In Java, I get a codec
> exception and when using wget, the resulting downloaded file is garbled. I
> think this is because although they claim to be UTF-8 encoded, they are not.
> These pages show up fine in my browser, but it isn't telling me what charset
> it uses to decode the text.

It works perfectly for me.  Maybe your problem is that wget saves
it as a gzipped file?

The response headers include:
Content-Encoding: gzip
Content-Length: 5486
Content-Type: text/html; charset=UTF-8

And there is nothing wrong with it as far as I can see.
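
If the gzip encoding is indeed the cause, something along these lines
might help on the Java side.  It is only a sketch: the names
GzipAwareFetch, decode and fetch are made up for illustration, and it
assumes the body really is UTF-8 as the Content-Type header claims.  It
checks Content-Encoding and unzips the stream before decoding the text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class GzipAwareFetch {

    // Turn a response body into text. If the server declared
    // "Content-Encoding: gzip", decompress before decoding.
    // Assumption: the text is UTF-8, as the Content-Type header says.
    static String decode(InputStream body, String contentEncoding) throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body)
                : body;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Fetch a page, telling the server that gzip is acceptable, and
    // decode it according to the Content-Encoding it actually sent.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");
        try (InputStream body = conn.getInputStream()) {
            return decode(body, conn.getContentEncoding());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java GzipAwareFetch <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

Feeding the raw bytes straight into a UTF-8 decoder without the gzip
step would produce exactly the kind of codec error you describe.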


Kurt


_______________________________________________
Wiktionary-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
