Ah, thanks! That was indeed my problem. I now look in the headers and if they contain "Content-Encoding: gzip", I unzip the content. Not sure how I could be silly enough to miss that.
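For anyone who hits the same thing, here is a minimal sketch of that header check in Java. The class and method names (GzipAwareFetch, decode, readUtf8, fetch, gzip) are made up for illustration and are not taken from my application. It assumes the page body is UTF-8 and only handles the "gzip" content encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the names here are illustrative only.
public class GzipAwareFetch {

    // Wrap the body in a GZIPInputStream only when the Content-Encoding
    // header value says the server compressed it; otherwise pass it through.
    static InputStream decode(String contentEncoding, InputStream body) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        return body;
    }

    // Read the whole stream and decode it as UTF-8 (an assumption about the page).
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Fetch a URL, unzipping the body if the response headers ask for it.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try (InputStream in = decode(conn.getContentEncoding(), conn.getInputStream())) {
            return readUtf8(in);
        } finally {
            conn.disconnect();
        }
    }

    // Test helper: gzip a string so the decode path can be exercised offline.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Offline round trip: no network access is needed for this demo.
        byte[] compressed = gzip("<html>absolute instrument</html>");
        System.out.println(readUtf8(decode("gzip", new ByteArrayInputStream(compressed))));
    }
}
```

With something like this, fetch("http://en.wiktionary.org/wiki/absolute_instrument") and fetch("http://en.wiktionary.org//wiki/-a") would go through the same code path, whether or not the server chose to compress the response.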
Matthew

On 29 July 2011 14:15, Daniel Zahn <[email protected]> wrote:

> Hi,
>
> when i wget the page "absolute_instrument" i get a gzipped version of it.
>
> file absolute_instrument
> absolute_instrument: gzip compressed data, from Unix
>
> as opposed to the example "-a", which is not gzipped, but plain HTML right
> away.
>
> Hence, the former one might look garbled to you, unless you use "gunzip"
> first to remove the compression. (If gzip complains about "unknown suffix"
> rename it to *.gz).
> Then you should get regular HTML.
>
> Here's an example on how to remove gzip in Java:
>
> http://code.hammerpig.com/how-to-gunzip-files-with-java.html
>
> I am not sure however how the server-side decides whether to compress it or
> not.
> Hope that helps anyways,
>
> Daniel
>
> On Fri, Jul 29, 2011 at 2:58 PM, Matthew Pocock <[email protected]> wrote:
>
> > Hi,
> >
> > I've been pulling down pages from wiktionary in a Java application. The
> > majority of pages seem to work fine (e.g.
> > http://en.wiktionary.org//wiki/-a).
> > I can load them in Java, and if I wget them, I end up with a file
> > containing what I'd expect.
> >
> > However, some pages seem not to work (e.g.
> > http://en.wiktionary.org/wiki/absolute_instrument). In Java, I get a
> > codec exception and when using wget, the resulting downloaded file is
> > garbled. I think this is because although they claim to be UTF-8
> > encoded, they are not.
> > These pages show up fine in my browser, but it isn't telling me what
> > charset it uses to decode the text.
> >
> > Is this a known issue? Is there any workaround for this? Can it be fixed
> > server-side?
> >
> > Thanks,
> >
> > Matthew
> >
> > --
> > Dr Matthew Pocock
> > Visitor, School of Computing Science, Newcastle University
> > mailto: [email protected]
> > gchat: [email protected]
> > msn: [email protected]
> > irc.freenode.net: drdozer
> > tel: (0191) 2566550
> > mob: +447535664143
> > _______________________________________________
> > Wiktionary-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
>
> --
> Daniel Zahn <[email protected]>
