Ah, thanks! That was indeed my problem. I now look in the headers and if
they contain "Content-Encoding: gzip", I unzip the content. Not sure how I
could be silly enough to miss that.
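
For anyone who finds this thread later, the check can be sketched roughly as below. This is only a minimal sketch, not my exact code: the class name GzipAwareFetch and its helper methods are made up for illustration, and it assumes the decompressed body is UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class; the names are illustrative only.
class GzipAwareFetch {

    // Wrap the raw stream in a GZIPInputStream only when the
    // Content-Encoding header value says the body is gzipped.
    public static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Read the whole stream into a String (assumption: the text is UTF-8).
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Fetch a page and transparently gunzip it when the server compressed it.
    public static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = decode(conn.getInputStream(), conn.getContentEncoding())) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling GzipAwareFetch.fetch("http://en.wiktionary.org/wiki/absolute_instrument") should then return readable HTML whether or not the server chose to compress the response.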

Matthew

On 29 July 2011 14:15, Daniel Zahn <[email protected]> wrote:

> Hi,
>
> When I wget the page "absolute_instrument", I get a gzipped version of it.
>
> file absolute_instrument
> absolute_instrument: gzip compressed data, from Unix
>
> This is unlike the "-a" example, which is not gzipped but plain HTML right
> away.
>
> Hence, the former will look garbled to you unless you run "gunzip" on it
> first to remove the compression. (If gzip complains about an "unknown
> suffix", rename the file to *.gz.)
> Then you should get regular HTML.
>
> Here's an example of how to decompress gzip in Java:
>
> http://code.hammerpig.com/how-to-gunzip-files-with-java.html
>
> I am not sure, however, how the server decides whether to compress a page
> or not.
> Hope that helps anyway,
>
> Daniel
>
> On Fri, Jul 29, 2011 at 2:58 PM, Matthew Pocock <
> [email protected]> wrote:
>
> > Hi,
> >
> > I've been pulling down pages from wiktionary in a Java application. The
> > majority of pages seem to work fine (e.g.
> > http://en.wiktionary.org//wiki/-a).
> > I can load them in Java, and if I wget them, I end up with a file
> > containing what I'd expect.
> >
> > However, some pages seem not to work (e.g.
> > http://en.wiktionary.org/wiki/absolute_instrument). In Java, I get a
> > codec exception, and when using wget, the resulting downloaded file is
> > garbled. I think this is because, although they claim to be UTF-8
> > encoded, they are not. These pages show up fine in my browser, but it
> > isn't telling me what charset it uses to decode the text.
> >
> > Is this a known issue? Is there any workaround for this? Can it be fixed
> > server-side?
> >
> > Thanks,
> >
> > Matthew
> >
> > --
> > Dr Matthew Pocock
> > Visitor, School of Computing Science, Newcastle University
> > mailto: [email protected]
> > gchat: [email protected]
> > msn: [email protected]
> > irc.freenode.net: drdozer
> > tel: (0191) 2566550
> > mob: +447535664143
> > _______________________________________________
> > Wiktionary-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
> >
>
>
>
> --
> Daniel Zahn <[email protected]>
>



-- 
Dr Matthew Pocock
Visitor, School of Computing Science, Newcastle University
mailto: [email protected]
gchat: [email protected]
msn: [email protected]
irc.freenode.net: drdozer
tel: (0191) 2566550
mob: +447535664143