Update: the problem occurs only in the TikaParser! Ideas?

Markus

-----Original message-----
> From: Markus Jelsma <[email protected]>
> Sent: Thursday 26th October 2017 17:53
> To: [email protected]; User <[email protected]>
> Subject: RE: Wrong encoding
>
> Note: setting parser.character.encoding.default to UTF-8 doesn't work.
>
> Many thanks,
> Markus
>
> -----Original message-----
> > From: Markus Jelsma <[email protected]>
> > Sent: Thursday 26th October 2017 17:33
> > To: User <[email protected]>
> > Subject: Wrong encoding
> >
> > Hello,
> >
> > I have this URL for which parsechecker reports
> > Content-Type=text/html; charset=windows-1252, which is incorrect. There is
> > also Content-Type=text/html; charset=utf-8 in the metadata, which I do find
> > in the HTML, at least I see <meta charset="utf-8">. This is Nutch
> > 1.14-SNAPSHOT.
> >
> > But anyway, the extracted text is completely messed up: not all, but most
> > accents are unreadable.
> >
> > No idea, do you have any?
> >
> > Many thanks,
> > Markus
> >
> > https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
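
For anyone following along, here is a minimal sketch of what this kind of garbling typically looks like. It assumes (the thread does not confirm this) that the page bytes are UTF-8 but get decoded as windows-1252. The helper names `garble` and `repair` and the Danish sample string are hypothetical and chosen only for illustration; nothing here calls Nutch or Tika.

```python
def garble(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as windows-1252.

    This mimics a parser that trusts the wrong charset. Bytes that are
    undefined in windows-1252 are replaced with U+FFFD.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair(garbled: str) -> str:
    """Reverse the mistake: re-encode as windows-1252, decode as UTF-8.

    Only works when no bytes were lost (no U+FFFD) during garbling.
    """
    return garbled.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    sample = "grønt og æbler"  # hypothetical sample text, not taken from the page
    broken = garble(sample)
    print(broken)          # each accented letter becomes a two-character sequence
    print(repair(broken))  # round-trips back to the original sample
```

If the extracted text shows two-character sequences such as "Ã¸" where a single accented letter is expected, that would be consistent with the wrong charset being applied during parsing, though other causes are possible.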

