Hi,

the metaPattern regex in org.apache.nutch.parse.html.HtmlParser does not handle sites that use single quotes in <meta http-equiv....>.

Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>

We came across a couple of pages quoted this way, and Nutch-1.2 was not able to handle them. Is there any fallback, or would it be good to use the following regex, which accepts both single and double quotes?

"<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"

BR
Alexander Fahlke
Software Development
www.informera.de
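
For anyone who wants to try the proposed pattern outside of Nutch, here is a minimal standalone sketch that checks it against single-quoted, double-quoted and unquoted variants of the tag. The class name MetaPatternCheck and the helper matchesMeta are hypothetical, and compiling with Pattern.CASE_INSENSITIVE is an assumption (needed here because the example tag spells it 'Content-Type'); how HtmlParser itself compiles and applies the pattern may differ.

```java
import java.util.regex.Pattern;

// Hypothetical test harness, not part of Nutch.
final class MetaPatternCheck {

    // Proposed regex from the message above.
    // CASE_INSENSITIVE is an assumption so that "Content-Type" matches "content-type".
    static final Pattern META_PATTERN = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
            Pattern.CASE_INSENSITIVE);

    // Returns true if the given HTML snippet contains a matching meta tag.
    static boolean matchesMeta(String html) {
        return META_PATTERN.matcher(html).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>",
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">",
            "<meta http-equiv=Content-Type content=text/html>",
            "<meta name='description' content='no http-equiv here'>"
        };
        for (String sample : samples) {
            System.out.println(matchesMeta(sample) + "  " + sample);
        }
    }
}
```

The optional groups (\"|')? on each side do not force the opening and closing quote to be the same character, so a mismatched pair would also be accepted; for locating the tag that is probably lenient enough.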

