Hi,

The regex metaPattern inside org.apache.nutch.parse.html.HtmlParser does not
match pages that use single quotes in <meta http-equiv....> tags.

  Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
  We came across a couple of pages that use this kind of quoting, and
Nutch-1.2 was not able to handle them.

Is there any fallback, or would it be good to use the following regex
instead (it accepts either single or double quotes)?

  "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"
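
To make the proposal easier to check, here is a small standalone sketch that
compiles the regex above and tries it against single-quoted, double-quoted and
unquoted tags. The class name MetaPatternCheck and its helper methods are
made up for this mail and are not part of Nutch. I am also assuming the
pattern is compiled with Pattern.CASE_INSENSITIVE, which is needed for
"Content-Type" to match the lowercase "content-type" in the regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness, not Nutch code.
public class MetaPatternCheck {

    // Proposed pattern: the quote around content-type may be ", ' or absent.
    // CASE_INSENSITIVE is an assumption about how the pattern is compiled.
    static final Pattern META_PATTERN = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // True if the HTML snippet contains a matching <meta http-equiv ...> tag.
    static boolean matchesMeta(String html) {
        return META_PATTERN.matcher(html).find();
    }

    // Returns the attribute text captured by group 1, or null if no match.
    static String attributes(String html) {
        Matcher m = META_PATTERN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>",
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">",
            "<meta http-equiv=Content-Type content=text/html>"
        };
        for (String s : samples) {
            System.out.println(matchesMeta(s) + "  " + s);
        }
    }
}
```

Note that the two optional quote groups are independent, so a mismatched pair
such as 'content-type" would be accepted as well. That is probably harmless
for charset detection, but worth knowing.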

BR

Alexander Fahlke
Software Development
www.informera.de
