Hi,

HtmlParser is part of a plugin; you can find it in src/plugin/parse-html/.
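
A minimal standalone sketch for checking the regex Alex proposes in the quoted message below against both quoting styles. It is not the plugin's actual code: the class name, the helper sniffCharset, the CHARSET pattern, the CASE_INSENSITIVE flag and the double-quote-only pattern used for contrast are all assumptions made for illustration, since the thread does not quote the existing metaPattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaQuoteSketch {

    // Pattern proposed in the thread: accepts double quotes, single quotes or no quotes.
    // CASE_INSENSITIVE is an assumption so that "Content-Type" matches "content-type".
    static final Pattern META_ANY_QUOTE = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical double-quote-only variant, used here only for contrast.
    static final Pattern META_DOUBLE_ONLY = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper pattern that pulls the charset name out of the matched tag body.
    static final Pattern CHARSET = Pattern.compile(
        "charset=\\s*([a-z][_\\-0-9a-z]*)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the first matching meta tag, or null if none matches.
    static String sniffCharset(Pattern metaPattern, String html) {
        Matcher meta = metaPattern.matcher(html);
        if (!meta.find()) {
            return null;
        }
        Matcher charset = CHARSET.matcher(meta.group(1));
        return charset.find() ? charset.group(1) : null;
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String dbl =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";

        System.out.println(sniffCharset(META_ANY_QUOTE, single));   // iso-8859-1
        System.out.println(sniffCharset(META_ANY_QUOTE, dbl));      // utf-8
        System.out.println(sniffCharset(META_DOUBLE_ONLY, single)); // null
    }
}
```

The last line shows why a double-quote-only pattern would miss the single-quoted tag: after http-equiv= the optional double quote matches nothing, and the next character is ' rather than the start of content-type.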
Cheers

On Tuesday 07 June 2011 18:01:22 lewis john mcgibbney wrote:
> Hi Alex,
>
> I cannot locate the java file you mention at
> org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3...
>
> Having a quick look at org.apache.nutch.parse.HTMLMetaTags (in both
> versions above it is identical) it appears that you are right the "double
> quotes" for <meta http-equiv....> are accepted whereas 'single quotes' are
> not. I would be interested to see what kind of output you get when
> nutch-1.2 experiences the type of single quote meta syntax you highlight?
> Can you elaborate please...
>
> If your regex suggestion is working then I would stick with this, however
> this is maybe something you wish to raise in JIRA... any comments?
> Lewis
>
> On Tue, Jun 7, 2011 at 4:05 PM, Alex F <[email protected]> wrote:
> > Hi,
> >
> > the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is
> > not suitable for sites using single quotes for <meta http-equiv....>
> >
> > Example: <meta http-equiv='Content-Type' content='text/html;
> > charset=iso-8859-1'>
> >
> > We experienced a couple of pages with that kind of quotes and Nutch-1.2
> > was not able to handle it.
> >
> > Is there any fallback or would it be good to use the following
> > regex: "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"
> > (single or regular quotes are accepted)?
> >
> > BR
> >
> > Alexander Fahlke
> > Software Development
> > www.informera.de

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

