Hi,

HtmlParser is part of the parse-html plugin, found in src/plugins/parse-html/.

Cheers

On Tuesday 07 June 2011 18:01:22 lewis john mcgibbney wrote:
> Hi Alex,
> 
> I cannot locate the java file you mention at
> org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3...
> 
> Having a quick look at org.apache.nutch.parse.HTMLMetaTags (identical in
> both versions above), it appears that you are right: "double quotes" for
> <meta http-equiv....> are accepted whereas 'single quotes' are not. I would
> be interested to see what kind of output you get when nutch-1.2 encounters
> the single-quote meta syntax you highlight. Can you elaborate, please?
> 
> If your regex suggestion is working then I would stick with this, however
> this is maybe something you wish to raise in JIRA... any comments?
> Lewis
> 
> On Tue, Jun 7, 2011 at 4:05 PM, Alex F <[email protected]> wrote:
> > Hi,
> > 
> > The regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is
> > not suitable for sites using single quotes for <meta http-equiv....>.
> > 
> >  Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
> > 
> > We experienced a couple of pages with that kind of quotes and Nutch-1.2
> > was not able to handle it.
> > 
> > Is there any fallback or would it be good to use the following regex
> > (single or regular quotes are accepted)?
> > 
> > "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"
> > 
> > BR
> > 
> > Alexander Fahlke
> > Software Development
> > www.informera.de
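
For anyone who wants to check the proposed regex outside of Nutch, here is a minimal standalone sketch. It compiles the pattern quoted above and runs it against the single-quoted example from the thread and a double-quoted equivalent. The class name MetaQuoteCheck, the DOUBLE_ONLY comparison pattern and the use of Pattern.CASE_INSENSITIVE are assumptions made for illustration; they are not copied from the HtmlParser source.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of Nutch.
class MetaQuoteCheck {

    // Regex proposed in the thread: an optional " or ' around content-type.
    // CASE_INSENSITIVE is an assumption so that "Content-Type" matches.
    static final Pattern PROPOSED = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Assumed double-quote-only variant, used purely as a comparison baseline.
    static final Pattern DOUBLE_ONLY = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Returns true if the pattern is found anywhere in the given HTML snippet.
    static boolean matches(Pattern p, String html) {
        return p.matcher(html).find();
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String dbl =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">";

        System.out.println("proposed, single quotes: " + matches(PROPOSED, single));
        System.out.println("proposed, double quotes: " + matches(PROPOSED, dbl));
        System.out.println("baseline, single quotes: " + matches(DOUBLE_ONLY, single));
        System.out.println("baseline, double quotes: " + matches(DOUBLE_ONLY, dbl));
    }
}
```

Note that the sketch only tests whether the meta tag is recognised; extracting the charset value from the matched group is a separate step that is not shown here.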

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
