Re: Crawling and parsing

Michael Kelleher Thu, 24 Nov 2011 16:03:44 -0800

Hi Sergey,

I am not sure where to begin looking. I have removed all of the configs and put them back in one at a time, to no avail.

The regex-urlfilter.txt has the line: -[?*!@=] commented out. I even commented out: -.*(/[^/]+)/[^/]+\1/[^/]+\1/

Just for chuckles, I commented out: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
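
For reference, here is a rough sketch of what those three rules would reject if they were still active. It is only an approximation: Nutch evaluates these patterns with Java regular expressions, while this uses Python's re module, and it assumes each pattern is searched for anywhere in the URL. The rule names, the rejecting_rules helper, and the example.com URLs are made up for illustration and are not part of Nutch.

```python
import re

# The three exclusion rules quoted above, without the leading "-" that
# marks a line in regex-urlfilter.txt as a rejection rule.
# The dictionary keys are illustrative labels, not Nutch names.
RULES = {
    "query-chars": r"[?*!@=]",
    "repeated-segment": r".*(/[^/]+)/[^/]+\1/[^/]+\1/",
    "file-suffix": (
        r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|"
        r"xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$"
    ),
}


def rejecting_rules(url):
    """Return the labels of the rules whose pattern is found in the URL."""
    return [name for name, pattern in RULES.items() if re.search(pattern, url)]


if __name__ == "__main__":
    # Hypothetical URLs, one per rule plus one that no rule touches.
    for url in (
        "http://example.com/page.html?id=3",
        "http://example.com/a/b/a/c/a/",
        "http://example.com/logo.png",
        "http://example.com/docs/index.html",
    ):
        print(url, rejecting_rules(url))
```

With all three rules commented out, none of these URLs should be dropped by those particular patterns, so the filter is probably not what is changing the outlinks.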




In parse-plugins.xml I commented out:

<!--
<mimeType name="text/html">
<plugin id="parse-html" />
</mimeType>

<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
</mimeType>
-->

and

<!--
<alias name="parse-html"
            extension-id="org.apache.nutch.parse.html.HtmlParser" />
-->

I even reduced it to:

<parse-plugins>

<!--  by default if the mimeType is set to *, or
        if it can't be determined, use parse-tika -->
<mimeType name="*">
<plugin id="parse-tika" />
</mimeType>

<!-- alias mappings for parse-xxx names to the actual extension implementation
    ids described in each plugin's plugin.xml file -->
<aliases>
<alias name="parse-tika"
            extension-id="org.apache.nutch.parse.tika.TikaParser" />
<alias name="parse-ext" extension-id="ExtParser" />
</aliases>

</parse-plugins>

No matter what I do, it seems as though it is NOT using Tika to parse the HTML files for extraction, because it NEVER parses the relative URLs the same way that the ParserChecker does.
