Hi Sergey,

I am not sure where to begin looking. I removed all of the configs and put them back one at a time, to no avail.


In regex-urlfilter.txt, the line -[?*!@=] is commented out. I also commented out: -.*(/[^/]+)/[^/]+\1/[^/]+\1/



Just for chuckles, I commented out: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
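For reference, here is a rough sketch of how these three rules would treat a URL if they were still active. It is plain Python rather than Nutch code, and it assumes the filter takes the first rule whose pattern is found anywhere in the URL. The trailing "+." accept rule and the example.com URLs are assumptions for illustration only; they are not copied from my config.

```python
import re

# Rules quoted above, in file order. '-' excludes a URL, '+' accepts it.
RULES = [
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$"),
    ("+", r"."),  # ASSUMPTION: a catch-all accept rule at the end of the file
]


def first_match(url, rules=RULES):
    """Return (sign, rule_index) for the first rule found in url, else None."""
    for index, (sign, pattern) in enumerate(rules):
        if re.search(pattern, url):
            return sign, index
    return None


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise each rule.
    for url in [
        "http://example.com/page.html?id=1",
        "http://example.com/a/b/a/c/a/",
        "http://example.com/logo.png",
        "http://example.com/docs/index.html",
    ]:
        print(url, first_match(url))
```

With a rule commented out, it simply drops out of the list, so a URL it would have rejected falls through to whatever rule comes next.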



In parse-plugins.xml I commented out:

<!--
<mimeType name="text/html">
<plugin id="parse-html" />
</mimeType>

<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
</mimeType>
-->

and

<!--
<alias name="parse-html"
            extension-id="org.apache.nutch.parse.html.HtmlParser" />
-->

I even reduced it to:

<parse-plugins>

<!--  by default if the mimeType is set to *, or
        if it can't be determined, use parse-tika -->
<mimeType name="*">
<plugin id="parse-tika" />
</mimeType>

<!-- alias mappings for parse-xxx names to the actual extension implementation
    ids described in each plugin's plugin.xml file -->
<aliases>
<alias name="parse-tika"
            extension-id="org.apache.nutch.parse.tika.TikaParser" />
<alias name="parse-ext" extension-id="ExtParser" />
</aliases>

</parse-plugins>


No matter what I do, it seems as though it is NOT using Tika to parse the HTML files and extract the links, because it NEVER resolves the relative URLs the same way that ParserChecker does.
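To be clear about what I mean by resolving relative URLs, here is a small sketch of the behaviour I would expect, written in Python with the standard urljoin rather than anything from Nutch or Tika. The base URL and the hrefs are made up for illustration; they are not from my crawl.

```python
from urllib.parse import urljoin

# ASSUMPTION: hypothetical page URL standing in for a fetched document.
BASE = "http://example.com/a/b/page.html"


def resolve_all(base, hrefs):
    """Resolve each href against base, as a link extractor would for outlinks."""
    return [urljoin(base, href) for href in hrefs]


if __name__ == "__main__":
    for resolved in resolve_all(BASE, ["../c.html", "img/x.html", "/top.html"]):
        print(resolved)
```

Comparing a list like this against the outlinks from ParserChecker and from the crawl should show which of the two is resolving the links differently.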


