Hi Sergey,
I am not sure where to begin looking. I have removed all of the
configs and put them back in one at a time to no avail.
The regex-urlfilter.txt has the line: -[?*!@=] commented out. I even
commented out: -.*(/[^/]+)/[^/]+\1/[^/]+\1/
Just for chuckles, I commented out:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
In parse-plugins.xml I commented out:
<!--
<mimeType name="text/html">
<plugin id="parse-html" />
</mimeType>
<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
</mimeType>
-->
and
<!--
<alias name="parse-html"
extension-id="org.apache.nutch.parse.html.HtmlParser" />
-->
I even reduced parse-plugins.xml to:
<parse-plugins>
  <!-- by default if the mimeType is set to *, or
       if it can't be determined, use parse-tika -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  <!-- alias mappings for parse-xxx names to the actual extension
       implementation ids described in each plugin's plugin.xml file -->
  <aliases>
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
    <alias name="parse-ext" extension-id="ExtParser" />
  </aliases>
</parse-plugins>
No matter what I do, it seems as though it is NOT using Tika to parse
the HTML files for extraction, because it NEVER parses the relative URLs
the same way that the ParserChecker does.
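It may also be worth confirming that parse-html is not still enabled
through plugin.includes in nutch-site.xml. As far as I understand,
parse-plugins.xml only chooses among plugins that are already activated
there. A minimal sketch of such an override follows; the plugin list in
the value is an assumption for illustration and would need to match the
plugins this crawl actually uses:

```xml
<!-- nutch-site.xml (sketch): the plugin list below is an assumed example,
     not taken from a working setup; the point is parse-tika without parse-html -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```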