If I google for site:nutch.apache.org I get ~12,500 results. But when I crawl
the site with Nutch I get only 28 records in the Solr index.

Here's the relevant piece of my regex-urlfilter.txt file. It's just the
default that comes with nutch.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# +.
+^http://([a-z0-9]*\.)*nutch.apache.org/


I'm sure I can find a number of examples of files that should be crawled
and aren't. Here's one example.

https://nutch.apache.org/javadoc.html has links to a number of
apidocs pages that are picked up by nutch. But, this page,
https://nutch.apache.org/miredot/1.12/index.html, is not picked up. It's
referenced like this:

    <li><a href="./miredot/1.12/index.html">1.13 (1.X branch)</a></li>

I wouldn't imagine that relative links would be a problem as other relative
links are handled fine. And, I did click on that link and it doesn't stray
from nutch.apache.org.
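
To double-check the relative link outside of Nutch, here is a minimal Python sketch that resolves the href against the page it appears on. It uses only the standard library and the two URLs quoted above; it says nothing about how Nutch itself resolves outlinks.

```python
from urllib.parse import urljoin, urlsplit

# Page that contains the link, and the href exactly as it appears in the HTML.
base_page = "https://nutch.apache.org/javadoc.html"
href = "./miredot/1.12/index.html"

# Standard relative-reference resolution (RFC 3986 style).
resolved = urljoin(base_page, href)
parts = urlsplit(resolved)

print(resolved)
print(parts.scheme, parts.hostname)
```

The resolved URL keeps the scheme and host of the page it came from, so the outlink is an https URL on nutch.apache.org.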

I thought the problem might have to do with http vs. https. So, I changed
the last line of the filter to be this:

+^(http|https)://([a-z0-9]*\.)*nutch.apache.org/


When I did that, the /miredot/ URL got fetched and parsed, but the
URLs indexed into Solr were the same as before, including https.
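
To check my reading of the filter, here is a rough Python sketch of how I understand the rules to be applied: top to bottom, the first rule whose pattern is found in the URL decides, and a URL that matches no rule is rejected. The `accepts` helper and the rule lists are my own illustration, not Nutch code, and Python's `re` stands in for Java's regex engine, which I'm assuming behaves the same for these patterns.

```python
import re

# Rules shared by both versions of my regex-urlfilter.txt, in file order.
COMMON_RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|"
          r"wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|"
          r"tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
]

# Original last rule (http only) versus the changed one (http or https).
HTTP_ONLY = COMMON_RULES + [
    ("+", r"^http://([a-z0-9]*\.)*nutch.apache.org/"),
]
HTTP_OR_HTTPS = COMMON_RULES + [
    ("+", r"^(http|https)://([a-z0-9]*\.)*nutch.apache.org/"),
]


def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


url = "https://nutch.apache.org/miredot/1.12/index.html"
print("http only:    ", accepts(HTTP_ONLY, url))
print("http or https:", accepts(HTTP_OR_HTTPS, url))
```

Under those assumptions the original rule set rejects the https URL and the changed one accepts it, which fits the /miredot/ page being fetched after the change. The sketch only models the filter step, not what ends up in Solr.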

What am I missing?

Thanks.

Sol
