If I google for site:nutch.apache.org I get ~12,500 results. When I crawl the site with Nutch I get only 28 records in the Solr index.
Here's the relevant piece of my regex-urlfilter.txt file. It's just the default that comes with Nutch, apart from the last line.

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept anything else
    # +.
    +^http://([a-z0-9]*\.)*nutch.apache.org/

I'm sure I can find a number of examples of files that should be crawled and aren't. Here's one example. https://nutch.apache.org/javadoc.html has links to a number of apidocs pages that are picked up by Nutch. But this page, https://nutch.apache.org/miredot/1.12/index.html, is not picked up. It's referenced like this:

    <li><a href="./miredot/1.12/index.html">1.13 (1.X branch)</a></li>

I wouldn't imagine that relative links would be a problem, as other relative links are handled fine. And I did click on that link, and it doesn't stray from nutch.apache.org.

I thought the problem might have to do with http vs. https. So I changed the last line of the filter to this:

    +^(http|https)://([a-z0-9]*\.)*nutch.apache.org/

When I did that, the /miredot/ URL got fetched and parsed, but the URLs indexed into Solr were the same as before including https.

What am I missing?

Thanks.
Sol
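For anyone who wants to reproduce the filter behaviour outside of Nutch, here is a minimal Python sketch of the rules quoted above. It rests on assumptions: that the regex URL filter checks rules top to bottom, that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is rejected. The names `RULES`, `HTTPS_RULES` and `accepts` are hypothetical helpers, not part of Nutch, and Python's `re` module stands in for Java's regex engine.

```python
import re

# Rules copied from the regex-urlfilter.txt quoted above, as (sign, pattern).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS"
          r"|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM"
          r"|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://([a-z0-9]*\.)*nutch.apache.org/"),
]

# Same rules, with the last line changed to also allow https.
HTTPS_RULES = RULES[:-1] + [
    ("+", r"^(http|https)://([a-z0-9]*\.)*nutch.apache.org/"),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule (assumed semantics)."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: no matching rule means the URL is dropped


url = "https://nutch.apache.org/miredot/1.12/index.html"
print(accepts(url))               # False: the http-only rule never matches
print(accepts(url, HTTPS_RULES))  # True: the widened rule lets it through
```

Under these assumptions, every https URL is dropped by the original last line, which is consistent with the /miredot/ page only being fetched after the rule was widened. It says nothing about what happens later at indexing time.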

