Hi Sol,

> +^(http|https)://([a-z0-9]*\.)*nutch.apache.org/

Thanks, I've updated the wiki patch to include https as well.

How many cycles did you run the crawl? I got 28 pages after 3 cycles
starting from http://nutch.apache.org/ ...

Best,
Sebastian

On 11/15/2017 08:22 PM, Sol Lederman wrote:
> If I google for site:nutch.apache.org I get ~12,500 results. When I crawl
> the site via nutch I get 28 records in the solr index.
>
> Here's the relevant piece of my regex-urlfilter.txt file. It's just the
> default that comes with nutch.
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> # +.
> +^http://([a-z0-9]*\.)*nutch.apache.org/
>
>
> I'm sure I can find a number of examples of files that should be crawled
> and aren't. Here's one example.
>
> https://nutch.apache.org/javadoc.html has links to a number of
> apidocs pages that are picked up by nutch. But, this page,
> https://nutch.apache.org/miredot/1.12/index.html, is not picked up. It's
> referenced like this:
>
> <li><a href="./miredot/1.12/index.html">1.13 (1.X branch)</a></li>
>
> I wouldn't imagine that relative links would be a problem as other relative
> links are handled fine. And, I did click on that link and it doesn't stray
> from nutch.apache.org.
>
> I thought the problem might have to do with http vs. https. So, I changed
> the last line of the filter to be this:
>
> +^(http|https)://([a-z0-9]*\.)*nutch.apache.org/
>
>
> When I did that then the /miredot/ url got fetched and parsed but the
> urls indexed into Solr were the same as before including https.
>
> What am I missing?
>
> Thanks.
>
> Sol
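
For illustration only, here is a minimal Python sketch of how the quoted rules treat the /miredot/ URL with and without the https alternative. It assumes the behaviour described in the comments of the default regex-urlfilter.txt (the first matching rule decides, and a URL that matches no rule is ignored) and assumes patterns match anywhere in the URL unless anchored. The names `BASE_RULES`, `HTTP_ONLY`, `HTTP_AND_HTTPS` and `accepts` are made up for this sketch and are not part of Nutch, and Python's `re` only approximates the Java regex engine Nutch uses.

```python
import re

# Rules copied from the quoted regex-urlfilter.txt, in file order.
# Each entry is (sign, pattern): "-" rejects, "+" accepts.
BASE_RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|"
          r"wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|"
          r"tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
]

# The two variants of the final accept rule discussed in the thread.
HTTP_ONLY = ("+", r"^http://([a-z0-9]*\.)*nutch.apache.org/")
HTTP_AND_HTTPS = ("+", r"^(http|https)://([a-z0-9]*\.)*nutch.apache.org/")


def accepts(url, rules):
    """Return True if the first rule that matches `url` is an accept rule.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


url = "https://nutch.apache.org/miredot/1.12/index.html"
print(accepts(url, BASE_RULES + [HTTP_ONLY]))       # False: no rule matches an https URL
print(accepts(url, BASE_RULES + [HTTP_AND_HTTPS]))  # True: the final accept rule matches
```

Under these assumptions the http-only rule drops every https:// URL at the filter stage, which is consistent with the /miredot/ page only being fetched after the rule was changed. The sketch says nothing about what ends up in the Solr index.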

