Hi Sol,

> +^(http|https)://([a-z0-9]*\.)*nutch.apache.org/

Thanks, I've updated the wiki patch to include https as well.
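
For what it's worth, here is a rough Python sketch (not Nutch code) of why the
http-only rule drops the https link. It assumes the usual regex-urlfilter
behaviour: rules are tried top to bottom, the first pattern found anywhere in
the URL decides, and a URL matching no rule is rejected. The names
rules_with() and accepts() are made up for the illustration, and Python's re
stands in for Java's regex engine.

```python
import re

# Suffix rule from the quoted regex-urlfilter.txt, joined back into one pattern.
SUFFIXES = (
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|"
    r"zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|"
    r"exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)


def rules_with(accept_pattern):
    """Build the rule list from the mail, ending with the given accept rule."""
    return [
        ("-", r"^(file|ftp|mailto):"),
        ("-", SUFFIXES),
        ("-", r"[?*!@=]"),
        ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
        ("+", accept_pattern),
    ]


HTTP_ONLY = rules_with(r"^http://([a-z0-9]*\.)*nutch.apache.org/")
HTTP_AND_HTTPS = rules_with(r"^(http|https)://([a-z0-9]*\.)*nutch.apache.org/")


def accepts(url, rules):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    url = "https://nutch.apache.org/miredot/1.12/index.html"
    print(accepts(url, HTTP_ONLY))       # rejected by the http-only rule
    print(accepts(url, HTTP_AND_HTTPS))  # accepted once https is allowed
```

With the http-only rule the https URL falls through every pattern and is
dropped, which matches what you saw before changing the last line.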


How many cycles did you run the crawl for? I got 28 pages after 3 cycles
starting from http://nutch.apache.org/ ...
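
The number of cycles matters because each generate/fetch/update round only
fetches URLs already known from earlier rounds, so every round reaches roughly
one link hop further from the seed. A toy Python sketch of that idea follows;
the link graph and the crawl() helper are invented for illustration and ignore
topN, filters, and scoring.

```python
def crawl(seed, links, rounds):
    """Simulate crawl rounds over a static link graph (hypothetical model)."""
    known = {seed}   # plays the role of the crawldb
    fetched = set()
    for _ in range(rounds):
        to_fetch = sorted(known - fetched)       # "generate"
        if not to_fetch:
            break                                # nothing new to fetch
        for url in to_fetch:                     # "fetch" and "parse"
            fetched.add(url)
            known.update(links.get(url, []))     # "updatedb" adds outlinks
    return fetched


# Invented link graph: "/" links to "/a" and "/b", and so on.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/b": [],
}

if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, len(crawl("/", LINKS, n)))
```

In this toy graph the page three hops from the seed is only fetched in the
fourth round, so a crawl stopped after three rounds never sees it.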

Best,
Sebastian


On 11/15/2017 08:22 PM, Sol Lederman wrote:
> If I google for site:nutch.apache.org I get ~12,500 results. When I crawl
> the site via nutch I get 28 records in the solr index.
> 
> Here's the relevant piece of my regex-urlfilter.txt file. It's just the
> default that comes with nutch.
> 
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> 
> # accept anything else
> # +.
> +^http://([a-z0-9]*\.)*nutch.apache.org/
> 
> 
> I'm sure I can find a number of examples of files that should be crawled
> and aren't. Here's one example.
> 
> https://nutch.apache.org/javadoc.html has links to a number of
> apidocs pages that are picked up by nutch. But, this page,
> https://nutch.apache.org/miredot/1.12/index.html, is not picked up. It's
> referenced like this:
> 
>     <li><a href="./miredot/1.12/index.html">1.13 (1.X branch)</a></li>
> 
> I wouldn't imagine that relative links would be a problem as other relative
> links are handled fine. And, I did click on that link and it doesn't stray
> from nutch.apache.org.
> 
> I thought the problem might have to do with http vs. https. So, I changed
> the last line of the filter to be this:
> 
> +^(http|https)://([a-z0-9]*\.)*nutch.apache.org/
> 
> 
> When I did that, the /miredot/ URL got fetched and parsed, but the
> URLs indexed into Solr were the same as before, including https.
> 
> What am I missing?
> 
> Thanks.
> 
> Sol
> 
