For what it's worth, as a result of sending this message I have been able to 
advance a little bit in this area.

It had seemed to me that new, relevant URLs were indeed being fetched and 
parsed, so I was confused about why Solr had such a small number of documents 
in its index. I speculated that Solr must be deduplicating regular results.

After digging into the END of my Nutch process logs, and testing these 
"correct-looking" URLs, I see that I may well have been mistaken. The URLs 
frequently look like this:

http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name.html
http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name2.html
http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name3.html

What we should notice here is the repetition of 
"another-section/another-section". At some point Nutch hits an error page. The 
server erroneously returns an HTTP 200 code, instead of an HTTP 404 "Not 
Found" code, and serves what is essentially a broken page.

This broken page still contains a number of general elements that exist on all 
pages of the site, such as a list of recent news articles, and it has an odious 
feature: the relative URLs of those news articles seem to just be appended onto 
the current page's path, thereby generating ever more URLs to crawl, all of 
which lead to identical content.
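The growth mechanism can be sketched with Python's standard URL resolution. 
The URLs and the relative link below are made up to mirror the pattern above; 
the point is that a relative href is resolved against the *directory* of the 
current page, so each hop through the broken page stacks on one more segment:

```python
from urllib.parse import urljoin

# Hypothetical page URL and relative link, mirroring the pattern above.
base = "http://www.example.com/site/news/2014/section/subsection/"
rel = "another-section/Long-Document-Name.html"

# First resolution: the link lands under the current directory.
url = urljoin(base, rel)
print(url)

# Following the *same* relative link from the resulting (broken) page:
# urljoin drops the final path component before appending, so the
# "another-section" directory segment accumulates on every hop.
url2 = urljoin(url, rel)
print(url2)
# .../subsection/another-section/another-section/Long-Document-Name.html
```

Each crawl round repeats this, which is exactly how the frontier balloons 
without any genuinely new content appearing.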

I know that Nutch is configured to filter out URL segments that repeat 3 or 
more times, but in this case we're already nearing 500,000 URLs to crawl.
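Since the stock three-repeat rule kicks in too late here, one option is a 
stricter rule that rejects any *immediately* repeated path segment. The 
tighter regex below is my own assumption, not a stock Nutch rule, but it can 
be tested against sample URLs before going into regex-urlfilter.txt (where it 
would be prefixed with "-" to mean reject):

```python
import re

# Stricter variant of Nutch's repeated-segment rule (an assumption, not a
# stock rule): reject a URL whose path contains the same segment twice in
# a row, e.g. ".../another-section/another-section/...".
repeat_twice = re.compile(r".*(/[^/]+)\1/")

urls = [
    "http://www.example.com/site/news/2014/section/ok/file/Doc.html",
    "http://www.example.com/site/news/2014/section/another-section/another-section/file/Doc.html",
]
for url in urls:
    verdict = "reject" if repeat_twice.match(url) else "keep"
    print(verdict, url)
```

The backreference \1 only fires when the doubled segment sits at slash 
boundaries, so legitimate URLs with merely similar segment names still pass.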

I appreciate that the website in question is, for all intents and purposes, 
"broken", and I'll do my best, but I can't rely on its owners to fix it. Is 
there a better methodology for identifying erroneous URLs? Perhaps they could 
be de-duplicated in the parsing phase, or maybe Nutch could notice that all 
the CSS, JS, images, and so on are 404'ing and somehow "guess" that this is a 
bad page?
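One common heuristic for this class of problem (a "soft 404" detector, not a 
Nutch feature as far as I know, though Nutch's content-signature deduplication 
works on a similar principle) is to fetch a deliberately bogus URL on the 
site, fingerprint the body the server returns with a 200, and then flag any 
"real" page whose body matches that fingerprint. A minimal sketch, with the 
fetching left out:

```python
import hashlib

def fingerprint(body: bytes) -> str:
    # Hash the page body so identical "soft 404" pages collapse to one key.
    return hashlib.sha1(body).hexdigest()

def is_soft_404(body: bytes, error_fp: str) -> bool:
    # error_fp is the fingerprint of the body served for a deliberately
    # bogus URL (e.g. /this-should-404-<random>). If a "real" page hashes
    # to the same value, the server handed us its error page with a 200.
    return fingerprint(body) == error_fp

# Usage sketch: fetch the bogus URL once per site, then test each page.
error_page = b"<html>Sorry, we could not find that page.</html>"
error_fp = fingerprint(error_page)
print(is_soft_404(error_page, error_fp))                   # identical body
print(is_soft_404(b"<html>A real news article</html>", error_fp))
```

An exact-hash comparison only catches byte-identical error pages; if the 
broken page embeds per-URL content, a fuzzier signature (e.g. hashing only 
the main text block) would be needed.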

Thanks!
Craig
