For what it's worth, as a result of sending this message I have been able to 
advance a little bit in this area.

It had seemed to me that new, relevant URLs were indeed being fetched and 
parsed, so I was confused about why Solr had such a small number of documents 
in its index. I speculated that Solr must be deduplicating regular results.

After digging into the END of my Nutch process logs, and testing these 
"correct-looking" URLs, I see that I may well have been mistaken. The URLs 
frequently look like this:

http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name.html
http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name2.html
http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name3.html

What we should notice here is the repetition of 
"another-section/another-section". At some point Nutch hits an error page. The 
server erroneously returns an HTTP 200 code, instead of an HTTP 404 "Not 
Found" code, and serves what is essentially a broken page.

This broken page still contains a number of general elements that exist on all 
pages of the site, such as a list of recent news articles, and it has an odious 
feature: the relative URLs of those news articles seem to just be appended onto 
the current page's path, thereby generating ever more URLs to crawl, all of 
which lead to identical content.
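The growth mechanism can be sketched with Python's standard URL resolution. 
The URLs and the relative link below are made up to mirror the pattern above; 
the point is that a relative href is resolved against the *directory* of the 
current page, so each hop through the broken page stacks on one more segment:

```python
from urllib.parse import urljoin

# Hypothetical page URL and relative link, mirroring the pattern above.
base = "http://www.example.com/site/news/2014/section/subsection/"
rel = "another-section/Long-Document-Name.html"

# First resolution: the link lands under the current directory.
url = urljoin(base, rel)
print(url)

# Following the *same* relative link from the resulting (broken) page:
# urljoin drops the final path component before appending, so the
# "another-section" directory segment accumulates on every hop.
url2 = urljoin(url, rel)
print(url2)
# .../subsection/another-section/another-section/Long-Document-Name.html
```

Each crawl round repeats this, which is exactly how the frontier balloons 
without any genuinely new content appearing.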

I know that Nutch is configured to filter out URL segments that repeat 3 or 
more times, but in this case we're already nearing 500,000 URLs to crawl.
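Since the stock three-repeat rule kicks in too late here, one option is a 
stricter rule that rejects any *immediately* repeated path segment. The 
tighter regex below is my own assumption, not a stock Nutch rule, but it can 
be tested against sample URLs before going into regex-urlfilter.txt (where it 
would be prefixed with "-" to mean reject):

```python
import re

# Stricter variant of Nutch's repeated-segment rule (an assumption, not a
# stock rule): reject a URL whose path contains the same segment twice in
# a row, e.g. ".../another-section/another-section/...".
repeat_twice = re.compile(r".*(/[^/]+)\1/")

urls = [
    "http://www.example.com/site/news/2014/section/ok/file/Doc.html",
    "http://www.example.com/site/news/2014/section/another-section/another-section/file/Doc.html",
]
for url in urls:
    verdict = "reject" if repeat_twice.match(url) else "keep"
    print(verdict, url)
```

The backreference \1 only fires when the doubled segment sits at slash 
boundaries, so legitimate URLs with merely similar segment names still pass.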

I appreciate that the website in question is, for all intents and purposes, 
"broken", and I'll do my best, but I can't rely on its owners to fix it. Is 
there a better methodology for identifying erroneous URLs? Perhaps they could 
be de-duplicated in the parsing phase, or maybe Nutch could notice that all 
the CSS, JS, images, and so on are 404'ing and somehow "guess" that this is a 
bad page?
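One common heuristic for this class of problem (a "soft 404" detector, not a 
Nutch feature as far as I know, though Nutch's content-signature deduplication 
works on a similar principle) is to fetch a deliberately bogus URL on the 
site, fingerprint the body the server returns with a 200, and then flag any 
"real" page whose body matches that fingerprint. A minimal sketch, with the 
fetching left out:

```python
import hashlib

def fingerprint(body: bytes) -> str:
    # Hash the page body so identical "soft 404" pages collapse to one key.
    return hashlib.sha1(body).hexdigest()

def is_soft_404(body: bytes, error_fp: str) -> bool:
    # error_fp is the fingerprint of the body served for a deliberately
    # bogus URL (e.g. /this-should-404-<random>). If a "real" page hashes
    # to the same value, the server handed us its error page with a 200.
    return fingerprint(body) == error_fp

# Usage sketch: fetch the bogus URL once per site, then test each page.
error_page = b"<html>Sorry, we could not find that page.</html>"
error_fp = fingerprint(error_page)
print(is_soft_404(error_page, error_fp))                   # identical body
print(is_soft_404(b"<html>A real news article</html>", error_fp))
```

An exact-hash comparison only catches byte-identical error pages; if the 
broken page embeds per-URL content, a fuzzier signature (e.g. hashing only 
the main text block) would be needed.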

Thanks!
Craig
