Hi again Craig,

There is a deduplicator in Nutch but it won't prevent you from crawling these URLs infinitely. One option would be to change the URLFilters / Normalisers so that they deal with the repetition of two elements in the path.
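As a hedged sketch of that option: the stock regex-urlfilter.txt that ships with Nutch already has a deny rule for a path segment repeating three or more times, and it could be tightened to catch an immediate two-fold repetition with a rule like `-.*(/[^/]+)\1/`. The snippet below just demonstrates that regex against URLs of the shape you posted; the rule and sample URLs are illustrative, not your exact configuration, so test it against your own seed list before deploying:

```python
import re

# Candidate deny rule for regex-urlfilter.txt, tightened from the stock
# three-repetition rule to catch a segment repeated twice in a row:
#   -.*(/[^/]+)\1/
# The leading "-" in the config file means "drop the URL if it matches".
repeated_segment = re.compile(r".*(/[^/]+)\1/")

urls = [
    # Broken URL with "another-section" duplicated in the path:
    "http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name.html",
    # Legitimate URL with no immediately repeated segment:
    "http://www.example.com/site/news/2014/section/subsection/file/Long-Document-Name.html",
]

for url in urls:
    verdict = "filtered" if repeated_segment.match(url) else "kept"
    print(url, "->", verdict)
```

Note the backreference only fires when a whole slash-delimited segment is immediately repeated, so paths like /section/subsection/ are unaffected.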
How do you run your crawl BTW? Do you use the crawl script?

On 9 July 2014 23:44, Craig Leinoff <[email protected]> wrote:

> For what it's worth, as a result of sending this message I have been able to advance a little bit in this area.
>
> It had seemed to me that new, relevant URLs were indeed being fetched and parsed, so it was confusing as to why Solr had such a small number of documents in its index. I speculated that Solr must be deduplicating regular results.
>
> After diving into the END of my Nutch process logs and testing these "correct-looking" URLs, I see that I may well have been mistaken. The URLs frequently look like this:
>
> http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name.html
> http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name2.html
> http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name3.html
>
> What we should notice here is the repetition of "another-section/another-section". At some point Nutch hits an error page. The server erroneously returns an HTTP 200 code instead of an HTTP 404 "Not Found" code and loads what is, basically, a broken page.
>
> This page has a number of general elements, such as a list of recent news articles, that exist on all pages of the site, and contains an odious feature wherein the URLs of the news articles seem to just be appended onto the current page, thereby generating more URLs to crawl, all of which are identical.
>
> I know that Nutch is configured to filter out URL segments that repeat 3 or more times, but in this case we're already nearing 500,000 URLs to crawl.
>
> I appreciate that the website in question is for all intents and purposes "broken", and I'll do my best, but I can't rely on them to fix it. Is there a better methodology for identifying erroneous URLs?
> Perhaps it can be de-duped in the Parsing phase, or maybe Nutch could see that all the CSS, JS, images, and other assets are 404'ing out and somehow "guess" that this is a bad page?
>
> Thanks!
> Craig

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

