Thanks, interesting to hear! Nutch has no features for spider trap detection (yet). But you can get a long way with complex generic regular expressions, making sure you get rid of almost everything that looks like a calendar feature. Use the existing urlfilter-regex plugin for this.
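For illustration, a regex-urlfilter.txt fragment along these lines could reject typical calendar-style URLs before they enter the crawl. The patterns below are examples to tune for your own crawl, not a ready-made rule set; urlfilter-regex applies rules top-down and the first match wins, with `-` rejecting a URL and `+` accepting it.

```
# Illustrative only: tune these patterns for your crawl.
# Reject query strings that page through calendar dates
-(?i).*[?&](year|month|week|day|date)=[0-9]
# Reject obvious calendar/agenda modules
-(?i).*/(calendar|agenda)/.*
# Accept everything else
+.
```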
Nutch already has a feature that allows you to remap hosts, the urlnormalizer-host plugin, but detection code that generates such a config file has not been added yet. Also, the urlnormalizer-host plugin won't work at very large scale because the list of remapped hosts (which is no more than a Map<String, String>) would eat away your memory. Changing the code to use an FST would make sense as it compresses the data massively, but still, you won't like 100+ MB in each mapper task.

There is also the problem of unknown hosts: they are retried, so that is a problem in itself. You can work around this using the domainblacklist-urlfilter plugin, but again, this is also a manual task to fill. It uses a Set<String> inside, so it has the same issues as I described above. The hostdb tool (uncommitted code) can be used to automatically fill the config file for this plugin.

Markus

-----Original message-----
> From: Dan Kinder <[email protected]>
> Sent: Tuesday 17th June 2014 19:24
> To: [email protected]
> Subject: Re: Clarifications regarding re-crawl and Nutch2 storage
>
> Thanks again for the quick response, see inline.
>
> On Mon, Jun 16, 2014 at 4:08 PM, Markus Jelsma <[email protected]> wrote:
>
> > Hi Dan, see inline again.
> > Markus
> >
> > -----Original message-----
> > > From: Dan Kinder <[email protected]>
> > > Sent: Tuesday 17th June 2014 0:32
> > > To: [email protected]
> > > Subject: Re: Clarifications regarding re-crawl and Nutch2 storage
> > >
> > > Thanks for the response Markus, a few clarification questions below.
> > >
> > > On Mon, Jun 16, 2014 at 3:01 PM, Markus Jelsma <[email protected]> wrote:
> > >
> > > > Hi Dan, please see inline for comments.
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > > -----Original message-----
> > > > > From: Dan Kinder <[email protected]>
> > > > > Sent: Monday 16th June 2014 23:32
> > > > > To: [email protected]
> > > > > Subject: Clarifications regarding re-crawl and Nutch2 storage
> > > > >
> > > > > Hi there,
> > > > >
> > > > > My company currently runs a full-web crawler (focusing on written content, including content from PDFs, Word docs, etc., to support our product). It's fully proprietary (including the indexing solution) and fairly old.
> > > > >
> > > > > We're looking to potentially upgrade and I've been reading quite a bit about Nutch. It seems promising but I have questions I've had trouble finding answers to in the existing wikis and blogs. My apologies if I just haven't dug deep enough on these; feel free to point to resources.
> > > > >
> > > > > 1) The Nutch examples generally seem to update the link database, generate new segments, crawl, then repeat. Can this be done continuously and simultaneously, so that we are constantly using our crawl bandwidth? (I.e. is there an issue generating new segments while crawls and db updates are happening?) I wonder this especially because we're interested in keeping as live a dataset as possible; most of the docs seem to indicate that a large crawl may take on the order of weeks, and thus a new link may not be indexed until the following cycle, a month or two after we grab or inject it.
> > > >
> > > > It is not recommended to overlap crawls and database updates with Nutch. It is possible, but usually also not required. Even with very large websites you must pay attention to politeness; you cannot, or should not, do more than a page every few seconds.
> > > > This still means you can crawl a large amount of pages in a month. It is usually not interesting to re-crawl a specific page more than once a month, except for pages that list new URLs. So once you have crawled the entire site, which may take some time, you are good to go. If you design the setup so it does not do large fetches (segments) and does not spend much time updating databases, you can crawl a lot and still be fresh. For example, our site search platform has continuous crawl cycles that take no longer than 15 minutes; this means that new pages are discovered and indexed within 30 minutes, even for sites that have a few million URLs.
> > >
> > > I can see what you mean, but as I said we are crawling the whole internet (as much as we can get). We do pay attention to politeness rules (and try to go faster for sites that set a low crawl-delay), but given the breadth of sites out there we'd like to be using our bandwidth as continuously as possible.
> > >
> > > Are there any resources out there of people who have done overlapping crawls and DB updates so as to be continuously crawling? You say it's possible but I'm wondering if there are concerns in doing that.
> >
> > I don't know of any resources that discuss this aspect of Nutch. Doing overlapping database updates is tricky because you need to plan it against the time it takes to do one cycle. You can generate more segments and continue crawling. The generate tool has a feature that makes sure the next generate round (without an update) does not generate the same records. The DB has a state: without that feature, if you generate two segments without updating you will get the same segments.
> >
> > It would need careful planning and trial and error to tune it properly.
> > It would be hard to maximize bandwidth continuously, but you can use tricks such as partitioning the DB and having multiple crawlers run at the same time. An idea would be to have one crawler do .com, another one .de, .jp and .nl, one for .net and .org, etc. You already have the data, so you should be able to come up with an evenly spread list of TLDs, or learn along the way. Both versions of Nutch allow you to point to either a path on the file system (1.x) or an address to a database.
> >
> > >
> > > > > 2) I see that Nutch 1 is tied to Hadoop as a backend, vs. Nutch 2 which allows pluggable backends via Gora. Yet I'm getting the (possibly false) impression that HDFS/Hadoop is still somehow involved in Nutch 2 (there's still a crawlDir and such referenced here: http://wiki.apache.org/nutch/Nutch2Cassandra; FYI we're most interested in a Cassandra backend right now). If this is true, how does it play in? Is Hadoop/HDFS used for job distribution and intermediate data while all permanent data is in Cassandra?
> > > >
> > > > Both can run on Hadoop, but Nutch 1.x uses Hadoop sequence files to store data where Nutch 2.x uses Gora to abstract storage. Hadoop's Map/Reduce framework is used in both versions. Both can run fine, but 1.x is considered the main stable distribution and has some more features. At this time 1.x is also still faster than 2.x, but this may not be a problem if your data isn't large. You mention you operate a full-web crawler; does this mean you have billions and billions of records? I do not know how Nutch 2.x with Cassandra will deal with that. Nutch 1.x can deal with it provided that you have powerful hardware, although you would need that anyway.
> > > > If you just have a few million, you wouldn't even need Hadoop to distribute your jobs.
> > >
> > > Yes, we have billions and billions of records, so based on what you're saying I would probably explore Nutch 1.x.
> > >
> > > To clarify, when you say that both Nutch 1.x and 2.x use Hadoop Map/Reduce, you mean that even for 2.x I would need to run a Hadoop cluster that would do the actual crawling in Hadoop map jobs (even if the map jobs simply talk to Cassandra via Gora), is that the case?
> >
> > Well yes, with that amount of records you will need a cluster for sure. Both versions of Nutch rely on map/reduce jobs; they read data in the map phase and write data after the reduce phase. 1.x uses sequence files, 2.x uses Gora and the selected back end. Nutch 1.x does the actual fetching of URLs in mappers and 2.x does it in the reducers. With very large jobs, 2.x is at a disadvantage because of the shuffling of data between map and reduce. You can work around that by doing many smaller jobs.
> >
> > >
> > > > > 3) What is Nutch's behavior for non-200 HTTP codes? More broadly, are there any controls regarding how often to retry previously fetched links (maybe depending on their return code, whether they had changes, pagerank, etc.), and how often to try newly fetched links? My reading so far indicates that with the default 30-day refresh interval we'll simply try to re-crawl every single link every interval; if this is true then it seems like we would often be crawling pages that haven't changed.
> > > >
> > > > Nutch allows for pluggable implementations of a fetch schedule. This allows fine-grained control over rescheduling behaviour. We ship with a default and also an adaptive scheduler; that one will, for example, recrawl frequently changing pages more frequently.
> > > > The downside is that it will also recrawl overview (or hub) pages more frequently. Although they allow you to discover new content, you only need to crawl them once, except for the first overview page that lists very recently added content. But using parser plugins that can detect such pages and set some values, and a custom fetch schedule, you can solve such problems.
> > >
> > > Thanks, I didn't see the AdaptiveFetchSchedule (found the docs here: http://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/crawl/AdaptiveFetchSchedule.html), that does mostly answer my question.
> > >
> > > I also just found a page that helped me regarding the "failed" fetches: http://wiki.apache.org/nutch/CrawlDatumStates. It's a little old but I'm assuming the behavior hasn't changed much. Seems like for 404, 500, or other non-successful fetches it will stop trying after a few times.
> >
> > Nutch will in the end always retry everything, unless you make some adaptation not to do it.
> >
> > I am curious: if you crawl so much, you cannot manually maintain rules per site. So how do you currently deal with stuff like crawler traps and duplicate hosts? We have found that crawler traps are an interesting problem; you can either cluster them and classify those clusters, or simply limit the depth from the / of a website. But if you limit the depth, you will never discover, for example, very old forum posts that are hidden at the 256th page of a thread overview. And how do you currently deal with duplicate hosts? Many websites have the typical problem of www. and non-www. addresses; for large websites this means a million additional useless records and data etc. Also, many websites have more interesting hostname duplicates.
> > We have seen many sites having a dozen different hostnames for the same content; some crazy webmasters even (maybe deliberately) generate thousands! We have also seen that many adult websites generate thousands of hostnames for the same content.
> >
> > We have found that addressing these horrible issues of the web solved so many problems: much less IO, CPU, RAM and bandwidth is being wasted. Maximising bandwidth would be the last thing on my list, because what is the use if you download so much rubbish?
>
> Regarding duplicate hosts: we have found this to be a problem particularly with subdomains (not so much with TLDs). We handle this basically by giving fair crawl coverage at a TLD level. Think of it like handling the subdomain as just a part of the path. This disadvantages us at covering sites with many paths and subdomains, but helps us not get trapped in sites with millions of subdomains. We also do have some intelligent rules that try to figure out if a domain and its subdomains have all the same path content (e.g. the www. and non-www. case).
>
> Regarding crawler traps, that's much harder as you say; we're still trying to mitigate that difficulty.
>
> Does Nutch do anything out of the box to handle these things?
>
> > > > > Thanks!
> > > > > -dan
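To make the host-remapping discussion above concrete, here is a minimal sketch of the idea: a Map<String, String> from duplicate hostname to canonical hostname, which is essentially what the urlnormalizer-host plugin keeps in memory and exactly what becomes the memory problem Markus describes at web scale. The class and method names are illustrative, not the plugin's actual API, and the entries are hypothetical examples.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the urlnormalizer-host plugin's real code:
// remap duplicate hostnames (www., m., mirror hosts) onto one canonical
// hostname so the crawl DB holds a single record set per site.
public class HostRemapSketch {
    private final Map<String, String> remap = new HashMap<>();

    public HostRemapSketch() {
        // One entry per duplicate hostname. At web scale, millions of
        // such entries held in every mapper task are the memory problem
        // described above; an FST would compress this massively.
        remap.put("www.example.com", "example.com");
        remap.put("m.example.com", "example.com");
    }

    // Unknown hosts pass through unchanged, which is why they keep
    // being retried unless filtered separately.
    public String normalize(String host) {
        return remap.getOrDefault(host, host);
    }

    public static void main(String[] args) {
        HostRemapSketch s = new HostRemapSketch();
        System.out.println(s.normalize("www.example.com")); // example.com
        System.out.println(s.normalize("other.org"));       // other.org
    }
}
```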

