Hi Dan, please see inline for comments.

Regards,
Markus

-----Original message-----
> From: Dan Kinder <[email protected]>
> Sent: Monday 16th June 2014 23:32
> To: [email protected]
> Subject: Clarifications regarding re-crawl and Nutch2 storage
>
> Hi there,
>
> My company currently runs a full-web crawler (focusing on written content
> including content from PDFs, word docs, etc. to support our product). It's
> fully proprietary (including the indexing solution) and fairly old.
>
> We're looking to potentially upgrade and I've been reading quite a bit
> about Nutch. It seems promising but I have questions I've had trouble
> finding answers to in the existing wikis and blogs. My apologies if I just
> haven't dug deep enough on these; feel free to point to resources.
>
> 1) The Nutch examples generally seem to update the link database, generate
> new segments, crawl, then repeat. Can this be done continuously and
> simultaneously, so that we are constantly using our crawl bandwidth? (I.e.
> is there an issue generating new segments while crawls and db updates are
> happening?) I wonder this especially because we're interested in keeping as
> live a dataset as possible; most of the docs seem to indicate that a large
> crawl may take on the order of weeks, and thus a new link may not be
> indexed until the following cycle a month or two after we grab or inject it.
Overlapping crawls and database updates is not recommended with Nutch. It is possible, but usually not required either. Even with very large websites you must pay attention to politeness: you cannot, or should not, fetch more than a page every few seconds from a given host. That still lets you crawl a large number of pages in a month. It is usually not interesting to re-crawl a specific page more than once a month, except for pages that list new URLs. So once you have crawled the entire site, which may take some time, you are good to go. If you design the setup so it does not do large fetches (segments) and does not spend much time updating databases, you can crawl a lot and still stay fresh. For example, our site search platform runs continuous crawl cycles that take no longer than 15 minutes, which means new pages are discovered and indexed within 30 minutes, even for sites that have a few million URLs.

> 2) I see that Nutch 1 is tied to Hadoop as a backend, vs. Nutch 2 which
> allows pluggable backends via Gora. Yet I'm getting the (possibly false)
> impression that HDFS/Hadoop is still somehow involved in Nutch 2 (there's
> still a crawlDir and such referenced here:
> http://wiki.apache.org/nutch/Nutch2Cassandra, FYI we're most interested in
> a Cassandra backend right now). If this is true how does it play in? Is
> Hadoop/HDFS used for job distribution and intermediate data while all
> permanent data is in Cassandra?

Both can run on Hadoop, but Nutch 1.x uses Hadoop sequence files to store data, whereas Nutch 2.x uses Gora to abstract storage. Hadoop's Map/Reduce framework is used in both versions. Both run fine, but 1.x is considered the main stable distribution and has some more features. At this time 1.x is also still faster than 2.x, though that may not be a problem if your data isn't large. You mention you operate a full-web crawler; does this mean you have billions of records?
I do not know how Nutch 2.x with Cassandra will deal with that. Nutch 1.x can deal with it provided that you have powerful hardware, although you would need that anyway. If you just have a few million records, you wouldn't even need Hadoop to distribute your jobs.

> 3) What is Nutch's behavior for non-200 HTTP codes? More broadly, are there
> any controls regarding how often to retry previously fetched links (maybe
> depending on their return code, whether they had changes, pagerank, etc.),
> and how often to try newly fetched links? My reading so far indicates that
> with the default 30-day refresh interval we'll simply try to re-crawl every
> single link every interval; if this is true then it seems like we would
> often be crawling pages that haven't changed.

Nutch allows for pluggable implementations of a fetch schedule, which gives you fine-grained control over rescheduling behaviour. We ship with a default scheduler and an adaptive scheduler; the latter will, for example, recrawl frequently changing pages more often. The downside is that it will also recrawl overview (or hub) pages more frequently. Although those pages allow you to discover new content, you only need to crawl them once, except for the first overview page, which lists very recently added content. But using parser plugins that can detect such pages and set some values, plus a custom fetch schedule, you can solve such problems.

>
> Thanks!
> -dan
>
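P.S. The adaptive rescheduling idea mentioned above can be sketched roughly as follows. This is a simplified illustration, not Nutch's actual AdaptiveFetchSchedule code, and the rate and interval values are made-up assumptions rather than Nutch defaults:

```python
# Sketch of adaptive re-fetch scheduling: shrink the interval when a page
# changed since the last fetch, grow it when it did not, clamped between
# a minimum and maximum. All constants below are illustrative assumptions.

MIN_INTERVAL = 60 * 60            # 1 hour, in seconds (hypothetical floor)
MAX_INTERVAL = 90 * 24 * 60 * 60  # 90 days (hypothetical ceiling)
INC_RATE = 0.4                    # growth factor when the page was unchanged
DEC_RATE = 0.2                    # shrink factor when the page changed

def next_interval(current_interval: float, page_changed: bool) -> float:
    """Return the next re-fetch interval for a page, in seconds."""
    if page_changed:
        interval = current_interval * (1.0 - DEC_RATE)
    else:
        interval = current_interval * (1.0 + INC_RATE)
    # Keep the interval within the configured bounds.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
```

With this behaviour a frequently changing page converges toward the minimum interval, while a page that never changes drifts toward the maximum, so static pages are fetched less and less often.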

