Thanks for the response, Markus; a few clarification questions below.

On Mon, Jun 16, 2014 at 3:01 PM, Markus Jelsma <[email protected]> wrote:
> Hi Dan, please see inline for comments.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Dan Kinder <[email protected]>
> > Sent: Monday 16th June 2014 23:32
> > To: [email protected]
> > Subject: Clarifications regarding re-crawl and Nutch2 storage
> >
> > Hi there,
> >
> > My company currently runs a full-web crawler (focusing on written
> > content, including content from PDFs, Word docs, etc., to support our
> > product). It's fully proprietary (including the indexing solution) and
> > fairly old.
> >
> > We're looking to potentially upgrade, and I've been reading quite a bit
> > about Nutch. It seems promising, but I have questions I've had trouble
> > finding answers to in the existing wikis and blogs. My apologies if I
> > just haven't dug deep enough on these; feel free to point to resources.
> >
> > 1) The Nutch examples generally seem to update the link database,
> > generate new segments, crawl, then repeat. Can this be done continuously
> > and simultaneously, so that we are constantly using our crawl bandwidth?
> > (I.e., is there an issue generating new segments while crawls and db
> > updates are happening?) I wonder this especially because we're
> > interested in keeping as live a dataset as possible; most of the docs
> > seem to indicate that a large crawl may take on the order of weeks, and
> > thus a new link may not be indexed until the following cycle, a month or
> > two after we grab or inject it.
>
> It is not recommended to overlap crawls and database updates with Nutch.
> It is possible, but usually also not required. Even with very large
> websites you must pay attention to politeness: you cannot, or should not,
> fetch more than a page every few seconds from a given host. That still
> means you can crawl a large number of pages in a month. It is usually not
> interesting to re-crawl a specific page more than once a month, except
> for pages that list new URLs.
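To make sure I follow the politeness math: with enough distinct hosts, a per-host delay shouldn't have to idle the fetcher as a whole. Here's a toy simulation I put together of just the per-host queueing idea (this is not Nutch's actual fetcher, only an illustration; host names and the 5-second delay are made up):

```python
import heapq

def schedule_fetches(urls_by_host, crawl_delay, duration):
    """Simulate a polite fetcher: each host may be fetched again only
    after crawl_delay seconds. Returns the list of (time, host) fetches
    performed, in order. Note: consumes the per-host URL lists."""
    # min-heap of (next-allowed-fetch time, host); all hosts ready at t=0
    ready = [(0.0, host) for host in sorted(urls_by_host)]
    heapq.heapify(ready)
    fetches = []
    while ready:
        t, host = heapq.heappop(ready)
        if t >= duration:
            break
        if urls_by_host[host]:
            urls_by_host[host].pop()
            fetches.append((t, host))
            if urls_by_host[host]:
                # this host must now wait out its crawl delay, but
                # other hosts in the heap can be fetched in the meantime
                heapq.heappush(ready, (t + crawl_delay, host))
    return fetches
```

With two hosts and a 5-second delay, each host is hit at most every 5 seconds, yet the fetcher as a whole stays busy; scale the host count up and the per-host delay stops mattering for total throughput, which is the situation we're in with a broad crawl.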
> So once you have crawled the entire site, which may take some time, you
> are good to go. If you design the setup so it does not do large fetches
> (segments) and does not spend much time updating databases, you can crawl
> a lot and still be fresh. For example, our site search platform has
> continuous crawl cycles that take no longer than 15 minutes; this means
> that new pages are discovered and indexed within 30 minutes, even for
> sites that have a few million URLs.

I can see what you mean, but as I said, we are crawling the whole internet
(as much of it as we can get). We do pay attention to politeness rules
(and try to go faster for sites that set a low crawl-delay), but given the
breadth of sites out there we'd like to be using our bandwidth as
continuously as possible. Are there any resources from people who have
done overlapping crawls and DB updates so as to be continuously crawling?
You say it's possible, but I'm wondering what the concerns are in doing
that.

> > 2) I see that Nutch 1 is tied to Hadoop as a backend, vs. Nutch 2,
> > which allows pluggable backends via Gora. Yet I'm getting the (possibly
> > false) impression that HDFS/Hadoop is still somehow involved in Nutch 2
> > (there's still a crawlDir and such referenced here:
> > http://wiki.apache.org/nutch/Nutch2Cassandra; FYI, we're most
> > interested in a Cassandra backend right now). If this is true, how does
> > it play in? Is Hadoop/HDFS used for job distribution and intermediate
> > data while all permanent data is in Cassandra?
>
> Both can run on Hadoop, but Nutch 1.x uses Hadoop sequence files to store
> data, where Nutch 2.x uses Gora to abstract storage. Hadoop's Map/Reduce
> framework is used in both versions. Both can run fine, but 1.x is
> considered the main stable distribution and has some more features. At
> this time 1.x is also still faster than 2.x, but this may not be a
> problem if your data isn't large.
> You mention you operate a full-web crawler; does this mean you have
> billions of records? I do not know how Nutch 2.x with Cassandra will deal
> with that. Nutch 1.x can deal with it, provided that you have powerful
> hardware, although you would need that anyway. If you just have a few
> million, you wouldn't even need Hadoop to distribute your jobs.

Yes, we have billions of records, so based on what you're saying I would
probably explore Nutch 1.x. To clarify, when you say that both Nutch 1.x
and 2.x use Hadoop Map/Reduce, do you mean that even for 2.x I would need
to run a Hadoop cluster that does the actual crawling in Hadoop map jobs
(even if the map jobs simply talk to Cassandra via Gora)?

> > 3) What is Nutch's behavior for non-200 HTTP codes? More broadly, are
> > there any controls regarding how often to retry previously fetched
> > links (maybe depending on their return code, whether they had changes,
> > pagerank, etc.), and how often to try newly fetched links? My reading
> > so far indicates that with the default 30-day refresh interval we'll
> > simply try to re-crawl every single link every interval; if this is
> > true then it seems like we would often be crawling pages that haven't
> > changed.
>
> Nutch allows for pluggable implementations of a fetch schedule, which
> gives fine-grained control over rescheduling behaviour. We ship with a
> default scheduler and also an adaptive scheduler; the latter will, for
> example, recrawl frequently changing pages more often. The downside is
> that it will also recrawl overview (or hub) pages more frequently.
> Although they allow you to discover new content, you only need to crawl
> them once, except for the first overview page that lists very recently
> added content. But using parser plugins that can detect such pages and
> set some values, plus a custom fetch schedule, you can solve such
> problems.
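If I've understood the adaptive scheduler right, its core is an interval update rule along these lines. A Python paraphrase for my own notes (the real implementation is Java, and the rates and bounds below are my assumptions about the configurable db.fetch.schedule.adaptive.* defaults, so treat them as placeholders):

```python
# Paraphrase of an adaptive re-fetch interval rule: shrink the interval
# when a page changed, grow it when it did not, clamped to sane bounds.
# The specific constants are assumed defaults, not verified values.
INC_RATE = 0.4                    # back off when a page did not change
DEC_RATE = 0.2                    # revisit sooner when a page changed
MIN_INTERVAL = 60.0               # floor, in seconds
MAX_INTERVAL = 365.0 * 24 * 3600  # ceiling: one year, in seconds

def next_interval(interval, changed):
    """Return the new re-fetch interval (seconds) after one fetch."""
    if changed:
        interval *= 1.0 - DEC_RATE
    else:
        interval *= 1.0 + INC_RATE
    return min(max(interval, MIN_INTERVAL), MAX_INTERVAL)
```

So a page that keeps changing converges toward the minimum interval, while a static page backs off geometrically toward the maximum, which matches the hub-page downside you describe: a hub that changes on every visit keeps getting recrawled aggressively.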
Thanks, I didn't see the AdaptiveFetchSchedule (found the docs here:
http://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/crawl/AdaptiveFetchSchedule.html);
that mostly answers my question. I also just found a page that helped me
regarding "failed" fetches: http://wiki.apache.org/nutch/CrawlDatumStates.
It's a little old, but I'm assuming the behavior hasn't changed much. It
seems that for 404, 500, or other non-successful fetches it will stop
retrying after a few attempts.

> > Thanks!
> > -dan
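P.S. For my own notes, the retry behavior on that wiki page seems to boil down to something like this simplified model. This is only my reading, not Nutch's actual code path (the real mapping from protocol status to CrawlDatum state is richer), and MAX_RETRIES mirrors what I understand to be the db.fetch.retry.max default:

```python
# Simplified model of retry bookkeeping for failed fetches: transient
# failures bump a retry counter; past a maximum, the URL is marked gone
# and no longer scheduled. Status names echo the CrawlDatum states;
# the 404-vs-5xx split and the max of 3 are assumptions.
MAX_RETRIES = 3  # assumed db.fetch.retry.max default

def after_fetch(retries, http_code):
    """Return (crawldb_status, retries) after one fetch attempt."""
    if 200 <= http_code < 300:
        return "db_fetched", 0             # success resets the counter
    if http_code in (404, 410):
        return "db_gone", retries          # treated as permanently gone
    # transient failure (e.g. 5xx or a timeout): retry a few times
    retries += 1
    if retries > MAX_RETRIES:
        return "db_gone", retries          # give up after too many tries
    return "db_unfetched", retries
```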

