Hi there,

My company currently runs a full-web crawler to support our product, focusing on written content, including text extracted from PDFs, Word docs, etc. The crawler is fully proprietary (including the indexing solution) and fairly old.
We're looking to potentially upgrade, and I've been reading quite a bit about Nutch. It seems promising, but I have some questions I've had trouble finding answers to in the existing wikis and blogs. My apologies if I just haven't dug deep enough; feel free to point me to resources.

1) The Nutch examples generally update the link database, generate new segments, fetch, then repeat. Can these steps run continuously and simultaneously, so that we're constantly using our crawl bandwidth? That is, is there any issue with generating new segments while fetches and db updates are in progress? I ask especially because we want to keep the dataset as live as possible: most of the docs suggest a large crawl can take on the order of weeks, so a newly discovered link might not be indexed until the following cycle, a month or two after we grab or inject it.

2) I see that Nutch 1 is tied to Hadoop as a backend, whereas Nutch 2 allows pluggable backends via Gora. Yet I'm getting the (possibly false) impression that HDFS/Hadoop is still somehow involved in Nutch 2; there's still a crawlDir and such referenced at http://wiki.apache.org/nutch/Nutch2Cassandra. (FYI, we're most interested in a Cassandra backend right now.) If that's true, how does Hadoop fit in? Is Hadoop/HDFS used for job distribution and intermediate data, while all permanent data lives in Cassandra?

3) What is Nutch's behavior for non-200 HTTP codes? More broadly, are there any controls over how often previously fetched links are retried (perhaps depending on their return code, whether they changed, PageRank, etc.), and how often newly discovered links are tried? My reading so far suggests that with the default 30-day refresh interval, Nutch simply re-crawls every single link each interval; if that's true, it seems we would often be crawling pages that haven't changed.

Thanks!
-dan
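P.S. For concreteness, the cycle I'm referring to in question 1 is the step-at-a-time Nutch 1.x command sequence from the tutorial; the paths and -topN value here are just examples:

```sh
# One iteration of the standard Nutch crawl cycle (Nutch 1.x CLI).
# "crawl/crawldb", "crawl/segments", and "urls" are example paths.
bin/nutch inject crawl/crawldb urls              # seed the crawldb (first run only)
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=`ls -d crawl/segments/* | tail -1`       # newest segment just generated
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT        # feed newly discovered links back in
```

And the refresh behavior I mean in question 3 appears to be governed by properties along these lines in nutch-site.xml (my reading of the default config; please correct me if I've misread it):

```xml
<!-- Re-fetch interval: the default is 30 days (2592000 s) for every URL. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
</property>
<!-- Switching to the adaptive schedule supposedly lengthens the interval
     for pages that don't change and shortens it for pages that do. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
```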

