Hi Dan, please see inline for comments.

Regards,
Markus

-----Original message-----
> From: Dan Kinder <[email protected]>
> Sent: Monday 16th June 2014 23:32
> To: [email protected]
> Subject: Clarifications regarding re-crawl and Nutch2 storage
>
> Hi there,
>
> My company currently runs a full-web crawler (focusing on written content
> including content from PDFs, word docs, etc. to support our product). It's
> fully proprietary (including the indexing solution) and fairly old.
>
> We're looking to potentially upgrade and I've been reading quite a bit
> about Nutch. It seems promising but I have questions I've had trouble
> finding answers to in the existing wikis and blogs. My apologies if I just
> haven't dug deep enough on these; feel free to point to resources.
>
> 1) The Nutch examples generally seem to update the link database, generate
> new segments, crawl, then repeat. Can this be done continuously and
> simultaneously, so that we are constantly using our crawl bandwidth? (I.e.
> is there an issue generating new segments while crawls and db updates are
> happening?) I wonder this especially because we're interested in keeping as
> live a dataset as possible; most of the docs seem to indicate that a large
> crawl may take on the order of weeks, and thus a new link may not be
> indexed until the following cycle a month or two after we grab or inject it.
Overlapping crawls and database updates is not recommended with Nutch. It is possible, but usually not required either. Even with very large websites you must pay attention to politeness: you cannot, or should not, fetch more than a page every few seconds from a given host. That still lets you crawl a large number of pages in a month. It is usually not interesting to re-crawl a specific page more than once a month, except for pages that list new URLs. So once you have crawled the entire site, which may take some time, you are good to go. If you design the setup so it does not do large fetches (segments) and does not spend much time updating databases, you can crawl a lot and still stay fresh. For example, our site search platform runs continuous crawl cycles that take no longer than 15 minutes, which means new pages are discovered and indexed within 30 minutes, even for sites that have a few million URLs.

> 2) I see that Nutch 1 is tied to Hadoop as a backend, vs. Nutch 2 which
> allows pluggable backends via Gora. Yet I'm getting the (possibly false)
> impression that HDFS/Hadoop is still somehow involved in Nutch 2 (there's
> still a crawlDir and such referenced here:
> http://wiki.apache.org/nutch/Nutch2Cassandra, FYI we're most interested in
> a Cassandra backend right now). If this is true how does it play in? Is
> Hadoop/HDFS used for job distribution and intermediate data while all
> permanent data is in Cassandra?

Both can run on Hadoop, but Nutch 1.x uses Hadoop sequence files to store data, whereas Nutch 2.x uses Gora to abstract storage. Hadoop's Map/Reduce framework is used in both versions. Both run fine, but 1.x is considered the main stable distribution and has some more features. At this time 1.x is also still faster than 2.x, though that may not be a problem if your data isn't large. You mention you operate a full-web crawler; does this mean you have billions of records?
I do not know how Nutch 2.x with Cassandra will deal with that. Nutch 1.x can deal with it provided that you have powerful hardware, although you would need that anyway. If you just have a few million records, you wouldn't even need Hadoop to distribute your jobs.

> 3) What is Nutch's behavior for non-200 HTTP codes? More broadly, are there
> any controls regarding how often to retry previously fetched links (maybe
> depending on their return code, whether they had changes, pagerank, etc.),
> and how often to try newly fetched links? My reading so far indicates that
> with the default 30-day refresh interval we'll simply try to re-crawl every
> single link every interval; if this is true then it seems like we would
> often be crawling pages that haven't changed.

Nutch allows for pluggable implementations of a fetch schedule, which gives you fine-grained control over rescheduling behaviour. We ship with a default scheduler and an adaptive scheduler; the latter will, for example, recrawl frequently changing pages more often. The downside is that it will also recrawl overview (or hub) pages more frequently. Although those pages allow you to discover new content, you only need to crawl them once, except for the first overview page, which lists very recently added content. But using parser plugins that can detect such pages and set some values, plus a custom fetch schedule, you can solve such problems.

>
> Thanks!
> -dan
>
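P.S. The adaptive rescheduling idea mentioned above can be sketched roughly as follows. This is a simplified illustration, not Nutch's actual AdaptiveFetchSchedule code, and the rate and interval values are made-up assumptions rather than Nutch defaults:

```python
# Sketch of adaptive re-fetch scheduling: shrink the interval when a page
# changed since the last fetch, grow it when it did not, clamped between
# a minimum and maximum. All constants below are illustrative assumptions.

MIN_INTERVAL = 60 * 60            # 1 hour, in seconds (hypothetical floor)
MAX_INTERVAL = 90 * 24 * 60 * 60  # 90 days (hypothetical ceiling)
INC_RATE = 0.4                    # growth factor when the page was unchanged
DEC_RATE = 0.2                    # shrink factor when the page changed

def next_interval(current_interval: float, page_changed: bool) -> float:
    """Return the next re-fetch interval for a page, in seconds."""
    if page_changed:
        interval = current_interval * (1.0 - DEC_RATE)
    else:
        interval = current_interval * (1.0 + INC_RATE)
    # Keep the interval within the configured bounds.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
```

With this behaviour a frequently changing page converges toward the minimum interval, while a page that never changes drifts toward the maximum, so static pages are fetched less and less often.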

