Hi Dan, see inline again.
Markus
 
-----Original message-----
> From: Dan Kinder <[email protected]>
> Sent: Tuesday 17th June 2014 0:32
> To: [email protected]
> Subject: Re: Clarifications regarding re-crawl and Nutch2 storage
> 
> Thanks for the response Markus, a few clarification questions below.
> 
> On Mon, Jun 16, 2014 at 3:01 PM, Markus Jelsma <[email protected]>
> wrote:
> 
> > Hi Dan, please see inline for comments.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From: Dan Kinder <[email protected]>
> > > Sent: Monday 16th June 2014 23:32
> > > To: [email protected]
> > > Subject: Clarifications regarding re-crawl and Nutch2 storage
> > >
> > > Hi there,
> > >
> > > My company currently runs a full-web crawler (focusing on written content
> > > including content from PDFs, word docs, etc. to support our product).
> > It's
> > > fully proprietary (including the indexing solution) and fairly old.
> > >
> > > We're looking to potentially upgrade and I've been reading quite a bit
> > > about Nutch. It seems promising but I have questions I've had trouble
> > > finding answers to in the existing wikis and blogs. My apologies if I
> > just
> > > haven't dug deep enough on these; feel free to point to resources.
> > >
> > > 1) The Nutch examples generally seem to update the link database,
> > generate
> > > new segments, crawl, then repeat. Can this be done continuously and
> > > simultaneously, so that we are constantly using our crawl bandwidth?
> > (I.e.
> > > is there an issue generating new segments while crawls and db updates are
> > > happening?) I wonder this especially because we're interested in keeping
> > as
> > > live a dataset as possible; most of the docs seem to indicate that a
> > large
> > > crawl may take on the order of weeks, and thus a new link may not be
> > > indexed until the following cycle a month or two after we grab or inject
> > it.
> >
> > Overlapping crawls and database updates is not recommended with Nutch,
> > though it is possible; it is usually also not required. Even with very
> > large websites you must pay attention to politeness: you cannot, or
> > should not, fetch more than a page every few seconds. That still lets
> > you crawl a large number of pages in a month. It is usually not
> > interesting to re-crawl a specific page more than once a month, except
> > for pages that list new URLs. So once you have crawled the entire site,
> > which may take some time, you are good to go. If you design the set-up
> > so it does not do large fetches (segments) and does not spend much time
> > updating databases, you can crawl a lot and still stay fresh. For
> > example, our site search platform runs continuous crawl cycles that
> > take no longer than 15 minutes, which means new pages are discovered
> > and indexed within 30 minutes, even for sites with a few million URLs.
> >
> 
> I can see what you mean, but as I said we are crawling the whole internet
> (as much as we can get). We do pay attention to politeness rules (and try
> to go faster for sites that set a low crawl-delay), but given the breadth
> of sites out there we'd like to be using our bandwidth as continuously as
> possible.
> 
> Are there any resources out there of people who have done overlapping
> crawls and DB updates so as to be continuously crawling? You say it's
> possible but I'm wondering if there are concerns in doing that.

I don't know of any resources that discuss this aspect of Nutch. Doing 
overlapping database updates is tricky because you need to plan it around the 
time one cycle takes. You can generate more segments and continue crawling: 
the generate tool has a feature that makes sure the next generate round 
(without an update in between) does not select the same records. The DB has 
state; without that feature, generating two segments without updating would 
give you the same segments.
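
For intuition, here is a toy model (plain Python, not Nutch code) of what that
generator feature does. The property and marker names (generate.update.crawldb,
_ngt_) are taken from Nutch 1.x's defaults as I remember them; treat the sketch
as an illustration of the idea, not the real implementation:

```python
import time

def generate(crawldb, top_n, update_crawldb=True, now=None):
    """Toy model of Nutch's generate step.

    crawldb maps url -> record dict. A record is due when its fetch_time
    has passed and it carries no generate marker (here called _ngt_,
    after Nutch's marker of the same name).
    """
    now = now or time.time()
    due = [
        url for url, rec in crawldb.items()
        if rec["fetch_time"] <= now and "_ngt_" not in rec
    ]
    segment = sorted(due)[:top_n]
    if update_crawldb:
        # Mirrors generate.update.crawldb=true: mark the selected records
        # so an overlapping generate (before updatedb runs) skips them.
        for url in segment:
            crawldb[url]["_ngt_"] = now
    return segment

db = {u: {"fetch_time": 0} for u in ("a", "b", "c", "d")}
first = generate(db, top_n=2)
second = generate(db, top_n=2)   # overlapping generate, no updatedb yet
# With marking on, the two segments are disjoint; with it off,
# both generates would select the same records.
```

With `update_crawldb=False` the second call would return the same two records
as the first, which is exactly the duplicate-segment problem described above.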

It would need careful planning and trial and error to tune it properly. It 
would be hard to continuously maximize bandwidth, but you can use tricks such 
as partitioning the DB and running multiple crawlers at the same time. One 
idea would be to have one crawler do .com, another .de, .jp and .nl, one .net 
and .org, etc. You already have the data, so you should be able to come up 
with an evenly spread list of TLDs, or learn along the way. Both versions of 
Nutch let you point to either a path on the file system (1.x) or the address 
of a database (2.x).
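
A minimal sketch of that TLD-partitioning idea (the partition names and TLD
assignments are made up for illustration; the even spread would come from your
own data):

```python
from urllib.parse import urlsplit

def tld_partition(urls, partitions):
    """Assign each URL to a crawler partition keyed by its TLD.

    `partitions` maps a partition name to the set of TLDs it handles;
    anything unlisted lands in a catch-all "other" partition.
    """
    tld_to_part = {tld: name for name, tlds in partitions.items() for tld in tlds}
    buckets = {name: [] for name in partitions}
    buckets["other"] = []  # assumes no partition is itself named "other"
    for url in urls:
        tld = urlsplit(url).hostname.rsplit(".", 1)[-1]
        buckets[tld_to_part.get(tld, "other")].append(url)
    return buckets

parts = {"crawler-1": {"com"}, "crawler-2": {"de", "jp", "nl"},
         "crawler-3": {"net", "org"}}
buckets = tld_partition(
    ["http://example.com/a", "http://example.de/b", "http://example.org/c"],
    parts)
```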
 
> 
> >
> > > 2) I see that Nutch 1 is tied to Hadoop as a backend, vs. Nutch 2 which
> > > allows pluggable backends via Gora. Yet I'm getting the (possibly false)
> > > impression that HDFS/Hadoop is still somehow involved in Nutch 2 (there's
> > > still a crawlDir and such referenced here:
> > > http://wiki.apache.org/nutch/Nutch2Cassandra, FYI we're most interested
> > in
> > > a Cassandra backend right now). If this is true how does it play in? Is
> > > Hadoop/HDFS used for job distribution and intermediate data while all
> > > permanent data is in Cassandra?
> >
> > Both can run on Hadoop, but Nutch 1.x uses Hadoop sequence files to store
> > data where Nutch 2.x uses Gora to abstract storage. Hadoop's Map/Reduce
> > framework is used in both versions. Both run fine, but 1.x is considered
> > the main stable distribution and has some more features. At this time 1.x
> > is also still faster than 2.x, but this may not be a problem if your data
> > isn't large.
> > You mention you operate a full-web crawler; does this mean you have billions
> > of billions of records? I do not know how Nutch 2.x with Cassandra will
> > deal with that, Nutch 1.x can deal with it provided that you have powerful
> > hardware, although you would need that anyway. If you just have a few
> > million, you wouldn't even need Hadoop to distribute your jobs.
> >
> 
> Yes we have billions of billions of records, so based on what you're saying
> I would probably explore Nutch 1.x.
> 
> To clarify, when you say that both Nutch 1.x and 2.x use Hadoop Map/Reduce,
> you mean that even for 2.x I would need to run a Hadoop cluster that would
> do the actual crawling in Hadoop map jobs (even if the Map jobs simply talk
> to Cassandra via GORA), is that the case?

Well yes, with that number of records you will certainly need a cluster. Both 
versions of Nutch rely on map/reduce jobs: they read data in the map phase and 
write data after the reduce phase. 1.x uses sequence files; 2.x uses Gora and 
the selected back end. Nutch 1.x does the actual fetching of URLs in the 
mappers, 2.x does it in the reducers. With very large jobs, 2.x is at a 
disadvantage because of the shuffling of data between the map and reduce 
phases. You can work around that by running many smaller jobs.
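
The "many smaller jobs" workaround amounts to capping segment size while
keeping each host in a single segment, so per-host politeness can still be
enforced inside one fetcher task. A hedged sketch of that split (toy code,
not how Nutch actually partitions; it ignores the case of a single host
larger than the cap):

```python
from itertools import groupby
from urllib.parse import urlsplit

def small_segments(urls, max_size):
    """Split a fetch list into segments of roughly max_size URLs,
    never splitting one host across two segments."""
    host_of = lambda u: urlsplit(u).hostname
    by_host = sorted(urls, key=host_of)
    segments, current = [], []
    for host, group in groupby(by_host, key=host_of):
        group = list(group)
        if current and len(current) + len(group) > max_size:
            segments.append(current)
            current = []
        current.extend(group)
    if current:
        segments.append(current)
    return segments

urls = (["http://a.com/%d" % i for i in range(3)]
        + ["http://b.com/%d" % i for i in range(2)]
        + ["http://c.com/%d" % i for i in range(2)])
segs = small_segments(urls, max_size=4)
```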

> 
> 
> > > 3) What is Nutch's behavior for non-200 HTTP codes? More broadly, are
> > there
> > > any controls regarding how often to retry previously fetched links (maybe
> > > depending on their return code, whether they had changes, pagerank,
> > etc.),
> > > and how often to try newly fetched links? My reading so far indicates
> > that
> > > with the default 30-day refresh interval we'll simply try to re-crawl
> > every
> > > single link every interval; if this is true then it seems like we would
> > > often be crawling pages that haven't changed.
> >
> > Nutch allows pluggable implementations of a fetch schedule, giving
> > fine-grained control over rescheduling behaviour. We ship with a default
> > and an adaptive scheduler; the latter will, for example, recrawl
> > frequently changing pages more often. The downside is that it will also
> > recrawl overview (or hub) pages more frequently. Although they let you
> > discover new content, you only need to crawl them once, except for the
> > first overview page, which lists very recently added content. But using
> > parser plugins that can detect such pages and set some values, plus a
> > custom fetch schedule, you can solve such problems.
> >
> 
> Thanks I didn't see the AdaptiveFetchSchedule (found the docs here:
> http://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/crawl/AdaptiveFetchSchedule.html),
> that does mostly answer my question.
> 
> I also just found a page that helped me regarding the "failed" fetches:
> http://wiki.apache.org/nutch/CrawlDatumStates. It's a little old but I'm
> assuming the behavior hasn't changed much. Seems like for 404, 500, or
> other non-successful fetches it will stop trying after a few times.

In the end, Nutch will always retry everything, unless you make some 
adaptation to stop it from doing so.
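
Roughly, the default behaviour works like the toy schedule below: transient
failures bump a retry counter up to a maximum, after which the record is
marked gone, but gone records are only pushed far into the future, not
dropped. The property names in the comments (db.fetch.retry.max,
db.fetch.interval.max) and the default values are my recollection of Nutch
1.x's defaults, so verify them against your nutch-default.xml:

```python
RETRY_MAX = 3                    # cf. db.fetch.retry.max (assumed default)
DEFAULT_INTERVAL = 30 * 24 * 3600
MAX_INTERVAL = 90 * 24 * 3600    # cf. db.fetch.interval.max (assumed default)

def reschedule(record, status):
    """Toy version of a fetch schedule's failure handling."""
    if status == "success":
        record["state"] = "ok"
        record["retries"] = 0
        record["interval"] = DEFAULT_INTERVAL
    elif status == "transient":   # e.g. 500, timeout
        record["retries"] += 1
        if record["retries"] >= RETRY_MAX:
            record["state"] = "gone"
            record["interval"] = MAX_INTERVAL
    elif status == "gone":        # e.g. 404
        record["state"] = "gone"
        record["interval"] = MAX_INTERVAL
    return record

rec = {"retries": 0, "interval": DEFAULT_INTERVAL, "state": "ok"}
for _ in range(RETRY_MAX):
    reschedule(rec, "transient")
# rec is now "gone", but it stays in the DB with a long interval,
# so it will still be retried eventually.
```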

I am curious: if you crawl so much, you cannot manually maintain rules per 
site. So how do you currently deal with things like crawler traps and 
duplicate hosts? We have found crawler traps to be an interesting problem; you 
can either cluster them and classify those clusters, or simply limit the depth 
from the root (/) of a website. But if you limit the depth, you will never 
discover, for example, very old forum posts hidden on the 256th page of a 
thread overview. And how do you currently deal with duplicate hosts? Many 
websites have the typical problem of www. and non-www. addresses; for large 
websites this means a million additional useless records, extra data, etc. 
Also, many websites have more interesting hostname duplicates. We have seen 
many sites with a dozen different hostnames for the same content, and some 
crazy webmasters even (maybe deliberately) generate thousands! Many adult 
websites in particular generate thousands of hostnames for the same content.

We have found that addressing these horrible quirks of the web solved so many 
problems: far less IO, CPU, RAM and bandwidth is wasted. Maximizing bandwidth 
would be last on my list, because what is the use if you download so much 
rubbish?
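
The two crudest countermeasures mentioned above can be sketched in a few
lines; this folds only the www./non-www. case (real host normalization needs
far more rules) and shows the depth-limit trade-off:

```python
from urllib.parse import urlsplit

def normalize_host(url):
    """Fold the trivial duplicate-host case: treat www.example.com and
    example.com as the same site."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def within_depth(url, max_depth):
    """Crude trap limiter: count path segments from the site root.
    Note this also cuts off legitimately deep pages, e.g. page 256
    of an old forum thread."""
    depth = len([seg for seg in urlsplit(url).path.split("/") if seg])
    return depth <= max_depth
```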

> 
> 
> >
> > >
> > > Thanks!
> > > -dan
> > >
> >
> 
