Thanks again for the quick response, see inline.

On Mon, Jun 16, 2014 at 4:08 PM, Markus Jelsma <[email protected]> wrote:
> Hi Dan, see inline again.
> Markus
>
> -----Original message-----
> > From: Dan Kinder <[email protected]>
> > Sent: Tuesday 17th June 2014 0:32
> > To: [email protected]
> > Subject: Re: Clarifications regarding re-crawl and Nutch2 storage
> >
> > Thanks for the response Markus, a few clarification questions below.
> >
> > On Mon, Jun 16, 2014 at 3:01 PM, Markus Jelsma <[email protected]> wrote:
> >
> > > Hi Dan, please see inline for comments.
> > >
> > > Regards,
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Dan Kinder <[email protected]>
> > > > Sent: Monday 16th June 2014 23:32
> > > > To: [email protected]
> > > > Subject: Clarifications regarding re-crawl and Nutch2 storage
> > > >
> > > > Hi there,
> > > >
> > > > My company currently runs a full-web crawler (focusing on written content, including content from PDFs, Word docs, etc., to support our product). It's fully proprietary (including the indexing solution) and fairly old.
> > > >
> > > > We're looking to potentially upgrade, and I've been reading quite a bit about Nutch. It seems promising, but I have questions I've had trouble finding answers to in the existing wikis and blogs. My apologies if I just haven't dug deep enough on these; feel free to point to resources.
> > > >
> > > > 1) The Nutch examples generally seem to update the link database, generate new segments, crawl, then repeat. Can this be done continuously and simultaneously, so that we are constantly using our crawl bandwidth? (I.e. is there an issue generating new segments while crawls and db updates are happening?)
> > > > I wonder this especially because we're interested in keeping as live a dataset as possible; most of the docs seem to indicate that a large crawl may take on the order of weeks, and thus a new link may not be indexed until the following cycle, a month or two after we grab or inject it.
> > >
> > > It is not recommended to overlap crawls and database updates with Nutch, but it is possible; usually it is also not required. Even with very large websites you must pay attention to politeness: you cannot, or should not, do more than a page every few seconds. This still means you can crawl a large number of pages in a month. It is usually not interesting to re-crawl a specific page more than once a month, except for pages that list new URLs. So once you have crawled the entire site, which may take some time, you are good to go. If you design the setup so it does not do large fetches (segments) and does not spend much time updating databases, you can crawl a lot and still be fresh. For example, our site search platform has continuous crawl cycles that take no longer than 15 minutes; this means that new pages are discovered and indexed within 30 minutes, even for sites that have a few million URLs.
> >
> > I can see what you mean, but as I said we are crawling the whole internet (as much as we can get). We do pay attention to politeness rules (and try to go faster for sites that set a low crawl-delay), but given the breadth of sites out there we'd like to be using our bandwidth as continuously as possible.
> >
> > Are there any resources out there from people who have done overlapping crawls and DB updates so as to be continuously crawling? You say it's possible, but I'm wondering if there are concerns in doing that.
>
> I don't know of any resources that discuss this aspect of Nutch.
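As a sketch, the short continuous cycle described above (generate, fetch, parse, updatedb) could be driven by a small loop like the one below. The `bin/nutch` location, the directory layout, and the `-topN` value are assumptions for illustration, not a tested production setup.

```python
# Hedged sketch of the generate -> fetch -> parse -> updatedb cycle.
# Keeping segments small (-topN) keeps each round short, which is what
# makes 15-minute continuous cycles feasible.
import subprocess

NUTCH = "bin/nutch"          # assumed location of the Nutch launcher
CRAWLDB = "crawl/crawldb"    # assumed crawl directory layout
SEGMENTS = "crawl/segments"

def crawl_cycle(runner=subprocess.check_call):
    """Run one short cycle; 'runner' is injectable so the commands can
    be printed (dry run) instead of executed."""
    cmds = [
        [NUTCH, "generate", CRAWLDB, SEGMENTS, "-topN", "1000"],
        # the newest segment path would be passed to the next three
        # steps; "SEGMENT" is a placeholder for that path
        [NUTCH, "fetch", "SEGMENT"],
        [NUTCH, "parse", "SEGMENT"],
        [NUTCH, "updatedb", CRAWLDB, "SEGMENT"],
    ]
    for cmd in cmds:
        runner(cmd)
    return cmds

if __name__ == "__main__":
    # dry run: print the commands instead of executing them
    crawl_cycle(runner=lambda cmd: print(" ".join(cmd)))
```

Looping this with a small `-topN` trades per-round throughput for freshness, which matches the 15-minute cycles described above.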
> Doing overlapping database updates is tricky because you need to plan it around the time it takes to do one cycle. You can generate more segments and continue crawling. The generate tool has a feature that makes sure the next generate round (without an update) does not generate the same records; the DB has state, so without that feature, if you generate two segments without updating you will get the same segments.
>
> It would need careful planning and trial and error to tune it properly. It would be hard to maximize bandwidth continuously, but you can use tricks such as partitioning the DB and having multiple crawlers run at the same time. An idea would be to have one crawler do .com, another one .de, .jp and .nl, one for .net and .org, etc. You already have the data, so you should be able to come up with an evenly spread list of TLDs, or learn along the way. Both versions of Nutch allow you to point to either a path on the file system (1.x) or an address of a database (2.x).
>
> > > > 2) I see that Nutch 1 is tied to Hadoop as a backend, vs. Nutch 2 which allows pluggable backends via Gora. Yet I'm getting the (possibly false) impression that HDFS/Hadoop is still somehow involved in Nutch 2 (there's still a crawlDir and such referenced here: http://wiki.apache.org/nutch/Nutch2Cassandra; FYI, we're most interested in a Cassandra backend right now). If this is true, how does it play in? Is Hadoop/HDFS used for job distribution and intermediate data while all permanent data is in Cassandra?
> > >
> > > Both can run on Hadoop, but Nutch 1.x uses Hadoop sequence files to store data where Nutch 2.x uses Gora to abstract storage. Hadoop's Map/Reduce framework is used in both versions. Both can run fine, but 1.x is considered the main stable distribution and has some more features.
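The TLD-partitioning idea above (one crawler for .com, another for .de/.jp/.nl, one for .net/.org, and so on) could be sketched as a simple router. The partition layout follows the example in the mail; the function and partition names are made up for illustration.

```python
# Sketch: route each URL to one of several crawler partitions by its
# top-level domain, so multiple crawlers can run concurrently against
# disjoint slices of the web.
from urllib.parse import urlparse

PARTITIONS = {
    "crawler-1": {"com"},
    "crawler-2": {"de", "jp", "nl"},
    "crawler-3": {"net", "org"},
}

def partition_for(url, default="crawler-4"):
    """Pick a crawler partition from the URL's top-level domain."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    for name, tlds in PARTITIONS.items():
        if tld in tlds:
            return name
    return default  # everything else: remaining ccTLDs, new gTLDs, ...
```

Since politeness is enforced per host, disjoint TLD slices never contend for the same hosts, which is what makes running the crawlers in parallel safe.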
> > > At this time 1.x is also still faster than 2.x, but this may not be a problem if your data isn't large.
> > > You mention you operate a full-web crawler; does this mean you have billions of billions of records? I do not know how Nutch 2.x with Cassandra will deal with that; Nutch 1.x can deal with it provided that you have powerful hardware, although you would need that anyway. If you just have a few million, you wouldn't even need Hadoop to distribute your jobs.
> >
> > Yes, we have billions of billions of records, so based on what you're saying I would probably explore Nutch 1.x.
> >
> > To clarify, when you say that both Nutch 1.x and 2.x use Hadoop Map/Reduce, do you mean that even for 2.x I would need to run a Hadoop cluster that would do the actual crawling in Hadoop map jobs (even if the map jobs simply talk to Cassandra via Gora)? Is that the case?
>
> Well yes, with that amount of records you will need a cluster for sure. Both versions of Nutch rely on map/reduce jobs; they read data in the map phase and write data after the reduce phase. 1.x uses sequence files, and 2.x uses Gora and the selected back end. Nutch 1.x does the actual fetching of URLs in mappers and 2.x does it in the reducers. With very large jobs, 2.x is at a disadvantage because of the shuffling of data between map and reduce. You can work around that by doing many smaller jobs.
>
> > > > 3) What is Nutch's behavior for non-200 HTTP codes? More broadly, are there any controls regarding how often to retry previously fetched links (maybe depending on their return code, whether they had changes, pagerank, etc.), and how often to try newly fetched links?
> > > > My reading so far indicates that with the default 30-day refresh interval we'll simply try to re-crawl every single link every interval; if this is true, then it seems like we would often be crawling pages that haven't changed.
> > >
> > > Nutch allows for pluggable implementations of a fetch schedule, which gives fine-grained control over rescheduling behaviour. We ship with a default and also an adaptive scheduler; the latter will, for example, recrawl frequently changing pages more often. The downside is that it will also recrawl overview (or hub) pages more frequently. Although they allow you to discover new content, you only need to crawl them once, except for the first overview page, which lists very recently added content. But using parser plugins that can detect such pages and set some values, plus a custom fetch schedule, you can solve such problems.
> >
> > Thanks, I didn't see the AdaptiveFetchSchedule (found the docs here: http://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/crawl/AdaptiveFetchSchedule.html); that does mostly answer my question.
> >
> > I also just found a page that helped me regarding "failed" fetches: http://wiki.apache.org/nutch/CrawlDatumStates. It's a little old, but I'm assuming the behavior hasn't changed much. It seems like for 404, 500, or other non-successful fetches it will stop trying after a few times.
>
> Nutch will in the end always retry everything, unless you make some adaptation not to do it.
>
> I am curious: if you crawl so much, you cannot manually maintain rules per site. So how do you currently deal with things like crawler traps and duplicate hosts? We have found that crawler traps are an interesting problem; you can either cluster them and classify those clusters, or simply limit the depth from the / of a website.
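The core idea behind an adaptive scheduler like AdaptiveFetchSchedule can be sketched as follows: shrink the refetch interval when a page has changed, grow it when it has not, and clamp it to sane bounds. The rate and bound values below are illustrative assumptions, not Nutch's shipped defaults, which are configurable.

```python
# Simplified sketch of an adaptive fetch schedule: pages that change
# get refetched sooner, pages that do not get refetched later.
INC_RATE = 0.4               # assumed: grow interval 40% when unmodified
DEC_RATE = 0.2               # assumed: shrink interval 20% when modified
MIN_INTERVAL = 60.0          # seconds; lower bound on the interval
MAX_INTERVAL = 30 * 86400.0  # seconds; upper bound (30 days)

def next_interval(interval, modified):
    """Return the next refetch interval in seconds for one page."""
    if modified:
        interval *= 1.0 - DEC_RATE
    else:
        interval *= 1.0 + INC_RATE
    return min(max(interval, MIN_INTERVAL), MAX_INTERVAL)
```

Run over many cycles, this converges static pages toward the maximum interval while frequently changing pages stay near the minimum, which is exactly why it over-visits hub pages: they change on every new posting.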
> But if you limit the depth, you will never discover, for example, very old forum posts that are hidden at the 256th page of a thread overview. And how do you currently deal with duplicate hosts? Many websites have the typical problem of www. and non-www. addresses; for large websites this means a million additional useless records, data, etc. Also, many websites have more interesting hostname duplicates. We have seen many sites having a dozen different hostnames for the same content; some crazy webmasters even (maybe deliberately) generate thousands! We have also seen that many adult websites generate thousands of hostnames for the same content.
>
> We have found that addressing these horrible issues of the web solved so many problems, and that much less IO, CPU, RAM and bandwidth is being wasted. Maximising bandwidth would be the last thing on my list, because what is the use if you download so much rubbish?

Regarding duplicate hosts: we have found this to be a problem particularly with subdomains (not so much with TLDs). We handle this basically by giving fair crawl coverage at a TLD level; think of it like handling the subdomain as just a part of the path. This disadvantages us at covering sites with many paths and subdomains, but helps us not get trapped in sites with millions of subdomains. We also have some intelligent rules that try to figure out if a domain and its subdomain have all the same path content (e.g. the www. and non-www. case).

Regarding crawler traps, that's much harder, as you say; we're still trying to mitigate that difficulty. Does Nutch do anything out of the box to handle these things?

> > > > Thanks!
> > > > -dan
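The www./non-www. deduplication discussed above amounts to normalising hostnames to a canonical form before URLs enter the crawl DB, so both variants collapse into one record. In Nutch itself such a rule would live in a URL normalizer plugin; this standalone function is just an illustration of the idea.

```python
# Sketch: collapse www. and non-www. variants of a URL into one
# canonical form so they become a single crawl DB record.
from urllib.parse import urlsplit, urlunsplit

def normalize_host(url):
    """Lowercase the host and strip a leading 'www.'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # preserve an explicit port if one was given
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))
```

Sites that mirror content across many arbitrary hostnames (not just www.) need content-based deduplication on top of this; hostname normalisation only catches the mechanical variants.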

