Thanks, interesting to hear! Nutch has no features for spider trap detection (yet). But you can get a long way with complex generic regular expressions, making sure you get rid of almost everything that looks like a calendar feature. Use the existing urlfilter-regex plugin for this.
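For illustration, a regex-urlfilter.txt fragment along these lines could reject typical calendar-style URLs before they enter the crawl. The patterns below are examples to tune for your own crawl, not a ready-made rule set; urlfilter-regex applies rules top-down and the first match wins, with `-` rejecting a URL and `+` accepting it.

```
# Illustrative only: tune these patterns for your crawl.
# Reject query strings that page through calendar dates
-(?i).*[?&](year|month|week|day|date)=[0-9]
# Reject obvious calendar/agenda modules
-(?i).*/(calendar|agenda)/.*
# Accept everything else
+.
```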
Nutch already has a feature that allows you to remap hosts, the urlnormalizer-host plugin, but detection code that generates such a config file has not been added yet. Also, the urlnormalizer-host plugin won't work at very large scale because the list of remapped hosts (which is no more than a Map<String, String>) would eat away your memory. Changing the code to use an FST would make sense as it compresses the data massively, but still, you won't like 100+ MB in each mapper task.

There is also the problem of unknown hosts: they are retried, so that is a problem in itself. You can work around this using the domainblacklist-urlfilter plugin, but again, this is also a manual task to fill. It uses a Set<String> inside, so it has the same issues as I described above. The hostdb tool (uncommitted code) can be used to automatically fill the config file for this plugin.

Markus

-----Original message-----
> From: Dan Kinder <[email protected]>
> Sent: Tuesday 17th June 2014 19:24
> To: [email protected]
> Subject: Re: Clarifications regarding re-crawl and Nutch2 storage
>
> Thanks again for the quick response, see inline.
>
> On Mon, Jun 16, 2014 at 4:08 PM, Markus Jelsma <[email protected]> wrote:
>
> > Hi Dan, see inline again.
> > Markus
> >
> > -----Original message-----
> > > From: Dan Kinder <[email protected]>
> > > Sent: Tuesday 17th June 2014 0:32
> > > To: [email protected]
> > > Subject: Re: Clarifications regarding re-crawl and Nutch2 storage
> > >
> > > Thanks for the response Markus, a few clarification questions below.
> > >
> > > On Mon, Jun 16, 2014 at 3:01 PM, Markus Jelsma <[email protected]> wrote:
> > >
> > > > Hi Dan, please see inline for comments.
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > > -----Original message-----
> > > > > From: Dan Kinder <[email protected]>
> > > > > Sent: Monday 16th June 2014 23:32
> > > > > To: [email protected]
> > > > > Subject: Clarifications regarding re-crawl and Nutch2 storage
> > > > >
> > > > > Hi there,
> > > > >
> > > > > My company currently runs a full-web crawler (focusing on written content, including content from PDFs, Word docs, etc., to support our product). It's fully proprietary (including the indexing solution) and fairly old.
> > > > >
> > > > > We're looking to potentially upgrade and I've been reading quite a bit about Nutch. It seems promising but I have questions I've had trouble finding answers to in the existing wikis and blogs. My apologies if I just haven't dug deep enough on these; feel free to point to resources.
> > > > >
> > > > > 1) The Nutch examples generally seem to update the link database, generate new segments, crawl, then repeat. Can this be done continuously and simultaneously, so that we are constantly using our crawl bandwidth? (I.e. is there an issue generating new segments while crawls and db updates are happening?) I wonder this especially because we're interested in keeping as live a dataset as possible; most of the docs seem to indicate that a large crawl may take on the order of weeks, and thus a new link may not be indexed until the following cycle, a month or two after we grab or inject it.
> > > >
> > > > It is not recommended to overlap crawls and database updates with Nutch. It is possible, but usually also not required. Even with very large websites you must pay attention to politeness; you cannot, or should not, do more than a page every few seconds.
> > > > This still means you can crawl a large amount of pages in a month. It is usually not interesting to re-crawl a specific page more than once a month, except for pages that list new URLs. So once you have crawled the entire site, which may take some time, you are good to go. If you design the setup so it does not do large fetches (segments) and does not spend much time updating databases, you can crawl a lot and still be fresh. For example, our site search platform has continuous crawl cycles that take no longer than 15 minutes; this means that new pages are discovered and indexed within 30 minutes, even for sites that have a few million URLs.
> > >
> > > I can see what you mean, but as I said we are crawling the whole internet (as much as we can get). We do pay attention to politeness rules (and try to go faster for sites that set a low crawl-delay), but given the breadth of sites out there we'd like to be using our bandwidth as continuously as possible.
> > >
> > > Are there any resources out there of people who have done overlapping crawls and DB updates so as to be continuously crawling? You say it's possible but I'm wondering if there are concerns in doing that.
> >
> > I don't know of any resources that discuss this aspect of Nutch. Doing overlapping database updates is tricky because you need to plan it against the time it takes to do one cycle. You can generate more segments and continue crawling. The generate tool has a feature that makes sure the next generate round (without an update) does not generate the same records. The DB has a state: without that feature, if you generate two segments without updating you will get the same segments.
> >
> > It would need careful planning and trial and error to tune it properly.
> > It would be hard to maximize bandwidth continuously, but you can use tricks such as partitioning the DB and having multiple crawlers run at the same time. An idea would be to have one crawler do .com, another one .de, .jp and .nl, one for .net and .org, etc. You already have the data, so you should be able to come up with an evenly spread list of TLDs, or learn along the way. Both versions of Nutch allow you to point to either a path on the file system (1.x) or an address to a database.
> >
> > >
> > > > > 2) I see that Nutch 1 is tied to Hadoop as a backend, vs. Nutch 2 which allows pluggable backends via Gora. Yet I'm getting the (possibly false) impression that HDFS/Hadoop is still somehow involved in Nutch 2 (there's still a crawlDir and such referenced here: http://wiki.apache.org/nutch/Nutch2Cassandra; FYI we're most interested in a Cassandra backend right now). If this is true, how does it play in? Is Hadoop/HDFS used for job distribution and intermediate data while all permanent data is in Cassandra?
> > > >
> > > > Both can run on Hadoop, but Nutch 1.x uses Hadoop sequence files to store data where Nutch 2.x uses Gora to abstract storage. Hadoop's Map/Reduce framework is used in both versions. Both can run fine, but 1.x is considered the main stable distribution and has some more features. At this time 1.x is also still faster than 2.x, but this may not be a problem if your data isn't large. You mention you operate a full-web crawler; does this mean you have billions and billions of records? I do not know how Nutch 2.x with Cassandra will deal with that. Nutch 1.x can deal with it provided that you have powerful hardware, although you would need that anyway.
> > > > If you just have a few million, you wouldn't even need Hadoop to distribute your jobs.
> > >
> > > Yes, we have billions and billions of records, so based on what you're saying I would probably explore Nutch 1.x.
> > >
> > > To clarify, when you say that both Nutch 1.x and 2.x use Hadoop Map/Reduce, you mean that even for 2.x I would need to run a Hadoop cluster that would do the actual crawling in Hadoop map jobs (even if the map jobs simply talk to Cassandra via Gora), is that the case?
> >
> > Well yes, with that amount of records you will need a cluster for sure. Both versions of Nutch rely on map/reduce jobs; they read data in the map phase and write data after the reduce phase. 1.x uses sequence files, 2.x uses Gora and the selected back end. Nutch 1.x does the actual fetching of URLs in mappers and 2.x does it in the reducers. With very large jobs, 2.x is at a disadvantage because of the shuffling of data between map and reduce. You can work around that by doing many smaller jobs.
> >
> > >
> > > > > 3) What is Nutch's behavior for non-200 HTTP codes? More broadly, are there any controls regarding how often to retry previously fetched links (maybe depending on their return code, whether they had changes, pagerank, etc.), and how often to try newly fetched links? My reading so far indicates that with the default 30-day refresh interval we'll simply try to re-crawl every single link every interval; if this is true then it seems like we would often be crawling pages that haven't changed.
> > > >
> > > > Nutch allows for pluggable implementations of a fetch schedule. This allows fine-grained control over rescheduling behaviour. We ship with a default and also an adaptive scheduler; that one will, for example, recrawl frequently changing pages more frequently.
> > > > The downside is that it will also recrawl overview (or hub) pages more frequently. Although they allow you to discover new content, you only need to crawl them once, except for the first overview page that lists very recently added content. But using parser plugins that can detect such pages and set some values, and a custom fetch schedule, you can solve such problems.
> > >
> > > Thanks, I didn't see the AdaptiveFetchSchedule (found the docs here: http://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/crawl/AdaptiveFetchSchedule.html), that does mostly answer my question.
> > >
> > > I also just found a page that helped me regarding the "failed" fetches: http://wiki.apache.org/nutch/CrawlDatumStates. It's a little old but I'm assuming the behavior hasn't changed much. Seems like for 404, 500, or other non-successful fetches it will stop trying after a few times.
> >
> > Nutch will in the end always retry everything, unless you make some adaptation not to do it.
> >
> > I am curious: if you crawl so much, you cannot manually maintain rules per site. So how do you currently deal with stuff like crawler traps and duplicate hosts? We have found that crawler traps are an interesting problem; you can either cluster them and classify those clusters, or simply limit the depth from the / of a website. But if you limit the depth, you will never discover, for example, very old forum posts that are hidden at the 256th page of a thread overview. And how do you currently deal with duplicate hosts? Many websites have the typical problem of www. and non-www. addresses; for large websites this means a million additional useless records and data etc. Also, many websites have more interesting hostname duplicates.
> > We have seen many sites having a dozen different hostnames for the same content; some crazy webmasters even (maybe deliberately) generate thousands! We have also seen that many adult websites generate thousands of hostnames for the same content.
> >
> > We have found that addressing these horrible issues of the web solved so many problems: much less IO, CPU, RAM and bandwidth is being wasted. Maximising bandwidth would be the last thing on my list, because what is the use if you download so much rubbish?
>
> Regarding duplicate hosts: we have found this to be a problem particularly with subdomains (not so much with TLDs). We handle this basically by giving fair crawl coverage at a TLD level. Think of it like handling the subdomain as just a part of the path. This disadvantages us at covering sites with many paths and subdomains, but helps us not get trapped in sites with millions of subdomains. We also do have some intelligent rules that try to figure out if a domain and its subdomains have all the same path content (e.g. the www. and non-www. case).
>
> Regarding crawler traps, that's much harder as you say; we're still trying to mitigate that difficulty.
>
> Does Nutch do anything out of the box to handle these things?
>
> > > > > Thanks!
> > > > > -dan
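To make the host-remapping discussion above concrete, here is a minimal sketch of the idea: a Map<String, String> from duplicate hostname to canonical hostname, which is essentially what the urlnormalizer-host plugin keeps in memory and exactly what becomes the memory problem Markus describes at web scale. The class and method names are illustrative, not the plugin's actual API, and the entries are hypothetical examples.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the urlnormalizer-host plugin's real code:
// remap duplicate hostnames (www., m., mirror hosts) onto one canonical
// hostname so the crawl DB holds a single record set per site.
public class HostRemapSketch {
    private final Map<String, String> remap = new HashMap<>();

    public HostRemapSketch() {
        // One entry per duplicate hostname. At web scale, millions of
        // such entries held in every mapper task are the memory problem
        // described above; an FST would compress this massively.
        remap.put("www.example.com", "example.com");
        remap.put("m.example.com", "example.com");
    }

    // Unknown hosts pass through unchanged, which is why they keep
    // being retried unless filtered separately.
    public String normalize(String host) {
        return remap.getOrDefault(host, host);
    }

    public static void main(String[] args) {
        HostRemapSketch s = new HostRemapSketch();
        System.out.println(s.normalize("www.example.com")); // example.com
        System.out.println(s.normalize("other.org"));       // other.org
    }
}
```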

