Not sure why this didn't go to the list.

---------- Forwarded message ----------
From: Markus Jelsma <[email protected]>
Date: Thu, Sep 22, 2011 at 3:17 PM
Subject: Re: Nutch crawl vs other commands
To: Bai Shen <[email protected]>


Hi, reply to the list.

> On Thu, Sep 22, 2011 at 2:01 PM, Markus Jelsma <[email protected]> wrote:
> > Not really. Once you've got many links pointing to each other, the
> > concept of depth no longer really applies. You don't have to manage the
> > DB manually, as it will regulate itself (optionally by using a custom
> > fetch scheduler). Nutch will select URLs due for fetch and will in the
> > end exhaust the full list of URLs, unless you're crawling the internet.
> > Fetched URLs will be refetched over time.
>
> So what's the best way to set up a schedule?  The fetching and parsing
> steps seem pretty linked due to the segments, etc.
>
> > Because the fetcher runs as a Hadoop mapred job. When the actual fetch
> > finishes Hadoop must write the contents, merge spilled records etc. This
> > is part of how mapred works.
>
> But shouldn't that be happening during the parse stage?  The fetcher is
> constantly writing out data to the mapred job while it's fetching.  Once
> it's done, that should be it AFAIK.  And then the parse command runs the
> mapred job.
>
> > Somewhere in the first 1.x version. Later it became a parse option that
> > actually never worked anyway until it was fixed in the current 1.4-dev.
> > Still, it's not recommended to parse during the fetch stage.
>
> -nods-  That's why I'm doing noParsing.  But you said that doesn't do
> anything anymore.  So what would the updated commands be for v1.3?
>
> > As I mentioned in the other reply, it writes out data:
> >
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java?view=markup
> >
> > This will take a while indeed and it won't log anything during its
> > execution.
>
> But that should be happening during the fetching, not after, right?
