Not sure why this didn't go to the list.

---------- Forwarded message ----------
From: Markus Jelsma <[email protected]>
Date: Thu, Sep 22, 2011 at 3:17 PM
Subject: Re: Nutch crawl vs other commands
To: Bai Shen <[email protected]>
hi, reply to the list

> On Thu, Sep 22, 2011 at 2:01 PM, Markus Jelsma <[email protected]> wrote:
> > Not really. Once you've got many links pointing to each other, the
> > concept of depth no longer really applies. You don't have to manage the
> > DB manually, as it will regulate itself (e.g. by using a custom fetch
> > scheduler). Nutch will select URLs due for fetch and will in the end
> > exhaust the full list of URLs, unless you're crawling the internet.
> > Fetched URLs will be refetched over time.
>
> So what's the best way to set up a schedule? The fetching and parsing
> steps seem pretty linked due to the segments, etc.
>
> > Because the fetcher runs as a Hadoop mapred job. When the actual fetch
> > finishes, Hadoop must write the contents, merge spilled records, etc.
> > This is part of how mapred works.
>
> But shouldn't that be happening during the parse stage? The fetcher is
> constantly writing out data to the mapred job while it's fetching. Once
> it's done, that should be it, AFAIK. And then the parse command runs the
> mapred job.
>
> > Somewhere in the first 1.x version. Later it became a parse option that
> > actually never worked anyway until it was fixed in the current 1.4-dev.
> > Still, it's not recommended to parse during the fetch stage.
>
> -nods- That's why I'm doing noParsing. But you said that doesn't do
> anything anymore. So what would the updated commands be for v1.3?
>
> > As I mentioned in the other reply, it writes out data:
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java?view=markup
> >
> > This will take a while indeed and it won't log anything during its
> > execution.
>
> But that should be happening during the fetching, not after, right?
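[For anyone following along: the "updated commands for v1.3" being asked about are the separate crawl-cycle steps, where fetching and parsing are distinct jobs. A rough sketch, assuming a Nutch 1.3 binary install with `bin/nutch` on the path; the `crawl/` and `urls/` directory names are illustrative, not prescribed:]

```shell
# Step-by-step Nutch 1.3 crawl cycle with parsing as a separate job
# (in 1.3, fetcher.parse defaults to false, so "fetch" alone does not parse).
NUTCH=bin/nutch
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

$NUTCH inject $CRAWLDB urls             # seed the crawldb from a urls/ seed dir
$NUTCH generate $CRAWLDB $SEGMENTS      # select URLs due for fetch into a new segment
SEGMENT=$(ls -d $SEGMENTS/* | tail -1)  # pick the newest segment just generated
$NUTCH fetch $SEGMENT                   # fetch only; content is written by FetcherOutputFormat
$NUTCH parse $SEGMENT                   # parse the fetched content as its own mapred job
$NUTCH updatedb $CRAWLDB $SEGMENT       # fold fetch results back into the crawldb
```

Repeating generate/fetch/parse/updatedb is the loop; each iteration fetches whatever the crawldb says is due.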

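[On the "best way to set up a schedule" question above: the refetch interval and scheduler are configured in nutch-site.xml rather than on the command line. A sketch, with illustrative values; these properties exist in Nutch 1.x nutch-default.xml:]

```xml
<!-- nutch-site.xml: refetch scheduling (values illustrative) -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value> <!-- seconds; refetch pages after ~30 days -->
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <!-- AdaptiveFetchSchedule adjusts each page's interval up or down
       depending on whether the page changed since the last fetch -->
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
```

With this in place the generate step simply selects whatever is due, which is why no manual DB management or depth bookkeeping is needed.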
