Hi,

There have been a lot of changes in 2.x recently, notably the use of filtered scans in GORA, which addresses this issue.
Please check out 2.x (https://svn.apache.org/repos/asf/nutch/branches/2.x/) and give it a try. See https://issues.apache.org/jira/browse/NUTCH-1674 and https://issues.apache.org/jira/browse/NUTCH-1714 for details.

Julien

On Sunday, 25 May 2014, Azhar Jassal <[email protected]> wrote:

> Hi
>
> I'm using Nutch 2.2.1
>
> Each of the 4 jobs in the crawl cycle, as explained here, needs to reread the
> entire webtable to get started:
> http://wiki.apache.org/nutch/Nutch2Crawling
>
> This is a serious bottleneck for my use case.
>
> I know that the fetch and parse jobs can be combined via the Nutch config.
> This removes the need for the parse job to be run separately, and therefore
> the webtable does not need to be read again.
>
> The page I linked to states that a future development might be combining
> the generate and fetch stages so that only one read of the webtable is
> required.
>
> Has anyone attempted to do this? Is there a patch out there for a combined
> generator and fetch job?
>
> Thanks
>
> Az

