Hi

There have been a lot of changes in 2.x recently, notably the use of filtered
scans in GORA, which addresses this issue.

Please check out 2.x
(https://svn.apache.org/repos/asf/nutch/branches/2.x/) and give it a try.

See https://issues.apache.org/jira/browse/NUTCH-1674 and
https://issues.apache.org/jira/browse/NUTCH-1714 for details.

Julien


On Sunday, 25 May 2014, Azhar Jassal <[email protected]> wrote:

> Hi
>
> I'm using Nutch 2.2.1
>
> Each of the 4 jobs in the crawl cycle, as explained here, needs to reread the
> entire webtable to get started:
> http://wiki.apache.org/nutch/Nutch2Crawling
>
> This is a serious bottleneck for my use case.
>
> I know that the fetch and parse job can be combined via the Nutch config.
> This removes the need for the parse job to be run separately, and therefore
> the webtable does not need to be read again.
>
> The page I linked to states that a future development might be combining
> the generate and fetch stages so that only one read of the webtable is
> required.
>
> Has anyone attempted to do this? Is there a patch out there for a combined
> generator and fetch job?
>
> Thanks
>
> Az
>
