> Is this a problem with Nutch failing to fetch a huge list of URLs?

Probably not; Nutch is able to fetch millions of URLs in a single fetch list.

Is there a time limit set (property fetcher.timelimit.mins)? That could explain
why a large list isn't fully fetched. There should be a message in the log files
about the reason.
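
If a time limit is the cause, it can be raised or disabled in nutch-site.xml. A
minimal sketch (the default shipped in nutch-default.xml is -1, i.e. no limit):

  <property>
    <name>fetcher.timelimit.mins</name>
    <!-- -1 disables the limit; a positive value stops the fetcher after
         that many minutes and skips the URLs still queued -->
    <value>-1</value>
  </property>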

Best,
Sebastian

On 04/12/2017 06:55 AM, shubham.gupta wrote:
> Hey
> 
> I have around 5000 URLs in my seed URL list. If I inject the whole list, Nutch 
> fails to fetch and parse all documents. The depth is set to 1.
> 
> But when the list is divided into batches of 1000 URLs, it is able to fetch 
> and parse all documents successfully.
> 
> In the former case, 5141 URLs are injected, of which 5127 are generated; only 
> 1300 URLs get fetched with status 2, another 1342 have a status other than 2, 
> and the rest are unfetched.
> 
> When the list is small, the total count of documents is 3220: 1298 documents 
> have status 2, 1922 have a status other than 2, and 1 has not been fetched 
> yet.
> 
> Is this a problem with Nutch failing to fetch a huge list of URLs? Or do some 
> changes need to be made in the configuration files?
> 
> Please reply soon.
> 
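
By the way, per-status counts like these can be read from the CrawlDb directly.
A quick sketch, assuming Nutch 1.x with the CrawlDb at crawl/crawldb (adjust the
path to your setup):

  bin/nutch readdb crawl/crawldb -stats

The output shows the total number of URLs and a breakdown by status, e.g.
status 1 (db_unfetched) and status 2 (db_fetched).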
