Re: Nutch generating less URLs for fetcher to fetch (running in Hadoop mode)

2016-04-14 Thread Sebastian Nagel
Hi,

Amazing, it has been this way since the very beginning (commit 5943b9f1 by Doug Cutting).

Yes, it's related to politeness, and it's correct, as far as I can see.

Generator.Selector implements 3 methods:

map() - selects unfetched entries and sorts them by decreasing score (more
relevant ones first)

partition() - ensures that all entries of one host end up in the same partition

reduce() - applies the limits (topN and "generate.max.count")

Because reducers cannot communicate, the limits have to be adjusted in relation
to the number of reducers.  For the per-host/IP limit this is not necessary
because all URLs of one host/IP are in the same partition and hence in the same
reducer.  But the "global" limit topN must be adapted.  If the URLs
are evenly distributed over multiple hosts/IPs that's ok, but in your case
(I assume all URLs belong to a single host) there will be exactly one non-empty
partition, which then contains fewer URLs.

I don't know whether this can be fixed easily; the limit would need to be
adjusted when there are fewer non-empty partitions than expected.
But the logging could be improved for sure.
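A practical workaround for a single-host crawl (a sketch derived from the
command quoted below in this thread; I have not verified it on your setup) is
to run generate with a single reducer, so the whole topN budget lands in the
one partition:

```shell
# Single reducer: per-reducer limit = topN / 1 = 1000,
# so the lone host partition can use the full budget.
nutch generate -D mapreduce.job.reduces=1 \
  crawl/crawldb crawl/segments \
  -topN 1000 -numFetchers 1 -noFilter
```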

Cheers,
Sebastian

On 04/14/2016 11:11 AM, Karanjeet Singh wrote:
> Thanks, Sebastian.
> 
> This is solved now. I looked through the code and found that Nutch places a
> limit on the number of URLs per reducer, which is defined as *topN / number
> of reduce tasks*. Please refer here [0].
> 
> So, I was running 16 reduce tasks with a topN of 1000 and hence got 62 URLs
> (1000 / 16).
> 
> I am interested to know the reason for this. Is it due to politeness?
> 
> [0]:
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L141
> 
> Regards,
> Karanjeet Singh
> USC
> 
> On Thu, Apr 14, 2016 at 1:40 AM, Sebastian Nagel wrote:
> 
>> Hi,
>>
>> I didn't see anything wrong. Did you check whether
>> CrawlDb entries are marked as "generated"
>> by "_ngt_="?  With generate.update.crawldb=true
>> it may happen that after having run generate
>> multiple times, only 62 unfetched and not-generated
>> entries remain.
>>
>> Sebastian
>>
>> On 04/14/2016 03:31 AM, Karanjeet Singh wrote:
>>> Hello,
>>>
>>> I am trying to crawl a website using Nutch on Hadoop cluster. I have
>>> modified the crawl script to restrict the sizeFetchList to 1000 (which is
>>> the topN value for nutch generate command).
>>>
>>> However, Nutch is generating only 62 URLs while the unfetched URL count is
>>> approximately 5,000. I am using the command below:
>>>
>>> nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
>>> mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
>>> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
>>> mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN
>> 1000
>>> -numFetchers 1 -noFilter
>>>
>>> Can anyone please look into this and let me know if I am missing
>> something.
>>> Please find the crawl configuration here [0].
>>>
>>> [0]:
>> https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf
>>>
>>> Thanks & Regards,
>>> Karanjeet Singh
>>> USC
>>>
>>
>>
> 



Re: Nutch generating less URLs for fetcher to fetch (running in Hadoop mode)

2016-04-14 Thread Karanjeet Singh
Thanks, Sebastian.

This is solved now. I looked through the code and found that Nutch places a
limit on the number of URLs per reducer, which is defined as *topN / number
of reduce tasks*. Please refer here [0].

So, I was running 16 reduce tasks with a topN of 1000 and hence got 62 URLs
(1000 / 16).

I am interested to know the reason for this. Is it due to politeness?

[0]:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L141

Regards,
Karanjeet Singh
USC

On Thu, Apr 14, 2016 at 1:40 AM, Sebastian Nagel  wrote:

> Hi,
>
> I didn't see anything wrong. Did you check whether
> CrawlDb entries are marked as "generated"
> by "_ngt_="?  With generate.update.crawldb=true
> it may happen that after having run generate
> multiple times, only 62 unfetched and not-generated
> entries remain.
>
> Sebastian
>
> On 04/14/2016 03:31 AM, Karanjeet Singh wrote:
> > Hello,
> >
> > I am trying to crawl a website using Nutch on Hadoop cluster. I have
> > modified the crawl script to restrict the sizeFetchList to 1000 (which is
> > the topN value for nutch generate command).
> >
> > However, Nutch is generating only 62 URLs while the unfetched URL count is
> > approximately 5,000. I am using the command below:
> >
> > nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
> > mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
> > mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> > mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN
> 1000
> > -numFetchers 1 -noFilter
> >
> > Can anyone please look into this and let me know if I am missing
> something.
> > Please find the crawl configuration here [0].
> >
> > [0]:
> https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf
> >
> > Thanks & Regards,
> > Karanjeet Singh
> > USC
> >
>
>


Re: Nutch generating less URLs for fetcher to fetch (running in Hadoop mode)

2016-04-14 Thread Sebastian Nagel
Hi,

I didn't see anything wrong. Did you check whether
CrawlDb entries are marked as "generated"
by "_ngt_="?  With generate.update.crawldb=true
it may happen that after having run generate
multiple times, only 62 unfetched and not-generated
entries remain.

Sebastian

On 04/14/2016 03:31 AM, Karanjeet Singh wrote:
> Hello,
> 
> I am trying to crawl a website using Nutch on Hadoop cluster. I have
> modified the crawl script to restrict the sizeFetchList to 1000 (which is
> the topN value for nutch generate command).
> 
> However, Nutch is generating only 62 URLs while the unfetched URL count is
> approximately 5,000. I am using the command below:
> 
> nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
> mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 1000
> -numFetchers 1 -noFilter
> 
> Can anyone please look into this and let me know if I am missing something.
> Please find the crawl configuration here [0].
> 
> [0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf
> 
> Thanks & Regards,
> Karanjeet Singh
> USC
> 



Nutch generating less URLs for fetcher to fetch (running in Hadoop mode)

2016-04-13 Thread Karanjeet Singh
Hello,

I am trying to crawl a website using Nutch on Hadoop cluster. I have
modified the crawl script to restrict the sizeFetchList to 1000 (which is
the topN value for nutch generate command).

However, Nutch is generating only 62 URLs while the unfetched URL count is
approximately 5,000. I am using the command below:

nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 1000
-numFetchers 1 -noFilter

Can anyone please look into this and let me know if I am missing something.
Please find the crawl configuration here [0].

[0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf

Thanks & Regards,
Karanjeet Singh
USC