[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-19 Thread Arthur B (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588892#comment-15588892
 ] 

Arthur B commented on NUTCH-2328:
-

Well then ... the issue stands. Cheers for the info

> GeneratorJob does not generate anything on second run
> -
>
> Key: NUTCH-2328
> URL: https://issues.apache.org/jira/browse/NUTCH-2328
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.2, 2.3, 2.2.1, 2.3.1
> Environment: Ubuntu 16.04 / Hadoop 2.7.1
>Reporter: Arthur B
>  Labels: fails, generator, subsequent
> Fix For: 2.4
>
> Attachments: generator-issue-static-count.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Given a topN parameter (e.g. 10), the GeneratorJob fails to generate 
> anything new on subsequent runs within the same process space.
> To reproduce the issue, submit the GeneratorJob twice in a row to the 
> M/R framework. The second run reports that it generated 0 URLs.
> The cause is the static count field 
> (org.apache.nutch.crawl.GeneratorReducer#count), which is used to determine 
> whether the topN value has been reached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-19 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588759#comment-15588759
 ] 

Sebastian Nagel commented on NUTCH-2328:


Hi [~arthur-evozon],

> Btw., I'm even not sure whether counters in the task context are updated with 
> global job values.

The answer is: counters in the task context *do not show aggregated values* 
over all tasks of a job.
That's more or less clear from this definition: "Counter is a facility for 
MapReduce applications to report its statistics." 
[[1|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Counter]];
more details are in Tom White's "Hadoop: The Definitive Guide". Counter values 
are propagated and aggregated in this direction: task -> task tracker -> 
job tracker -> job client. The aggregated values are not sent back. There are 
hacks to access the aggregated values from the tasks, cf. 
[[2|http://stackoverflow.com/questions/8009802/is-there-a-way-to-access-number-of-successful-map-tasks-from-a-reduce-task-in-an/8013573#8013573]].
But the central problem persists: the value of a job counter depends mostly on 
how many tasks are in progress or already done. If the program logic relies 
on such a value, it becomes largely nondeterministic.



[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-19 Thread Arthur B (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588681#comment-15588681
 ] 

Arthur B commented on NUTCH-2328:
-

Just making one last case for using a global job counter (with the caveat that 
I am not as familiar with the internals of Nutch as you are): 
* you are already incrementing this counter, so I am wondering how much more 
overhead a read would add; and 
* what I noticed from the tests I ran is that you will actually be losing 
more than {{topN mod mapred.reduce.tasks}} URLs due to reducer starvation (I 
think): only roughly 60-70% of topN was reached in my tests.

Just a thought. 
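
One way a shortfall larger than {{topN mod mapred.reduce.tasks}} can arise with per-reducer limits is that some reducers receive fewer URLs than their local limit. The plain-Java sketch below illustrates only that arithmetic. It assumes this is the cause, which the tests mentioned above do not confirm; {{ShortfallDemo}}, {{totalGenerated}} and the input numbers are made up for illustration and are not Nutch code.

```java
// Hypothetical illustration; not the actual GeneratorReducer logic.
class ShortfallDemo {

    /** Total generated when every reducer stops at topN / numReducers. */
    static int totalGenerated(int topN, int[] availablePerReducer) {
        int limit = topN / availablePerReducer.length; // integer division
        int total = 0;
        for (int available : availablePerReducer) {
            // A reducer can never emit more than it received.
            total += Math.min(available, limit);
        }
        return total;
    }

    public static void main(String[] args) {
        int topN = 1000; // with four reducers the local limit is 250
        // Evenly spread input: every reducer reaches its limit.
        System.out.println(totalGenerated(topN, new int[] {500, 500, 500, 500})); // 1000
        // Skewed input (made-up numbers): three reducers run out of URLs early.
        System.out.println(totalGenerated(topN, new int[] {1700, 200, 60, 40}));  // 550
    }
}
```

In the skewed case the first reducer is capped at 250 even though it holds most of the URLs, so the job ends well below topN.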



[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-19 Thread Arthur B (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588484#comment-15588484
 ] 

Arthur B commented on NUTCH-2328:
-

Thanks for the heads up. 
Re
* local, per-reducer limit = topN / number of reducers 

I guess this would work as well if you do not mind generating up to {{'topN' mod 
'mapred.reduce.tasks'}} fewer pages than topN. If you care about exact numbers, 
I do not know how expensive the call to the marked counter would be.
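
The remainder can be worked through in a few lines of plain Java. This is only a sketch of the arithmetic, assuming the per-reducer limit is computed with integer division; {{RemainderDemo}} and its method names are hypothetical, not taken from Nutch.

```java
// Hypothetical helper names; only the arithmetic is the point here.
class RemainderDemo {

    /** Per-reducer limit, assuming integer division. */
    static int localLimit(int topN, int numReducers) {
        return topN / numReducers;
    }

    /** Largest total the job can generate when every reducer hits its limit. */
    static int maxGenerated(int topN, int numReducers) {
        return localLimit(topN, numReducers) * numReducers;
    }

    public static void main(String[] args) {
        int topN = 10;
        int numReducers = 3;
        int max = maxGenerated(topN, numReducers);
        System.out.println("limit per reducer: " + localLimit(topN, numReducers)); // 3
        System.out.println("max generated:     " + max);                          // 9
        System.out.println("shortfall:         " + (topN - max));                 // 1, i.e. 10 % 3
    }
}
```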



[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-19 Thread Arthur B (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588478#comment-15588478
 ] 

Arthur B commented on NUTCH-2328:
-

Thanks for the heads up.
Re
* local, per-reducer limit = topN / number of reducers

I guess this would work as well if you do not mind generating up to 
{{topN mod mapred.reduce.tasks}} fewer pages than topN. If you care about exact 
numbers, I do not know how expensive the call to the marked counter would be.




[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-18 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586690#comment-15586690
 ] 

Sebastian Nagel commented on NUTCH-2328:


> the only solution is to have a cluster wide propagated count 

No, this is not required. The solution with an instance variable is by design:
- local, per-reducer limit = topN / number of reducers
- every reducer checks only for the local limit
- in sum, there will be topN URLs generated

The condition is that URLs are evenly distributed across different hosts (at 
least as many as there are reducers), cf. 
[[1|https://www.mail-archive.com/user@nutch.apache.org/msg14499.html]].

A job-wide counter does not guarantee any limits better, because there is no 
control over when reduce tasks are launched. Only if all tasks run in 
parallel, at similar speed and without failures, would an even distribution 
across reducers/parts be achieved. But that will hardly happen in a production 
Hadoop cluster. In a realistic scenario some tasks are launched first and will 
get more URLs. The tasks launched later get fewer or even no URLs. However, to 
achieve an optimal utilization of the fetcher, all parts should be of equal 
size.
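
The effect on part sizes can be sketched with a small plain-Java simulation. It does not use Hadoop. It assumes, for the sake of argument, that tasks could read an up-to-date job-wide count, and it models the extreme case where reduce tasks run strictly one after another. {{PartSizeDemo}} and its methods are hypothetical names, and the input sizes are made up.

```java
// Plain-Java simulation; all names are hypothetical.
class PartSizeDemo {

    /** Every task stops at its own limit of topN / numReducers. */
    static int[] withLocalLimit(int topN, int[] available) {
        int limit = topN / available.length;
        int[] parts = new int[available.length];
        for (int task = 0; task < available.length; task++) {
            parts[task] = Math.min(available[task], limit);
        }
        return parts;
    }

    /**
     * Every task checks one shared count against topN.
     * Assumption: tasks run strictly one after another.
     */
    static int[] withSharedCounter(int topN, int[] available) {
        int generated = 0;
        int[] parts = new int[available.length];
        for (int task = 0; task < available.length; task++) {
            parts[task] = Math.min(available[task], topN - generated);
            generated += parts[task];
        }
        return parts;
    }

    public static void main(String[] args) {
        int[] available = {1000, 1000, 1000, 1000}; // URLs each reduce task could emit
        // Prints [250, 250, 250, 250]
        System.out.println(java.util.Arrays.toString(withLocalLimit(1000, available)));
        // Prints [1000, 0, 0, 0]
        System.out.println(java.util.Arrays.toString(withSharedCounter(1000, available)));
    }
}
```

Both variants generate 1000 URLs in total here, but only the local limit yields parts of equal size for the fetcher.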



[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-18 Thread Arthur B (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586425#comment-15586425
 ] 

Arthur B commented on NUTCH-2328:
-

I don't think this has anything to do with Spring/Hadoop: that setup just 
reproduces the issue. I believe you can reproduce it from a standalone app that 
submits the GeneratorJob twice to Hadoop (I reckon we can disregard the Spring 
Data Hadoop aspect). 
Regarding the locality of the {{count}} variable: I do not believe that turning 
it into an {{instance}} member would solve the issue, unless I am missing 
something. My reasoning is that once a {{GeneratorJob}} is submitted to Hadoop, 
you cannot count on it residing on only one Hadoop node. Potentially the 
{{M/R job}} will run on multiple Hadoop nodes, and the {{M/R job}} should not 
hold state as such (you cannot count on its locality). So in my opinion the 
only solution is to have a cluster wide propagated {{count}}er that keeps track 
of how many {{Webpage}}s have been dealt with. 
AFAIK a local class {{instance}} variable would not do this (propagate a 
cluster wide counter)... unless I am missing something. It can be easily tested 
by running it on a two-machine cluster and letting more than one reducer run. 



[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-18 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586031#comment-15586031
 ] 

Sebastian Nagel commented on NUTCH-2328:


I don't know what's specific to Spring for Hadoop, but in (pseudo-)distributed 
mode an instance variable should be task-local and not shared across the 
cluster.  If I understood correctly, the problem is, strictly speaking, not that 
the variable is shared but that it survives the life cycle of a MapReduce task. 
 Normally there is only one Mapper or Reducer object per JVM. By configuration 
there can be more in parallel threads, but every task should have its own 
instance. If {{limit}} is per reduce task (= topN / numReducers), {{count}} 
should be as well. Otherwise, with multiple reducers, the generator stops too 
early. And it must be per task because no global values are predictable in a 
distributed environment: if reduce tasks fail, the global job counts can move 
backwards. 
Btw., I'm even not sure whether counters in the task context are updated with 
global job values.



[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-18 Thread Arthur B (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585843#comment-15585843
 ] 

Arthur B commented on NUTCH-2328:
-

The environment in which this was observed:
* Spring batched jobs / Spring Data Hadoop, submitting GeneratorJob M/R jobs, 
so obviously the GeneratorReducer#count survived as static;
* in a pseudo-distributed Hadoop setup

Maybe I am missing something, but the reason I was cautious about making count 
an instance field is that this counter would have to be seen across the whole 
Hadoop cluster, right? So that's why I relied on the M/R counter to sync the 
number of HBase pages counted as processed across the cluster.



[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run

2016-10-18 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585585#comment-15585585
 ] 

Sebastian Nagel commented on NUTCH-2328:


Thanks, [~arthur-evozon]. Good catch!

In which environment was the problem observed? E.g.,
- running Nutch via bin/crawl or via the Nutch server?
- in a local / pseudo-distributed / fully distributed Hadoop environment?

The variable {{count}} should not be static; that's definitely a problem when 
running multiple Generator jobs from a long-running Nutch server in local mode, 
where all tasks run in the same JVM.

The variable {{limit}} is a per-task limit (see how it's initialized in 
{{setup(context)}}), so comparing it with a global counter seems wrong. Also, 
retrieving the counter in every call of the reduce function may be too 
expensive. Why not make {{count}} an instance variable?
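
To make the difference concrete, here is a minimal plain-Java sketch with no Hadoop dependency. {{StaticCountReducer}} and {{InstanceCountReducer}} are hypothetical stand-ins, not the actual {{GeneratorReducer}}; they only mimic a check of {{count}} against {{limit}}. Two "jobs" run one after another in the same JVM, as would happen in local mode:

```java
// Hypothetical stand-ins for GeneratorReducer; not the actual Nutch code.
class StaticCountReducer {
    // Shared by every instance in the JVM, so the value survives across jobs.
    static long count = 0;
    private final long limit;

    StaticCountReducer(long limit) { this.limit = limit; }

    /** Emits up to the limit of the offered URLs; returns how many were emitted. */
    int reduce(int offered) {
        int emitted = 0;
        for (int i = 0; i < offered; i++) {
            if (count >= limit) break; // limit may already be used up by an earlier job
            count++;
            emitted++;
        }
        return emitted;
    }
}

class InstanceCountReducer {
    // One counter per reducer object, i.e. per task.
    private long count = 0;
    private final long limit;

    InstanceCountReducer(long limit) { this.limit = limit; }

    /** Same logic as above, but the counter starts at 0 for every new object. */
    int reduce(int offered) {
        int emitted = 0;
        for (int i = 0; i < offered; i++) {
            if (count >= limit) break;
            count++;
            emitted++;
        }
        return emitted;
    }
}

class CountDemo {
    public static void main(String[] args) {
        // Each "job" gets a fresh reducer object, but the JVM stays the same.
        System.out.println("static, run 1:   " + new StaticCountReducer(10).reduce(25));   // 10
        System.out.println("static, run 2:   " + new StaticCountReducer(10).reduce(25));   // 0
        System.out.println("instance, run 1: " + new InstanceCountReducer(10).reduce(25)); // 10
        System.out.println("instance, run 2: " + new InstanceCountReducer(10).reduce(25)); // 10
    }
}
```

The second static run emits nothing because the counter still holds the value left behind by the first run, which matches the "generated 0 URLs" symptom in the report.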


