[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588892#comment-15588892 ]

Arthur B commented on NUTCH-2328:

Well then ... the issue stands. Cheers for the info

> GeneratorJob does not generate anything on second run
>
> Key: NUTCH-2328
> URL: https://issues.apache.org/jira/browse/NUTCH-2328
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.2, 2.3, 2.2.1, 2.3.1
> Environment: Ubuntu 16.04 / Hadoop 2.7.1
> Reporter: Arthur B
> Labels: fails, generator, subsequent
> Fix For: 2.4
>
> Attachments: generator-issue-static-count.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Given a topN parameter (ie 10) the GeneratorJob will fail to generate
> anything new on the subsequent runs within the same process space.
> To reproduce the issue submit the GeneratorJob twice one after another to the
> M/R framework. Second time will say it generated 0 URLs.
> This issue is due to the usage of the static count field
> (org.apache.nutch.crawl.GeneratorReducer#count) to determine if the topN
> value has been reached.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588759#comment-15588759 ]

Sebastian Nagel commented on NUTCH-2328:

Hi [~arthur-evozon],

> Btw., I'm even not sure whether counters in the task context are updated with
> global job values.

The answer is: counters in the task context *do not show aggregated values* over all tasks of a job. That's more or less clear from these definitions: "Counter is a facility for MapReduce applications to report its statistics." [[1|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Counter]], with more details in Tom White's "Hadoop: The Definitive Guide".

Counter values are propagated and aggregated in this direction: task > task tracker > job tracker > job client. The aggregated values are not sent back. There are hacks to access the aggregated values from the tasks, cf. [[2|http://stackoverflow.com/questions/8009802/is-there-a-way-to-access-number-of-successful-map-tasks-from-a-reduce-task-in-an/8013573#8013573]].

But the central problem persists: the value of a job counter depends mostly on how many tasks are in progress or already done. If the programming logic relies on such a value, it becomes largely nondeterministic.
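The one-way flow of counter values described in this comment can be pictured with a toy model. This is only a sketch in plain Java: the names `CounterFlowDemo`, `Task`, `readCounter` and `aggregate` are made up for illustration and are not part of the Hadoop API. It assumes each task keeps its own counter and that only the client side ever sums them.

```java
import java.util.List;

/** Toy model of one-way counter aggregation; not the Hadoop API. */
class CounterFlowDemo {

    /** Hypothetical stand-in for a reduce task with a task-local counter. */
    static class Task {
        private long localCounter = 0;

        void increment() {
            localCounter++;
        }

        /** What the task itself can read: only its own increments. */
        long readCounter() {
            return localCounter;
        }
    }

    /** What the job client sees: the sum over all tasks, never sent back to them. */
    static long aggregate(List<Task> tasks) {
        long sum = 0;
        for (Task t : tasks) {
            sum += t.readCounter();
        }
        return sum;
    }
}
```

In this model a task that incremented its counter three times reads 3, no matter how many increments other tasks made; only `aggregate` returns the job-wide total.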
[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588681#comment-15588681 ]

Arthur B commented on NUTCH-2328:

Just making one last case for using a global job counter (with the caveat that I am not as familiar with the internals of Nutch as you are):
* you are already incrementing this counter, so I am wondering how much more overhead a read would add; and
* what I noticed from the tests that I ran is that you will actually be losing more than {{topN mod mapred.reduce.tasks}} due to reducer starvation (I think) - roughly 60-70% of topN was achieved in my tests.

Just a thought.
[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588484#comment-15588484 ]

Arthur B commented on NUTCH-2328:

Thanks for the heads up.

Re
* local, per-reducer limit = topN / number of reducers

I guess if you do not care about failing to generate {{'topN' mod 'mapred.reduce.tasks'}} pages this would work as well. I do not know how expensive the call to the marked counter would be (if you care about exact numbers).
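The remainder mentioned in this comment can be worked through with a few lines of arithmetic. This is a minimal sketch, assuming the per-reducer limit is computed by integer division as "topN / number of reducers" suggests; the class and method names are hypothetical and not taken from Nutch.

```java
/** Arithmetic sketch of the per-reducer limit and its remainder; names are illustrative. */
class TopNShortfall {

    /** Local limit per reducer, assuming integer division. */
    static long perReducerLimit(long topN, int numReducers) {
        return topN / numReducers;
    }

    /** Total generated if every reducer reaches exactly its local limit. */
    static long totalGenerated(long topN, int numReducers) {
        return perReducerLimit(topN, numReducers) * numReducers;
    }

    /** URLs that are never generated; equal to topN mod numReducers. */
    static long shortfall(long topN, int numReducers) {
        return topN - totalGenerated(topN, numReducers);
    }
}
```

For topN = 10 and 3 reducers the local limit is 3, so at most 9 URLs are generated and 1 (= 10 mod 3) is lost; with 2 reducers nothing is lost.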
[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588478#comment-15588478 ]

Arthur B commented on NUTCH-2328:

Thanks for the heads up.

>> * local, per-reducer limit = topN / number of reducers

I guess if you do not care about failing to generate {{ topN mod mapred.reduce.tasks }} pages this would work as well. I do not know how expensive the call to the marked counter would be (if you care about exact numbers).
[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586690#comment-15586690 ]

Sebastian Nagel commented on NUTCH-2328:

> the only solution is to have a cluster wide propagated count

No, this is not required. The solution with an instance variable is by design:
- local, per-reducer limit = topN / number of reducers
- every reducer checks only for the local limit
- in sum, there will be topN URLs generated

The condition is that URLs are evenly distributed across different hosts (at least as many as there are reducers), cf. [[1|https://www.mail-archive.com/user@nutch.apache.org/msg14499.html]].

A job-wide counter does not guarantee any limits better, because there is no control over when reduce tasks are launched. Only if all tasks run in parallel, at similar speed, and no task fails would an even distribution across reducers/parts be achieved. But that will hardly happen in a production Hadoop cluster. In a realistic scenario some tasks are launched first and will get more URLs. The tasks launched later get fewer or even no URLs. However, to achieve optimal utilization of the fetcher, all parts should be of equal size.
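The difference between a local per-reducer limit and a job-wide limit can be sketched with a small simulation. This is a simplified model, not Nutch code: the class and method names are hypothetical, and the job-wide variant assumes the extreme case where reduce tasks run strictly one after another in index order.

```java
/** Simplified simulation of local vs. job-wide limits; names are illustrative. */
class PerReducerLimitDemo {

    /** Every reducer stops at its own limit = topN / number of reducers. */
    static long[] withLocalLimit(long topN, long[] urlsPerReducer) {
        long limit = topN / urlsPerReducer.length;
        long[] generated = new long[urlsPerReducer.length];
        for (int i = 0; i < urlsPerReducer.length; i++) {
            // A reducer cannot generate more URLs than it received.
            generated[i] = Math.min(limit, urlsPerReducer[i]);
        }
        return generated;
    }

    /** One shared limit; assumes tasks run sequentially in index order. */
    static long[] withGlobalLimit(long topN, long[] urlsPerReducer) {
        long remaining = topN;
        long[] generated = new long[urlsPerReducer.length];
        for (int i = 0; i < urlsPerReducer.length; i++) {
            // Earlier tasks consume the budget before later ones start.
            generated[i] = Math.min(remaining, urlsPerReducer[i]);
            remaining -= generated[i];
        }
        return generated;
    }
}
```

With topN = 10 and two reducers holding 20 URLs each, the local limit yields parts of size 5 and 5, while the sequential job-wide limit yields 10 and 0. If the URLs are unevenly distributed, for example 2 and 20, the local limit yields 2 and 5, which is why the even-distribution condition matters.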
[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586425#comment-15586425 ]

Arthur B commented on NUTCH-2328:

I don't think this has anything to do with Spring/Hadoop: it just recreates the issue. I believe you can recreate this from a standalone app that submits the GeneratorJob twice to Hadoop (I reckon we can disregard that aspect - Spring Data Hadoop).

Regarding the locality of the {{count}} variable: I do not believe that turning it into an {{instance}} member would solve the issue, unless I am missing something. My reasoning is that once a {{GeneratorJob}} is submitted to Hadoop, you cannot count on it residing on only one Hadoop node. Potentially the {{M/R job}} will run on multiple Hadoop nodes, and the {{M/R job}} should not have state as such (you cannot count on its locality).

So in my opinion the only solution is to have a cluster-wide propagated {{count}}er that keeps track of how many {{Webpage}}s have been dealt with. AFAIK a local class {{instance}} variable would not do this (propagate a cluster-wide counter)... unless I am missing something. It can easily be tested by running it on a 2-machine cluster and letting more than one reducer run.
[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586031#comment-15586031 ]

Sebastian Nagel commented on NUTCH-2328:

I don't know what's specific to Spring for Hadoop, but in (pseudo)distributed mode an instance variable should be task-local and not shared across the cluster. If I understood correctly, the problem is, strictly speaking, not that the variable is shared but that it survives the life cycle of a mapreduce task. Normally there is only one Mapper or Reducer object per JVM. By configuration there can be more in parallel threads, but every task should have its own instance.

If {{limit}} is per reduce task (= topN / numReducers), {{count}} should be too. Otherwise, with multiple reducers the generator stops too early. And it must be per task because no globals are predictable in a distributed environment: if reduce tasks fail, the global job counts can move backwards.

Btw., I'm even not sure whether counters in the task context are updated with global job values.
[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585843#comment-15585843 ]

Arthur B commented on NUTCH-2328:

The environment in which this was observed:
* spring batched jobs / spring data hadoop, submitting GeneratorJob M/R jobs, so obviously the GeneratorReducer#count survived as static;
* in a pseudo-distributed Hadoop setup

Maybe I was missing something, but the reason I was cautious about making count an instance field is that this counter would have to be seen across the whole Hadoop cluster, right? So that's why I relied on the M/R counter to sync across the actual hbase pages counted as processed.
[jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
[ https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585585#comment-15585585 ]

Sebastian Nagel commented on NUTCH-2328:

Thanks, [~arthur-evozon]. Good catch!

In which environment was the problem observed? E.g.,
- running Nutch via bin/crawl or via Nutch server?
- in a local / pseudo-distributed / fully distributed Hadoop environment?

The variable {{count}} should not be static. That's definitely a problem when running multiple Generator jobs from a long-running Nutch server in local mode, where all tasks are run in the same JVM.

The variable {{limit}} is a per-task limit (see how it's initialized in {{setup(context)}}), so comparing it with a global counter seems wrong. Also, retrieving the counter in every call of the reduce function may be too expensive.

Why not make {{count}} an instance variable?
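The effect of a static {{count}} on a second job in the same JVM can be reproduced with a toy model. This is a minimal sketch in plain Java, not the actual GeneratorReducer: the classes `StaticCountReducer` and `InstanceCountReducer` and the helper `runJob` are hypothetical, and the model assumes each job creates a fresh reducer object, as a new task would.

```java
import java.util.function.Predicate;

/** Toy model of the static-counter bug; not Nutch's GeneratorReducer. */
class StaticCountDemo {

    /** Counter is static, so it outlives the job that incremented it. */
    static class StaticCountReducer implements Predicate<String> {
        static long count = 0; // shared by every instance in this JVM
        private final long limit;

        StaticCountReducer(long limit) {
            this.limit = limit;
        }

        @Override
        public boolean test(String url) {
            if (count >= limit) {
                return false; // limit may already be used up by an earlier job
            }
            count++;
            return true;
        }
    }

    /** Same logic, but the counter starts at 0 for every new instance. */
    static class InstanceCountReducer implements Predicate<String> {
        private long count = 0;
        private final long limit;

        InstanceCountReducer(long limit) {
            this.limit = limit;
        }

        @Override
        public boolean test(String url) {
            if (count >= limit) {
                return false;
            }
            count++;
            return true;
        }
    }

    /** Offers numUrls candidate URLs to the reducer; returns how many it accepted. */
    static int runJob(Predicate<String> reducer, int numUrls) {
        int generated = 0;
        for (int i = 0; i < numUrls; i++) {
            if (reducer.test("http://example.org/page" + i)) {
                generated++;
            }
        }
        return generated;
    }
}
```

Running `runJob(new StaticCountReducer(10), 25)` twice in a fresh JVM accepts 10 URLs the first time and 0 the second, which matches the reported symptom. With `InstanceCountReducer` both runs accept 10.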