[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090480#comment-17090480 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279:
URL: https://github.com/apache/nutch/pull/279#discussion_r413699673

## File path: src/bin/crawl
## @@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
Hi @mfeltscher, this PR is now superseded by #513 - I've decided not to add any new environment variables but to document how the task memory can be set using the existing command-line flags:

```
$> bin/crawl -D mapreduce.map.memory.mb=4608 -D mapreduce.map.java.opts=-Xmx4096m \
     -D mapreduce.reduce.memory.mb=4608 -D mapreduce.reduce.java.opts=-Xmx4096m ...
```

Thanks for the contribution and the discussion!

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2501
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2501
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.14
>            Reporter: Moreno Feltscher
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
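The flags in the example above keep the JVM heap below the YARN container size. A minimal shell sketch of that arithmetic (the 512 MB headroom is an assumption, a common rule of thumb to leave room for metaspace, thread stacks, and native buffers; it is not a Nutch default):

```shell
#!/bin/sh
# Container memory granted by YARN for a map task, in MB (example value)
MAP_MEMORY_MB=4608
# The -Xmx heap must stay below the container limit; reserve headroom
# for JVM metaspace, thread stacks, and native buffers.
HEAP_MB=$(expr $MAP_MEMORY_MB - 512)

echo "bin/crawl -D mapreduce.map.memory.mb=$MAP_MEMORY_MB -D mapreduce.map.java.opts=-Xmx${HEAP_MB}m ..."
```

With 4608 MB containers this yields the `-Xmx4096m` used in the documented command line.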
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090475#comment-17090475 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel opened a new pull request #513:
URL: https://github.com/apache/nutch/pull/513

- bin/crawl
  - add a hint how to set map and reduce task memory via -D... options
  - use -D options for all steps (Nutch tools)
  - fix quoting of -D options, e.g. -D plugin.includes='protocol-xyz|parse-xyz'
- bin/nutch
  - document that environment variables are only used in local mode (includes #512 / NUTCH-2781)
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470312#comment-16470312 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r187313029

## File path: src/bin/crawl
## @@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
To keep it simple, I would suggest to
1. remove the explicit `mapred.child.java.opts`
2. add a new environment variable `NUTCH_HADOOP_OPTS` which is used to add further options to the commonOptions

In local mode it's still sufficient to use the environment variable NUTCH_OPTS to set the Java heap size. In distributed mode, this would allow fine-grained memory settings, e.g.

```
export NUTCH_HADOOP_OPTS="-Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7680m"
```

These are required because the Java heap must be lower than the total memory allocated for a map/reduce task.
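The constraint stated above - the heap must be lower than the total task memory - can be checked mechanically. A hypothetical sanity-check sketch (not part of bin/crawl or bin/nutch) that parses an options string like the NUTCH_HADOOP_OPTS example and verifies the relation:

```shell
#!/bin/sh
# Hypothetical check: extract the container size and the -Xmx heap from an
# options string like the NUTCH_HADOOP_OPTS example above, and verify that
# the heap is strictly smaller than the container memory.
OPTS="-Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7680m"

CONTAINER_MB=$(echo "$OPTS" | sed -n 's/.*memory\.mb=\([0-9]*\).*/\1/p')
HEAP_MB=$(echo "$OPTS" | sed -n 's/.*-Xmx\([0-9]*\)m.*/\1/p')

if [ "$HEAP_MB" -lt "$CONTAINER_MB" ]; then
  echo "ok: heap ${HEAP_MB}m fits into the ${CONTAINER_MB}mb container"
else
  echo "error: -Xmx must be lower than mapreduce.map.memory.mb" >&2
  exit 1
fi
```

If the heap equals or exceeds the container size, YARN will kill the task container once the JVM grows past the limit, so failing fast here is the safer behavior.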
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347729#comment-16347729 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

mfeltscher commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r165213301

## File path: src/bin/crawl
## @@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
@sebastian-nagel Any comments on this? :)
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343495#comment-16343495 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

mfeltscher commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164465786

## File path: src/bin/crawl
## @@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
OK, I see. So my first approach was actually better in this regard? => https://github.com/apache/nutch/pull/279/commits/38a4c7038a0a67fe696640f221cb1fdb214c2718
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343307#comment-16343307 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164417155

## File path: src/bin/crawl
## @@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
In [local mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation) all reducer tasks run in a single JVM instance. Only in [pseudo-distributed mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation) could this make some sense, provided that the remaining resources (e.g. the number of CPUs) make it possible to run all reduce tasks in parallel.

In distributed mode you want to define the max. heap size based on the configuration of your cluster nodes, because that defines how many parallel tasks can be run on every node (in combination with other resource limits). The heap size configured for a single task is usually used to define what is required to run the task without running into an out-of-memory error. The YARN resource manager verifies that the heap size configured for the job tasks does not exceed the resource limits configured on the cluster nodes; otherwise the job will fail.
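The sizing relationship described above - per-task container memory determines the degree of parallelism per node - can be illustrated with a small arithmetic sketch. All numbers here are assumptions for illustration, not Nutch or Hadoop defaults:

```shell
#!/bin/sh
# Assumed node capacity (yarn.nodemanager.resource.memory-mb on the node)
NODE_MEMORY_MB=65536
# Assumed per-task container size (mapreduce.map.memory.mb)
TASK_MEMORY_MB=4608

# YARN packs as many containers onto a node as its memory allows,
# so a larger per-task allocation means fewer parallel tasks per node.
PARALLEL_TASKS=$(expr $NODE_MEMORY_MB / $TASK_MEMORY_MB)
echo "$PARALLEL_TASKS parallel tasks per node"
```

This is why the comment above says the heap should be derived from the cluster node configuration rather than from the reducer count: the node capacity, not the job, bounds per-node parallelism.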
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343252#comment-16343252 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

mfeltscher commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164405090

## File path: src/bin/crawl
## @@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
Good point. This change is only supposed to be applied when crawling in `local` mode, where, as far as I can tell, it would make sense to split the amount of memory. What do you think?
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340744#comment-16340744 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164052993

## File path: src/bin/crawl
## @@ -192,7 +198,7 @@ fi
 # note that some of the options listed here could be set in the
 # corresponding hadoop site xml param file
-commonOptions="-D mapreduce.job.reduces=$NUM_TASKS -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true"

Review comment:
OK: remove the explicit `mapred.child.java.opts`, so that the settings from environment variables are not overwritten in bin/nutch.
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340743#comment-16340743 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164054172

## File path: src/bin/crawl
## @@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
Why should the heap size depend on the number of reducers? For a large-scale crawl the reducers will run independently on different nodes, possibly also sequentially if there are not enough computing resources available. Since mapred.child.java.opts is also used for the map tasks, and it's often not possible to force a fixed number of map tasks, it's better to define the heap size per task (usually via mapreduce.map.java.opts and mapreduce.reduce.java.opts).
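The problem with the divided-heap approach in the patch can be shown with assumed numbers: raising the reducer count shrinks each task's heap, even though in distributed mode the reducers do not share a JVM.

```shell
#!/bin/sh
# Global heap budget as in the patch (example value, not a Nutch default)
NUTCH_HEAP_MB=4000

# The patch divides this budget by the reducer count:
HEAP_2_TASKS=$(expr $NUTCH_HEAP_MB / 2)   # 2000 MB per task
HEAP_8_TASKS=$(expr $NUTCH_HEAP_MB / 8)   # 500 MB per task - likely too small

echo "2 reducers: ${HEAP_2_TASKS}m per task, 8 reducers: ${HEAP_8_TASKS}m per task"
```

A per-task setting such as `mapreduce.reduce.java.opts=-Xmx2000m` stays constant regardless of the reducer count, which is the point made in the review comment above.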
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340745#comment-16340745 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164053621

## File path: src/bin/crawl
## @@ -105,6 +105,10 @@ SIZE_FETCHLIST=5 # 25K x NUM_TASKS
 TIME_LIMIT_FETCH=180
 NUM_THREADS=50
 SITEMAPS_FROM_HOSTDB_FREQUENCY=never
+NUTCH_HEAP_MB=2000

Review comment:
bin/nutch already allows overwriting the Java heap size via the environment variable [NUTCH_HEAPSIZE](https://github.com/apache/nutch/blob/e533ab21b18cf81a49e052185562a7e6489ec4d6/src/bin/nutch#L24). Wouldn't it be simpler to set the environment variable and let bin/nutch add the `-D...` option?
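A simplified sketch of the NUTCH_HEAPSIZE handling in bin/nutch (condensed from the linked script; the 1000 MB default is what the script uses for local-mode runs):

```shell
#!/bin/sh
# Default heap for local-mode runs, as in bin/nutch
JAVA_HEAP_MAX=-Xmx1000m
# If NUTCH_HEAPSIZE is set (a value in MB), it overrides the default
if [ "$NUTCH_HEAPSIZE" != "" ]; then
  JAVA_HEAP_MAX="-Xmx${NUTCH_HEAPSIZE}m"
fi
echo "$JAVA_HEAP_MAX"
```

For example, `export NUTCH_HEAPSIZE=4000` before running bin/nutch in local mode gives the child JVM a 4000 MB heap.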
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340742#comment-16340742 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164055125

## File path: src/bin/crawl
## @@ -105,6 +105,10 @@ SIZE_FETCHLIST=5 # 25K x NUM_TASKS
 TIME_LIMIT_FETCH=180
 NUM_THREADS=50
 SITEMAPS_FROM_HOSTDB_FREQUENCY=never
+NUTCH_HEAP_MB=2000

Review comment:
I've just seen that NUTCH_HEAPSIZE (and also NUTCH_OPTS) isn't used by bin/nutch in distributed mode ([L326](https://github.com/apache/nutch/blob/e533ab21b18cf81a49e052185562a7e6489ec4d6/src/bin/nutch#L326)). If this was/is the problem, I would also fix it in bin/nutch.
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335999#comment-16335999 ]

Moreno Feltscher commented on NUTCH-2501:
-----------------------------------------

Pull request: https://github.com/apache/nutch/pull/279