[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2020-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090480#comment-17090480
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279:
URL: https://github.com/apache/nutch/pull/279#discussion_r413699673



##
File path: src/bin/crawl
##
@@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
   Hi @mfeltscher, this PR is now superseded by #513 - I've decided not to
add any new environment variables but to document how the task memory can be
set using the existing command-line flags:
   ```
   $> bin/crawl -D mapreduce.map.memory.mb=4608 -D mapreduce.map.java.opts=-Xmx4096m \
        -D mapreduce.reduce.memory.mb=4608 -D mapreduce.reduce.java.opts=-Xmx4096m ...
   ```
   Thanks for the contribution and the discussion!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Moreno Feltscher
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2020-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090475#comment-17090475
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel opened a new pull request #513:
URL: https://github.com/apache/nutch/pull/513


   - bin/crawl
     - add a hint on how to set the map and reduce task memory via -D ... options
     - use -D options for all steps (Nutch tools)
     - fix quoting of -D options, e.g. -D plugin.includes='protocol-xyz|parse-xyz' (see the quoting sketch below)
   - bin/nutch
     - document that environment variables are only used in local mode
   
   (includes #512 / NUTCH-2781)
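A hedged sketch of the quoting point above, not the actual bin/crawl code: the
array name and the readdb invocation are just examples, but they show why a
value like plugin.includes (which contains a `|`) must stay quoted all the way
down to the bin/nutch call.

```
#!/bin/bash
# Sketch only: collect -D options in a bash array so that values containing
# shell metacharacters (here the '|' in plugin.includes) survive word splitting.
NUTCH_D_OPTS=(-D "plugin.includes=protocol-xyz|parse-xyz" -D "mapreduce.job.reduces=2")

# Quoting the array expansion hands every option to bin/nutch as one argument.
bin/nutch readdb "${NUTCH_D_OPTS[@]}" crawl/crawldb -stats
```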



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Moreno Feltscher
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470312#comment-16470312
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r187313029
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   To keep it simple, I would suggest to
   1. remove the explicit `mapred.child.java.opts`
   2. add a new environment variable `NUTCH_HADOOP_OPTS` whose content is
      appended to the commonOptions (see the sketch below)
   
   In local mode it is still sufficient to use the environment variable
NUTCH_OPTS to set the Java heap size. In distributed mode, this would allow
fine-grained memory settings, e.g.
   ```
   export NUTCH_HADOOP_OPTS="-Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7680m"
   ```
   Both settings are required, as the Java heap must be lower than the total
memory allocated for a map/reduce task.
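A minimal sketch of what the two points could look like in bin/crawl, assuming
NUTCH_HADOOP_OPTS as the proposed (not yet existing) variable; the
commonOptions value is the one from the script with mapred.child.java.opts
dropped:

```
# Point 1 (sketch): commonOptions without the hard-coded child heap
NUM_TASKS=${NUM_TASKS:-2}
commonOptions="-D mapreduce.job.reduces=$NUM_TASKS -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true"

# Point 2 (sketch): append whatever the user puts into the proposed NUTCH_HADOOP_OPTS
if [ -n "$NUTCH_HADOOP_OPTS" ]; then
  commonOptions="$commonOptions $NUTCH_HADOOP_OPTS"
fi
```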


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347729#comment-16347729
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

mfeltscher commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r165213301
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   @sebastian-nagel Any comments on this? :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343495#comment-16343495
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

mfeltscher commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164465786
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   OK, I see. So my first approach was actually better regarding this? => 
https://github.com/apache/nutch/pull/279/commits/38a4c7038a0a67fe696640f221cb1fdb214c2718


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343307#comment-16343307
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164417155
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   In [local mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation)
all reducer tasks run in a single JVM instance. Only in [pseudo-distributed
mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation)
could this make some sense, given that the remaining resources (e.g. the number
of CPUs) make it possible to run all reduce tasks in parallel. In distributed
mode you want to define the maximum heap size based on the configuration of
your cluster nodes, because that determines how many parallel tasks can run on
every node (in combination with other resource limits). The heap size
configured for a single task is usually chosen as what is required to run the
task without hitting an out-of-memory error. The YARN resource manager verifies
that the heap size configured for the job tasks does not exceed the resource
limits configured on the cluster nodes; otherwise the job will fail.
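To make the container/heap relationship concrete, a hedged example: the
numbers and the trailing crawl arguments are placeholders, and the -D
pass-through follows the bin/crawl usage documented in #513 rather than the
script as it stood at the time.

```
# Example values only: the per-task -Xmx heap must stay below the YARN
# container size requested via mapreduce.*.memory.mb.
CONTAINER_MB=4608                      # memory requested per task container
HEAP_MB=$(( CONTAINER_MB * 8 / 9 ))    # keep headroom for non-heap memory (4096m here)

bin/crawl -D mapreduce.map.memory.mb="$CONTAINER_MB" \
          -D mapreduce.map.java.opts="-Xmx${HEAP_MB}m" \
          -D mapreduce.reduce.memory.mb="$CONTAINER_MB" \
          -D mapreduce.reduce.java.opts="-Xmx${HEAP_MB}m" \
          -s urls/ crawl/ 2            # seed dir, crawl dir, rounds (placeholders)
```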


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343252#comment-16343252
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

mfeltscher commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164405090
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   Good point. This change is only supposed to be applied when crawling in
`local` mode, where, as far as I can tell, it would make sense to split the
amount of memory. What do you think?
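For illustration only, one way the suggested split could be limited to local
mode. The `mode` variable is hypothetical here, since bin/crawl would need the
same local/distributed detection that bin/nutch performs, and (as the reply
above points out) in local mode all tasks share a single JVM anyway.

```
# Sketch of the suggestion, not an endorsed implementation.
NUTCH_HEAP_MB=${NUTCH_HEAP_MB:-2000}
NUM_TASKS=${NUM_TASKS:-2}
if [ "$mode" = "local" ]; then   # hypothetical mode check
  JAVA_CHILD_HEAP_MB=$(( NUTCH_HEAP_MB / NUM_TASKS ))
fi
```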


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340744#comment-16340744
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164052993
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -192,7 +198,7 @@ fi
 
 # note that some of the options listed here could be set in the
 # corresponding hadoop site xml param file
-commonOptions="-D mapreduce.job.reduces=$NUM_TASKS -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true"
 
 Review comment:
   OK, so the explicit `mapred.child.java.opts` is removed so that the settings
coming from the environment variables in bin/nutch are not overwritten.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340743#comment-16340743
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164054172
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   Why should the heap size depend on the number of reducers? For a large-scale
crawl the reducers will run independently on different nodes, possibly also
sequentially if there are not enough computing resources available. Since
mapred.child.java.opts is also used for the map tasks, and it is often not
possible to force a fixed number of map tasks, it is better to define the heap
size per task (usually via mapreduce.map.java.opts and
mapreduce.reduce.java.opts).
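A hedged illustration of the per-task approach: the heap values and the
trailing crawl arguments are placeholders, and the -D pass-through follows the
bin/crawl usage documented in #513.

```
# Example only: size map and reduce heaps independently per task instead of
# dividing one global budget by the reducer count.
bin/crawl -D mapreduce.map.java.opts=-Xmx2048m \
          -D mapreduce.reduce.java.opts=-Xmx4096m \
          -s urls/ crawl/ 2   # seed dir, crawl dir, rounds (placeholders)
```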


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340745#comment-16340745
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164053621
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -105,6 +105,10 @@ SIZE_FETCHLIST=5 # 25K x NUM_TASKS
 TIME_LIMIT_FETCH=180
 NUM_THREADS=50
 SITEMAPS_FROM_HOSTDB_FREQUENCY=never
+NUTCH_HEAP_MB=2000
 
 Review comment:
   bin/nutch already allows overriding the Java heap size via the environment
variable
[NUTCH_HEAPSIZE](https://github.com/apache/nutch/blob/e533ab21b18cf81a49e052185562a7e6489ec4d6/src/bin/nutch#L24).
Wouldn't it be simpler to set the environment variable and let bin/nutch add
the `-D...` option?
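A usage sketch of that simpler route (local mode; the crawl arguments are
placeholders): NUTCH_HEAPSIZE is interpreted by bin/nutch in megabytes and
becomes the JVM heap limit.

```
# Usage sketch: let bin/nutch derive the heap from NUTCH_HEAPSIZE (in MB)
# instead of computing it inside bin/crawl; here this yields -Xmx4096m.
export NUTCH_HEAPSIZE=4096
bin/crawl -s urls/ crawl/ 2   # seed dir, crawl dir, rounds (placeholders)
```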


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340742#comment-16340742
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164055125
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -105,6 +105,10 @@ SIZE_FETCHLIST=5 # 25K x NUM_TASKS
 TIME_LIMIT_FETCH=180
 NUM_THREADS=50
 SITEMAPS_FROM_HOSTDB_FREQUENCY=never
+NUTCH_HEAP_MB=2000
 
 Review comment:
   I've just seen that NUTCH_HEAPSIZE (and also NUTCH_OPTS) is not used by
bin/nutch in distributed mode
([L326](https://github.com/apache/nutch/blob/e533ab21b18cf81a49e052185562a7e6489ec4d6/src/bin/nutch#L326)).
If that was/is the actual problem, I would fix it in bin/nutch as well.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-23 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335999#comment-16335999
 ] 

Moreno Feltscher commented on NUTCH-2501:
-

Pull request: https://github.com/apache/nutch/pull/279

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)