date:20180126

[jira] [Created] (NUTCH-2505) nutch does not delete the .locked file, when the generator partition got an exception

2018-01-26 Thread Ajoy Lian (JIRA)

Ajoy Lian created NUTCH-2505:


 Summary: nutch does not delete the .locked file, when the 
generator partition got an exception
 Key: NUTCH-2505
 URL: https://issues.apache.org/jira/browse/NUTCH-2505
 Project: Nutch
  Issue Type: Bug
  Components: generator
Reporter: Ajoy Lian


Nutch does not delete the .locked file when the generator's partition step 
throws an exception. 
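
The cleanup pattern the report asks for can be sketched in shell. This is only an illustration and not Nutch's actual generator code (which is Java): the helper name `run_with_lock` and the lock path are hypothetical. The point is that the lock is removed on both the success and the failure path of the guarded step.

```sh
#!/bin/sh
# Hypothetical helper (not part of Nutch): run a command while holding a
# lock file and always remove the lock afterwards, even if the command fails.
run_with_lock() {
  lock_file="$1"
  shift
  # Create the lock; give up if it cannot be written.
  : > "$lock_file" || return 1
  "$@"
  status=$?
  # Cleanup runs whether the guarded command succeeded or failed.
  rm -f "$lock_file"
  return "$status"
}

# Example: the guarded step fails, yet the lock file is gone afterwards.
# The path below is an example only.
run_with_lock "${TMPDIR:-/tmp}/example.locked" false \
  || echo "guarded step failed with status $?"
```

In Java the same idea would normally be expressed with a `try`/`finally` block around the step that can throw.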



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2018-01-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341852#comment-16341852
 ] 

ASF GitHub Bot commented on NUTCH-2202:
---

HansBrende commented on issue #97: NUTCH-2202 Integration of Anthelion (Focused 
Crawling Module) into Nutch
URL: https://github.com/apache/nutch/pull/97#issuecomment-360942425
 
 
   @lewismc I can't get your NUTCH-2202 branch to build.
   
   I'm doing:
   ```
   git clone https://github.com/lewismc/nutch
   cd nutch
   git checkout NUTCH-2202
   ant
   ```
   
   which is giving me:
   
   Buildfile: /Users/hansbrende/nutch/build.xml
   Trying to override old definition of task javac
   [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

   ivy-probe-antlib:

   ivy-download:
   [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

   .
   .
   .

   init:
   [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion
   [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/classes
   [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/test
   [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/test/lib
   [mkdir] Created dir: /Users/hansbrende/nutch/build/plugins/anthelion

   init-plugin:

   deps-jar:

   init:

   init-plugin:

   clean-lib:

   resolve-default:
   [ivy:resolve] :: loading settings :: file = /Users/hansbrende/nutch/ivy/ivysettings.xml

   compile:

   jar:

   clean-lib:

   resolve-default:
   [ivy:resolve] :: loading settings :: file = /Users/hansbrende/nutch/ivy/ivysettings.xml

   compile:
   [echo] Compiling plugin: anthelion
   [javac] Compiling 34 source files to /Users/hansbrende/nutch/build/anthelion/classes
   [javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/classifier/NutchOnlineClassifier.java:35: error: cannot find symbol
   [javac] import moa.core.InstancesHeader;
   [javac] ^
   [javac]   symbol:   class InstancesHeader
   [javac]   location: package moa.core
   [javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/framework/AnthOnlineClassifier.java:33: error: cannot find symbol
   [javac] import moa.core.InstancesHeader;
   [javac] ^
   [javac]   symbol:   class InstancesHeader
   [javac]   location: package moa.core
   [javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/mao/DataManipulationFilter.java:19: error: cannot find symbol
   [javac] import moa.core.InstancesHeader;
   [javac] ^
   [javac]   symbol:   class InstancesHeader
   [javac]   location: package moa.core

   .
   .
   .

   [javac] 46 errors

   BUILD FAILED
   /Users/hansbrende/nutch/build.xml:116: The following error occurred while executing this line:
   /Users/hansbrende/nutch/src/plugin/build.xml:37: The following error occurred while executing this line:
   /Users/hansbrende/nutch/src/plugin/build-plugin.xml:133: Compile failed; see the compiler error output for details.
   
   
   Am I doing something wrong?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Integration of Anthelion (Focused Crawling Module) into Nutch
> -
>
> Key: NUTCH-2202
> URL: https://issues.apache.org/jira/browse/NUTCH-2202
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, scoring
>Reporter: Robert Meusel
>Assignee: Lewis John McGibbney
>Priority: Major
>  Labels: any23, online_learning
>
> We have recently released anthelion, which is a focused crawler plugin for 
> structured data which can be extracted with any23. 
> (https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John 
> McGibbney) we think the integration of the parser (any23) and the scoring 
> function based on the online learner could be a good improvement for nutch. 




[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2018-01-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341782#comment-16341782
 ] 

ASF GitHub Bot commented on NUTCH-2202:
---

lewismc commented on issue #97: NUTCH-2202 Integration of Anthelion (Focused 
Crawling Module) into Nutch
URL: https://github.com/apache/nutch/pull/97#issuecomment-360932424
 
 
   You can subscribe as described at http://any23.apache.org/mail-lists.html. 
Thank you for your support. 



[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2018-01-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341521#comment-16341521
 ] 

ASF GitHub Bot commented on NUTCH-2202:
---

HansBrende commented on issue #97: NUTCH-2202 Integration of Anthelion (Focused 
Crawling Module) into Nutch
URL: https://github.com/apache/nutch/pull/97#issuecomment-360889922
 
 
   @lewismc I tried to vote, but I'm not sure if it went through. It's possible 
that my e-mail address isn't allowed.



[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2018-01-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341487#comment-16341487
 ] 

ASF GitHub Bot commented on NUTCH-2202:
---

lewismc commented on issue #97: NUTCH-2202 Integration of Anthelion (Focused 
Crawling Module) into Nutch
URL: https://github.com/apache/nutch/pull/97#issuecomment-360883033
 
 
   @RobertMeusel @HansBrende this is ready to be tested. I would also 
appreciate it if folks were able to VOTE on the current [Any23 2.2 release 
candidate](https://s.apache.org/PM3x).
   Finally, I've resolved all conflicts, updated some licensing information, and 
removed binary documentation resources, instead hosting them on the [Nutch 
wiki](https://wiki.apache.org/nutch/Anthelion).



[Nutch Wiki] New attachment added to page Anthelion

2018-01-26 Thread Apache Wiki

Dear Wiki user,

You have subscribed to a wiki page "Anthelion" for change notification. An 
attachment has been added to that page by LewisJohnMcgibbney. Following 
detailed information is available:

Attachment name: Concept_Anthelion_v2.pdf
Attachment size: 324822
Attachment link: 
https://wiki.apache.org/nutch/Anthelion?action=AttachFile=get=Concept_Anthelion_v2.pdf
Page link: https://wiki.apache.org/nutch/Anthelion

[jira] [Commented] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph

2018-01-26 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341326#comment-16341326
 ] 

Lewis John McGibbney commented on NUTCH-2369:
-

Hi [~markus17], the idea here was to export full graph information into 
something that could be interpreted by [Tinkerpop|http://tinkerpop.apache.org] 
and queried using [Gremlin|https://tinkerpop.apache.org/gremlin.html].

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> --
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
>  Issue Type: Task
>  Components: crawldb, graphgenerator, hostdb, linkdb, segment, 
> storage, tool
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>  Labels: gsoc2017, gsoc2018
> Fix For: 1.15
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch 
> data out as full graph data would be an excellent addition to the codebase.
> My thoughts involve writing data using Tinkerpop's ScriptInputFormat and 
> ScriptOutputFormat to create Vertex objects representing Nutch Crawl 
> Records. 
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, the LinkDB, a 
> Segment and possibly the HostDB in order to be fully populated. Graph 
> characteristics, e.g. Edges, would come from those existing data structures 
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have 
> already talked offline with a potential student [~omkar20895] about him 
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this 
> could be a game changer for how Nutch Crawl data is interpreted. It is my 
> feeling that this issue most likely also involves an entire upgrade of the 
> Hadoop APIs from mapred to mapreduce for the master codebase.




[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

2018-01-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341220#comment-16341220
 ] 

ASF GitHub Bot commented on NUTCH-2375:
---

Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use 
org.apache.hadoop.mapreduce
URL: https://github.com/apache/nutch/pull/221#issuecomment-360831934
 
 
   @lewismc yes, the crawls were running fine with Hadoop-2.7.4. Everybody is 
welcome to test this PR out. Thanks. 




> Upgrade the code base from org.apache.hadoop.mapred to 
> org.apache.hadoop.mapreduce
> --
>
> Key: NUTCH-2375
> URL: https://issues.apache.org/jira/browse/NUTCH-2375
> Project: Nutch
>  Issue Type: Improvement
>  Components: deployment
>Affects Versions: 1.13
>Reporter: Omkar Reddy
>Priority: Major
> Fix For: 1.15
>
>
> Nutch is still using the org.apache.hadoop.mapred dependency, which has been 
> deprecated. It needs to be updated to the org.apache.hadoop.mapreduce 
> dependency. 




[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340744#comment-16340744
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164052993
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -192,7 +198,7 @@ fi
 
 # note that some of the options listed here could be set in the
 # corresponding hadoop site xml param file
-commonOptions="-D mapreduce.job.reduces=$NUM_TASKS -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true"
 
 Review comment:
   It's OK to remove the explicit `mapred.child.java.opts` so that the settings 
from environment variables are not overwritten in bin/nutch.




> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>





[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340743#comment-16340743
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164054172
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   Why should the heap size depend on the number of reducers? For a large-scale 
crawl the reducers will run independently on different nodes, possibly also 
sequentially if there are not enough computing resources available. Since 
mapred.child.java.opts is also used for the map tasks and it's often not 
possible to force a fixed number of map tasks, it's better to define the heap 
size per task (usually via mapreduce.map.java.opts and 
mapreduce.reduce.java.opts).
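
A minimal sketch of the per-task approach described above, assuming the crawl script keeps building a `commonOptions` string. The helper `heap_options`, the variables `MAP_HEAP_MB` and `REDUCE_HEAP_MB`, and the values shown are hypothetical and not part of the pull request. Only the two Hadoop property names come from the comment itself.

```sh
#!/bin/sh
# Hypothetical settings; the values are examples only.
NUM_TASKS=2
MAP_HEAP_MB=1000
REDUCE_HEAP_MB=2000

# Build -D options that set the JVM heap separately for map and reduce
# tasks, independent of how many tasks the job ends up running.
heap_options() {
  printf '%s' "-D mapreduce.map.java.opts=-Xmx${1}m -D mapreduce.reduce.java.opts=-Xmx${2}m"
}

commonOptions="-D mapreduce.job.reduces=$NUM_TASKS $(heap_options "$MAP_HEAP_MB" "$REDUCE_HEAP_MB")"
echo "$commonOptions"
```

Because the heap is set per task, changing `NUM_TASKS` no longer changes the memory each task receives.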



[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340745#comment-16340745
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164053621
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -105,6 +105,10 @@ SIZE_FETCHLIST=5 # 25K x NUM_TASKS
 TIME_LIMIT_FETCH=180
 NUM_THREADS=50
 SITEMAPS_FROM_HOSTDB_FREQUENCY=never
+NUTCH_HEAP_MB=2000
 
 Review comment:
   bin/nutch already allows overriding the Java heap size via the environment 
variable 
[NUTCH_HEAPSIZE](https://github.com/apache/nutch/blob/e533ab21b18cf81a49e052185562a7e6489ec4d6/src/bin/nutch#L24).
 Wouldn't it be simpler to set the environment variable and let bin/nutch add 
the `-D...` option?
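
The environment-variable approach can be sketched as follows. This is an illustration, not a copy of bin/nutch: the helper name `heap_flag` is hypothetical, and it assumes `NUTCH_HEAPSIZE` holds a heap size in megabytes. The fallback of 1000 mirrors the `-Xmx1000m` value seen in the crawl script diff.

```sh
#!/bin/sh
# Hypothetical helper: turn NUTCH_HEAPSIZE (assumed to be in MB) into a
# JVM -Xmx flag, falling back to a default when it is unset or empty.
heap_flag() {
  default_mb="$1"
  printf '%s' "-Xmx${NUTCH_HEAPSIZE:-$default_mb}m"
}

echo "heap flag: $(heap_flag 1000)"
```

A caller would then set the variable once in the environment, for example `NUTCH_HEAPSIZE=4000`, instead of editing a value hard-coded in the script.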



[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340742#comment-16340742
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164055125
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -105,6 +105,10 @@ SIZE_FETCHLIST=5 # 25K x NUM_TASKS
 TIME_LIMIT_FETCH=180
 NUM_THREADS=50
 SITEMAPS_FROM_HOSTDB_FREQUENCY=never
+NUTCH_HEAP_MB=2000
 
 Review comment:
   I've just seen that NUTCH_HEAPSIZE (and also NUTCH_OPTS) isn't used by 
bin/nutch in distributed mode 
([L326](https://github.com/apache/nutch/blob/e533ab21b18cf81a49e052185562a7e6489ec4d6/src/bin/nutch#L326)).
 If this was/is the problem, I would also fix it in bin/nutch.

