[jira] [Created] (NUTCH-2505) nutch does not delete the .locked file, when the generator partition got an exception
Ajoy Lian created NUTCH-2505: Summary: nutch does not delete the .locked file, when the generator partition got an exception Key: NUTCH-2505 URL: https://issues.apache.org/jira/browse/NUTCH-2505 Project: Nutch Issue Type: Bug Components: generator Reporter: Ajoy Lian nutch does not delete the .locked file when the generator partition got an exception. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341852#comment-16341852 ] ASF GitHub Bot commented on NUTCH-2202: --- HansBrende commented on issue #97: NUTCH-2202 Integration of Anthelion (Focused Crawling Module) into Nutch URL: https://github.com/apache/nutch/pull/97#issuecomment-360942425 @lewismc I can't get your NUTCH-2202 branch to build. I'm doing: ``` git clone https://github.com/lewismc/nutch cd nutch git checkout NUTCH-2202 ant ``` which is giving me: Buildfile: /Users/hansbrende/nutch/build.xml Trying to override old definition of task javac [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found. ivy-probe-antlib: ivy-download: [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found. . . . init: [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/classes [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/test [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/test/lib [mkdir] Created dir: /Users/hansbrende/nutch/build/plugins/anthelion init-plugin: deps-jar: init: init-plugin: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /Users/hansbrende/nutch/ivy/ivysettings.xml compile: jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /Users/hansbrende/nutch/ivy/ivysettings.xml compile: [echo] Compiling plugin: anthelion [javac] Compiling 34 source files to /Users/hansbrende/nutch/build/anthelion/classes [javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/classifier/NutchOnlineClassifier.java:35: error: cannot find symbol [javac] import moa.core.InstancesHeader; [javac]^ [javac] symbol: class InstancesHeader [javac] location: package moa.core [javac] 
/Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/framework/AnthOnlineClassifier.java:33: error: cannot find symbol [javac] import moa.core.InstancesHeader; [javac]^ [javac] symbol: class InstancesHeader [javac] location: package moa.core [javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/mao/DataManipulationFilter.java:19: error: cannot find symbol [javac] import moa.core.InstancesHeader; [javac]^ [javac] symbol: class InstancesHeader [javac] location: package moa.core . . . [javac] 46 errors BUILD FAILED /Users/hansbrende/nutch/build.xml:116: The following error occurred while executing this line: /Users/hansbrende/nutch/src/plugin/build.xml:37: The following error occurred while executing this line: /Users/hansbrende/nutch/src/plugin/build-plugin.xml:133: Compile failed; see the compiler error output for details. Am I doing something wrong? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Integration of Anthelion (Focused Crawling Module) into Nutch > - > > Key: NUTCH-2202 > URL: https://issues.apache.org/jira/browse/NUTCH-2202 > Project: Nutch > Issue Type: Improvement > Components: parser, scoring >Reporter: Robert Meusel >Assignee: Lewis John McGibbney >Priority: Major > Labels: any23, online_learning > > We have recently released anthelion, which is a focused crawler plugin for > structured data which can be extracted with any23. > (https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John > McGibbney) we think the integration of the parser (any23) and the scoring > function based on the online learner could be a good improvement for nutch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341782#comment-16341782 ] ASF GitHub Bot commented on NUTCH-2202: --- lewismc commented on issue #97: NUTCH-2202 Integration of Anthelion (Focused Crawling Module) into Nutch URL: https://github.com/apache/nutch/pull/97#issuecomment-360932424 You can subscribe as described at http://any23.apache.org/mail-lists.html. Thank you for your support.
[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341521#comment-16341521 ] ASF GitHub Bot commented on NUTCH-2202: --- HansBrende commented on issue #97: NUTCH-2202 Integration of Anthelion (Focused Crawling Module) into Nutch URL: https://github.com/apache/nutch/pull/97#issuecomment-360889922 @lewismc I tried to vote, but I'm not sure if it went through. It's possible that my e-mail address isn't allowed.
[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341487#comment-16341487 ] ASF GitHub Bot commented on NUTCH-2202: --- lewismc commented on issue #97: NUTCH-2202 Integration of Anthelion (Focused Crawling Module) into Nutch URL: https://github.com/apache/nutch/pull/97#issuecomment-360883033 @RobertMeusel @HansBrende this is ready to be tested. I would also appreciate it if folks were able to VOTE on the current [Any23 2.2 release candidate](https://s.apache.org/PM3x). Finally, I've resolved all conflicts, updated some licensing information and removed binary documentation resources, instead hosting them on the [Nutch wiki](https://wiki.apache.org/nutch/Anthelion).
[Nutch Wiki] New attachment added to page Anthelion
Dear Wiki user, You have subscribed to a wiki page "Anthelion" for change notification. An attachment has been added to that page by LewisJohnMcgibbney. The following detailed information is available: Attachment name: Concept_Anthelion_v2.pdf Attachment size: 324822 Attachment link: https://wiki.apache.org/nutch/Anthelion?action=AttachFile=get=Concept_Anthelion_v2.pdf Page link: https://wiki.apache.org/nutch/Anthelion
[jira] [Commented] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
[ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341326#comment-16341326 ] Lewis John McGibbney commented on NUTCH-2369: - Hi [~markus17] the idea here was to export full graph information into something that could be interpreted by [Tinkerpop|http://tinkerpop.apache.org] and queried using [Gremlin|https://tinkerpop.apache.org/gremlin.html]. > Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph > -- > > Key: NUTCH-2369 > URL: https://issues.apache.org/jira/browse/NUTCH-2369 > Project: Nutch > Issue Type: Task > Components: crawldb, graphgenerator, hostdb, linkdb, segment, > storage, tool >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Labels: gsoc2017, gsoc2018 > Fix For: 1.15 > > > I've been thinking for quite some time now that a new Tool which writes Nutch > data out as full graph data would be an excellent addition to the codebase. > My thoughts involve writing data using Tinkerpop's ScriptInputFormat and > ScriptOutputFormat to create Vertex objects representing Nutch Crawl > Records. > http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html > http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html > I envisage that each Vertex object would require the CrawlDB, LinkDB, a > Segment and possibly the HostDB in order to be fully populated. Graph > characteristics, e.g. Edges, would come from those existing data structures > as well. > It is my intention to propose this as a GSoC project for 2017 and I have > already talked offline with a potential student [~omkar20895] about him > participating as the student. > Essentially, if we were able to create a Graph enabling true traversal, this > could be a game changer for how Nutch Crawl data is interpreted. It is my > feeling that this issue most likely also involves an entire upgrade of the > Hadoop APIs from mapred to mapreduce for the master codebase.
[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341220#comment-16341220 ] ASF GitHub Bot commented on NUTCH-2375: --- Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce URL: https://github.com/apache/nutch/pull/221#issuecomment-360831934 @lewismc yes, the crawls were running fine with Hadoop-2.7.4. Everybody is welcome to test this PR out. Thanks. > Upgrade the code base from org.apache.hadoop.mapred to > org.apache.hadoop.mapreduce > -- > > Key: NUTCH-2375 > URL: https://issues.apache.org/jira/browse/NUTCH-2375 > Project: Nutch > Issue Type: Improvement > Components: deployment >Affects Versions: 1.13 >Reporter: Omkar Reddy >Priority: Major > Fix For: 1.15 > > > Nutch is still using the deprecated org.apache.hadoop.mapred dependency. > It needs to be updated to the org.apache.hadoop.mapreduce dependency.
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340744#comment-16340744 ] ASF GitHub Bot commented on NUTCH-2501: --- sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script URL: https://github.com/apache/nutch/pull/279#discussion_r164052993 ## File path: src/bin/crawl ## @@ -192,7 +198,7 @@ fi # note that some of the options listed here could be set in the # corresponding hadoop site xml param file -commonOptions="-D mapreduce.job.reduces=$NUM_TASKS -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true" Review comment: Ok, to remove the explicit `mapred.child.java.opts` so that the settings from environment variables are not overwritten in bin/nutch This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Take into account $NUTCH_HEAPSIZE when crawling using crawl script > -- > > Key: NUTCH-2501 > URL: https://issues.apache.org/jira/browse/NUTCH-2501 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Lewis John McGibbney >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340743#comment-16340743 ] ASF GitHub Bot commented on NUTCH-2501: --- sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script URL: https://github.com/apache/nutch/pull/279#discussion_r164054172 ## File path: src/bin/crawl ## @@ -171,6 +175,8 @@ fi CRAWL_PATH="$1" LIMIT="$2" +JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"` Review comment: Why should the heap size depend on the number of reducers? For a large-scale crawl the reducers will run independently on different nodes, possibly also sequentially if there are not enough computing resources available. Since mapred.child.java.opts is also used for the map tasks and it's often not possible to force a fixed number of map tasks, it's better to define the heap size per task (usually via mapreduce.map.java.opts and mapreduce.reduce.java.opts).
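The per-task approach suggested in this review comment could look roughly like the sketch below, where the crawl script passes separate heap limits for map and reduce tasks instead of deriving one value from the number of reducers. The variable names `MAP_HEAP_MB` and `REDUCE_HEAP_MB` and their values are assumptions made for illustration and are not part of the pull request; `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts` are the Hadoop properties named in the comment.

```shell
#!/bin/sh
# Sketch only: MAP_HEAP_MB and REDUCE_HEAP_MB are hypothetical names and values,
# not variables defined by bin/crawl.
NUM_TASKS=2
MAP_HEAP_MB=1000
REDUCE_HEAP_MB=1500

# The heap limit is set per task type and does not depend on NUM_TASKS.
# Inside double quotes, backslash-newline is a line continuation, so the
# resulting value is a single line of options.
commonOptions="-D mapreduce.job.reduces=$NUM_TASKS \
-D mapreduce.map.java.opts=-Xmx${MAP_HEAP_MB}m \
-D mapreduce.reduce.java.opts=-Xmx${REDUCE_HEAP_MB}m \
-D mapreduce.reduce.speculative=false \
-D mapreduce.map.speculative=false \
-D mapreduce.map.output.compress=true"

echo "$commonOptions"
```

With this shape, changing the number of reducers leaves the memory available to each task untouched, which appears to be the point of the objection.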
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340745#comment-16340745 ] ASF GitHub Bot commented on NUTCH-2501: --- sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script URL: https://github.com/apache/nutch/pull/279#discussion_r164053621 ## File path: src/bin/crawl ## @@ -105,6 +105,10 @@ SIZE_FETCHLIST=5 # 25K x NUM_TASKS TIME_LIMIT_FETCH=180 NUM_THREADS=50 SITEMAPS_FROM_HOSTDB_FREQUENCY=never +NUTCH_HEAP_MB=2000 Review comment: bin/nutch already allows overriding the Java heap size via the environment variable [NUTCH_HEAPSIZE](https://github.com/apache/nutch/blob/e533ab21b18cf81a49e052185562a7e6489ec4d6/src/bin/nutch#L24). Wouldn't it be simpler to set the environment variable and let bin/nutch add the `-D...` option?
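A minimal sketch of the idea in this comment, assuming `NUTCH_HEAPSIZE` is given in megabytes: the caller exports the variable once and a wrapper script derives the JVM flag from it. The helper name `heap_flag` and the default of 1000 MB are hypothetical and are not taken from bin/nutch.

```shell
#!/bin/sh
# Sketch only: heap_flag is a hypothetical helper, not a function in bin/nutch.
# Assumes NUTCH_HEAPSIZE, when set, is a heap size in megabytes.
heap_flag() {
  default_mb="$1"
  if [ -n "${NUTCH_HEAPSIZE:-}" ]; then
    # The environment variable wins over the script default.
    echo "-Xmx${NUTCH_HEAPSIZE}m"
  else
    echo "-Xmx${default_mb}m"
  fi
}

# The caller sets the environment variable once; the wrapper derives the flag.
NUTCH_HEAPSIZE=2000
export NUTCH_HEAPSIZE
heap_flag 1000   # prints -Xmx2000m
```

Keeping the translation in one wrapper means the crawl script would not need its own heap variable, which seems to be the simplification the reviewer has in mind.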
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340742#comment-16340742 ] ASF GitHub Bot commented on NUTCH-2501: --- sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script URL: https://github.com/apache/nutch/pull/279#discussion_r164055125 ## File path: src/bin/crawl ## @@ -105,6 +105,10 @@ SIZE_FETCHLIST=5 # 25K x NUM_TASKS TIME_LIMIT_FETCH=180 NUM_THREADS=50 SITEMAPS_FROM_HOSTDB_FREQUENCY=never +NUTCH_HEAP_MB=2000 Review comment: I've just seen that NUTCH_HEAPSIZE (and also NUTCH_OPTS) isn't used by bin/nutch in distributed mode ([L326](https://github.com/apache/nutch/blob/e533ab21b18cf81a49e052185562a7e6489ec4d6/src/bin/nutch#L326)). If this was/is the problem, I would also fix it in bin/nutch.
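One possible shape of such a fix is sketched below. This is an assumption about how the variable could be honored in distributed mode, not the change that was actually made to bin/nutch. The function `job_heap_opts` and its `local`/`distributed` argument are hypothetical; only the `mapreduce.*.java.opts` property names come from the review discussion.

```shell
#!/bin/sh
# Sketch only: job_heap_opts and its mode argument are hypothetical and do not
# reproduce the logic of bin/nutch.
job_heap_opts() {
  mode="$1"
  if [ -z "${NUTCH_HEAPSIZE:-}" ]; then
    return 0   # nothing requested: leave the cluster or JVM defaults alone
  fi
  if [ "$mode" = "distributed" ]; then
    # In distributed mode the tasks run in separate JVMs on the cluster,
    # so the limit is passed as job properties.
    echo "-D mapreduce.map.java.opts=-Xmx${NUTCH_HEAPSIZE}m -D mapreduce.reduce.java.opts=-Xmx${NUTCH_HEAPSIZE}m"
  else
    # In local mode everything runs in one JVM, so a plain -Xmx flag suffices.
    echo "-Xmx${NUTCH_HEAPSIZE}m"
  fi
}

NUTCH_HEAPSIZE=1500
job_heap_opts distributed
```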