[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276667#comment-17276667 ]

Hudson commented on NUTCH-1403:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #24 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/24/])

fix for NUTCH-1403 contributed by aalbahem (ameer.albahem: [https://github.com/apache/nutch/commit/598bbc40a3d3438233813b607cb031a6bb0a2f84])
* (add) src/plugin/scoring-metadata/pom.xml
* (add) src/plugin/scoring-metadata/plugin.xml
* (add) src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
* (add) src/plugin/scoring-metadata/build.xml
* (add) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html
* (add) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/MetadataScoringFilter.java
* (add) src/plugin/scoring-metadata/ivy.xml
* (edit) build.xml
* (edit) src/plugin/build.xml

Improve fix for NUTCH-1403 (ameer.albahem: [https://github.com/apache/nutch/commit/cdb6b52b02958385497804ef7cd6a6b646616208])
* (edit) default.properties
* (delete) src/plugin/scoring-metadata/pom.xml
* (edit) conf/nutch-default.xml
* (delete) src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
* (edit) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html
* (add) src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/TestMetadataScoringFilter.java
* (edit) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/MetadataScoringFilter.java

Improve NUTCH-1403, add ASLv2 header (ameer.albahem: [https://github.com/apache/nutch/commit/93aa2ab41097511f3afe8d34c9c13cafd735cec9])
* (edit) src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/TestMetadataScoringFilter.java
* (edit) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html

> Add default ScoringFilter for manipulating metadata
>
> Key: NUTCH-1403
> URL: https://issues.apache.org/jira/browse/NUTCH-1403
> Project: Nutch
> Issue Type: Improvement
> Reporter: Julien Nioche
> Priority: Major
> Fix For: 1.19
>
> This is currently done by the urlmeta plugin, which has too vague a name and
> a redundant indexing filter now that we have the index-metadata plugin. This
> scoring filter would help defining which metadata to pass from :
> - the crawl metadata to the content metadata
> - the content metadata to the parse metadata
> - the parse metadata to the crawldatum for the outlinks
> I'd make this scoring filter available by default i.e. not in a separate
> plugin as its functionalities are commonly used.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
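The three propagation steps listed in the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the plugin's code: plain `java.util` maps stand in for the metadata held by CrawlDatum, Content and ParseData, and the class name `MetadataPropagationSketch`, the helper `copySelected` and the sample keys are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: plain maps stand in for Nutch's metadata containers. */
public class MetadataPropagationSketch {

  /** Copies only the configured keys from source to target; absent keys are skipped. */
  public static void copySelected(Map<String, String> source,
                                  Map<String, String> target,
                                  List<String> keys) {
    for (String key : keys) {
      String value = source.get(key);
      if (value != null) {
        target.put(key, value);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, String> crawlMeta = new LinkedHashMap<>();
    crawlMeta.put("category", "news");   // hypothetical metadata attached to a URL
    crawlMeta.put("internal", "secret"); // not configured, so it should not travel further

    // Hypothetical list of keys selected for propagation.
    List<String> keys = Arrays.asList("category");

    Map<String, String> contentMeta = new LinkedHashMap<>();
    Map<String, String> parseMeta = new LinkedHashMap<>();
    Map<String, String> outlinkMeta = new LinkedHashMap<>();

    copySelected(crawlMeta, contentMeta, keys); // step 1: crawl metadata -> content metadata
    copySelected(contentMeta, parseMeta, keys); // step 2: content metadata -> parse metadata
    copySelected(parseMeta, outlinkMeta, keys); // step 3: parse metadata -> CrawlDatum of outlinks

    System.out.println(outlinkMeta);
  }
}
```

In this sketch a single key list drives all three steps for brevity; a real filter might well use a separate list per step.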
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276633#comment-17276633 ]

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel merged pull request #458:
URL: https://github.com/apache/nutch/pull/458

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Assignee: Sebastian Nagel
> Fix For: 1.19
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276632#comment-17276632 ]

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458:
URL: https://github.com/apache/nutch/pull/458#issuecomment-771132058

+1
Excellent, @aalbahem!

> Assignee: Sebastian Nagel
> Fix For: 1.19
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273250#comment-17273250 ]

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on pull request #458:
URL: https://github.com/apache/nutch/pull/458#issuecomment-768705589

Anyone else able to review? @sebastian-nagel are you able to revisit your review?

> Assignee: Sebastian Nagel
> Fix For: 1.19
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263621#comment-17263621 ]

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on pull request #458:
URL: https://github.com/apache/nutch/pull/458#issuecomment-758877653

@aalbahem please see my comments

> Assignee: Sebastian Nagel
> Fix For: 1.18
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263620#comment-17263620 ]

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r556012181

## File path: src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html ##
@@ -0,0 +1,17 @@
+

Review comment:
Please include ASLv2 header

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/TestMetadataScoringFilter.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.scoring.ScoringFilterException;
+import org.apache.nutch.util.NutchConfiguration;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.util.HashMap;
+
+public class TestMetadataScoringFilter {
+
+  @Before

Review comment:
This is redundant and can be removed.

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/TestMetadataScoringFilter.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
Please include ASLv2 header

> Assignee: Sebastian Nagel
> Fix For: 1.18
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262888#comment-17262888 ]

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r555283893

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
Excellent @aalbahem I'll begin reviewing. Thank you

> Assignee: Sebastian Nagel
> Fix For: 1.18
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262312#comment-17262312 ]

ASF GitHub Bot commented on NUTCH-1403:
---

aalbahem commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r554635298

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
@lewismc I have addressed the comments. Let me know if you need anything else.

> Assignee: Sebastian Nagel
> Fix For: 1.18
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261001#comment-17261001 ]

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r553744204

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
Wow that was quick. All the best with your Ph.D. defense.

> Assignee: Sebastian Nagel
> Fix For: 1.18
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260995#comment-17260995 ]

ASF GitHub Bot commented on NUTCH-1403:
---

aalbahem commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r553741862

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
Yes, I can fix them. I happened to be too busy with my PhD thesis. I have more free time now. I can do that.

> Assignee: Sebastian Nagel
> Fix For: 1.18
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260992#comment-17260992 ]

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r553739368

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
@aalbahem are you able to update this patch? It would be a shame for it to sit forever. Great work, I can review it if you can address @sebastian-nagel's comments.

> Assignee: Sebastian Nagel
> Fix For: 1.18
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941869#comment-16941869 ]

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330068083

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
Great! I'll move NUTCH-1403 to the Nutch 1.17 tasks, as I hope to get the 1.16 release candidate ready today or tomorrow. No hurry!

> Assignee: Sebastian Nagel
> Fix For: 1.16
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941851#comment-16941851 ]

ASF GitHub Bot commented on NUTCH-1403:
---

aalbahem commented on pull request #458: fix for NUTCH-1403 contributed by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330059885

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
Thanks, Sebastian, I will address your comments. Do you have any time frame when this should be done? If you plan to merge this into a release, then I can try to get it done by this weekend.

> Assignee: Sebastian Nagel
> Fix For: 1.16
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941784#comment-16941784 ]

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330030804

## File path: src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java ##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
Nice to have a test! As named, though, it isn't executed when running ant: test classes must be named "Test*.java". To run only the scoring-metadata unit tests:
```
ant test-plugin -Dplugin=scoring-metadata
```

> Assignee: Sebastian Nagel
> Fix For: 1.16
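The naming rule from the review comment above can be illustrated with a small, self-contained check. This is a sketch only: it uses a standard-library glob matcher as an approximation of the "Test*.java" rule and does not reproduce how the Ant build actually selects test classes. The class name `NamePatternSketch` and the method `isPickedUp` are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Illustrative only: approximates the "Test*.java" naming rule with a glob matcher. */
public class NamePatternSketch {

  // Glob taken from the review comment; assumed here to apply to the bare file name.
  private static final PathMatcher TEST_FILES =
      FileSystems.getDefault().getPathMatcher("glob:Test*.java");

  /** Returns true if the file name starts with "Test" and ends with ".java". */
  public static boolean isPickedUp(String fileName) {
    return TEST_FILES.matches(Paths.get(fileName));
  }

  public static void main(String[] args) {
    // Original name from the pull request: does not match the pattern.
    System.out.println(isPickedUp("MetadataScoringFilterTest.java"));
    // Name after the rename: matches the pattern.
    System.out.println(isPickedUp("TestMetadataScoringFilter.java"));
  }
}
```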
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941787#comment-16941787 ]

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330025322

## File path: src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html ##
@@ -0,0 +1,10 @@
+
+
+
+ Metadata Scoring Plugin
+
+
+ Propagates Meta data, injected from in db, parse data or content from one stage to another.

Review comment:
Could be more detailed, cf. [urlmeta package description](/apache/nutch/blob/master/src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/package.html), at least the 3 steps (resp. 4 containers) should be in the right order: CrawlDb --> content --> parse data --> outlinks

> Assignee: Sebastian Nagel
> Fix For: 1.16
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941786#comment-16941786 ]

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330020972

## File path: src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/MetadataScoringFilter.java ##
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.metadata;
+
+import java.util.Collection;
+import java.util.Map.Entry;
+import java.util.Iterator;
+import java.util.List;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.scoring.ScoringFilter;
+import org.apache.nutch.scoring.ScoringFilterException;
+
+
+/**
+ * For documentation:
+ *
+ * {@link org.apache.nutch.scoring.metadata}
+ */
+public class MetadataScoringFilter extends Configured implements ScoringFilter {

Review comment:
Could inherit from [AbstractScoringFilter](/apache/nutch/blob/master/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java) which already implements the boilerplate methods.

> Assignee: Sebastian Nagel
> Fix For: 1.16
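The suggestion to inherit from AbstractScoringFilter follows a common pattern: an abstract base class supplies neutral defaults for every interface method, so a concrete filter overrides only the hooks it needs. The sketch below shows that pattern in isolation. It is an assumption-laden illustration: the interface `SimpleScoringFilter`, the base class `AbstractSimpleScoringFilter` and the filter `CopyBeforeParsing` are hypothetical, the method names merely echo scoring-filter terminology, and the signatures are simplified to plain maps rather than the real Nutch types.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: the abstract-base-class pattern with simplified, hypothetical types. */
public class AbstractFilterSketch {

  /** Hypothetical, heavily simplified stand-in for a scoring filter interface. */
  public interface SimpleScoringFilter {
    void passScoreBeforeParsing(Map<String, String> datumMeta, Map<String, String> contentMeta);
    void passScoreAfterParsing(Map<String, String> contentMeta, Map<String, String> parseMeta);
    float generatorSortValue(float initSort);
  }

  /** Base class with neutral defaults, so subclasses override only what they need. */
  public abstract static class AbstractSimpleScoringFilter implements SimpleScoringFilter {
    @Override
    public void passScoreBeforeParsing(Map<String, String> datumMeta,
                                       Map<String, String> contentMeta) {
      // no-op by default
    }

    @Override
    public void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
      // no-op by default
    }

    @Override
    public float generatorSortValue(float initSort) {
      return initSort; // leaves the sort value unchanged
    }
  }

  /** Concrete filter that overrides a single hook and inherits the rest. */
  public static class CopyBeforeParsing extends AbstractSimpleScoringFilter {
    @Override
    public void passScoreBeforeParsing(Map<String, String> datumMeta,
                                       Map<String, String> contentMeta) {
      contentMeta.putAll(datumMeta);
    }
  }

  public static void main(String[] args) {
    Map<String, String> datumMeta = new HashMap<>();
    datumMeta.put("category", "news"); // hypothetical key and value
    Map<String, String> contentMeta = new HashMap<>();

    SimpleScoringFilter filter = new CopyBeforeParsing();
    filter.passScoreBeforeParsing(datumMeta, contentMeta);
    System.out.println(contentMeta);
  }
}
```

The benefit is mostly in maintenance: if the interface gains a method, only the base class needs a new default, and filters like the one above keep compiling.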
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941785#comment-16941785 ] ASF GitHub Bot commented on NUTCH-1403: --- sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed by aalbahem URL: https://github.com/apache/nutch/pull/458#discussion_r330021393 ## File path: src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/MetadataScoringFilter.java ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.nutch.scoring.metadata; + +import java.util.Collection; +import java.util.Map.Entry; +import java.util.Iterator; +import java.util.List; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.conf.Configured; +import org.apache.hadoop.io.Text; +import org.apache.nutch.crawl.CrawlDatum; +import org.apache.nutch.crawl.Inlinks; +import org.apache.nutch.indexer.NutchDocument; +import org.apache.nutch.parse.Parse; +import org.apache.nutch.parse.ParseData; +import org.apache.nutch.protocol.Content; +import org.apache.nutch.scoring.ScoringFilter; +import org.apache.nutch.scoring.ScoringFilterException; + + +/** + * For documentation: + * + * {@link org.apache.nutch.scoring.metadata} + */ +public class MetadataScoringFilter extends Configured implements ScoringFilter { + + public static final String METADATA_DATUM = "scoring.db.md"; Review comment: The three new properties should be "documented" in `conf/nutch-default.xml`.
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941788#comment-16941788 ] ASF GitHub Bot commented on NUTCH-1403: --- sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed by aalbahem URL: https://github.com/apache/nutch/pull/458#discussion_r330019665 ## File path: src/plugin/scoring-metadata/pom.xml ## @@ -0,0 +1,38 @@ + Review comment: Why a pom.xml? Nutch is still built using Ant. The file shouldn't be included, as it only confuses users, who then try to run "mvn" and wonder why the compilation fails or the packages are not bundled properly.
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875817#comment-16875817 ] ASF GitHub Bot commented on NUTCH-1403: --- aalbahem commented on pull request #458: fix for NUTCH-1403 contributed by aalbahem URL: https://github.com/apache/nutch/pull/458
The following was done:
1. Copied much of urlmeta to a new scoring plugin (scoring-metadata)
2. Introduced **scoring.db.md** to copy metadata from a crawl datum and inject it into the content
3. Introduced **scoring.content.md** to copy metadata from the content and inject it into the parse data
4. Introduced **scoring.parse.md** to copy metadata from the parse data to the crawl datums of the outlinks, as db metadata
5. Changed the necessary build files to include the plugin in the clean, test and deploy tasks
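The three-stage flow described in the pull request comment above can be sketched without any Nutch classes. This is a rough illustration under stated assumptions, not the plugin's code: metadata containers are modelled as plain `Map<String, String>`, the key names (`seed.topic`, `Content-Language`, `lang`) are made up, and each list stands in for the value of one of the three properties.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MetadataFlowSketch {

  /** Copies only the listed keys from source to target; keys absent from source are skipped. */
  static void copyKeys(List<String> keys, Map<String, String> source, Map<String, String> target) {
    for (String key : keys) {
      String value = source.get(key);
      if (value != null) {
        target.put(key, value);
      }
    }
  }

  public static void main(String[] args) {
    // Hypothetical key lists, one per property.
    List<String> dbKeys = Arrays.asList("seed.topic");            // stands in for scoring.db.md
    List<String> contentKeys = Arrays.asList("Content-Language"); // stands in for scoring.content.md
    List<String> parseKeys = Arrays.asList("lang");               // stands in for scoring.parse.md

    Map<String, String> crawlDatumMeta = new HashMap<>();
    crawlDatumMeta.put("seed.topic", "news");
    crawlDatumMeta.put("unlisted", "stays behind");

    Map<String, String> contentMeta = new HashMap<>();
    contentMeta.put("Content-Language", "en");

    Map<String, String> parseMeta = new HashMap<>();
    parseMeta.put("lang", "en");

    Map<String, String> outlinkDatumMeta = new HashMap<>();

    copyKeys(dbKeys, crawlDatumMeta, contentMeta);    // crawl datum -> content
    copyKeys(contentKeys, contentMeta, parseMeta);    // content -> parse data
    copyKeys(parseKeys, parseMeta, outlinkDatumMeta); // parse data -> outlink crawl datum

    System.out.println("content: " + contentMeta);
    System.out.println("parse:   " + parseMeta);
    System.out.println("outlink: " + outlinkDatumMeta);
  }
}
```

Keeping one list per stage means a key travels exactly as far as the lists that name it; the `unlisted` key above never leaves the crawl datum.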
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875815#comment-16875815 ] ASF GitHub Bot commented on NUTCH-1403: --- aalbahem commented on pull request #458: fix for NUTCH-1403 contributed by aalbahem
The following was done:
1. Copied much of urlmeta to a new scoring plugin (scoring-metadata)
2. Introduced **scoring.db.md** to copy metadata from a crawl datum and inject it into the content
3. Introduced **scoring.content.md** to copy metadata from the content and inject it into the parse data
4. Introduced **scoring.parse.md** to copy metadata from the parse data to the crawl datums of the outlinks, as db metadata
5. Changed the necessary build files to include the plugin in the clean, test and deploy tasks
URL: https://github.com/apache/nutch/pull/458
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151545#comment-14151545 ] Ameer Tawfik Albaham commented on NUTCH-1403: - Good. I also discarded the fourth one, as I reached the same conclusion you just mentioned. As I said, I have already implemented it. I will be able to finish the whole thing and prepare a patch by this weekend.
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151537#comment-14151537 ] Julien Nioche commented on NUTCH-1403: -- You are right, the urlmeta plugin currently does not distinguish between the different levels and uses a single config 'urlmeta.tags'. Having 3 separate configs is probably a good idea. Not sure the 4th one you suggested is really necessary ("a list of meta data to pass from crawl datum to content to parse data") as it is the same as using the same values for the first 2. Having it would probably generate some confusion. Thanks
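The argument above, that a fourth "crawl datum to content to parse data" list is redundant, can be checked with a small self-contained sketch. It is a simplification under stated assumptions: metadata is modelled as plain maps, the content metadata starts out empty, and all names (`FourthListSketch`, `select`, the `topic` key) are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class FourthListSketch {

  /** Returns a new map holding only the entries of source whose key is listed. */
  static Map<String, String> select(Set<String> keys, Map<String, String> source) {
    Map<String, String> selected = new TreeMap<>();
    for (String key : keys) {
      if (source.containsKey(key)) {
        selected.put(key, source.get(key));
      }
    }
    return selected;
  }

  /** Two-list setup: crawl datum -> content, then content -> parse data. */
  static Map<String, String> viaTwoLists(Set<String> dbKeys, Set<String> contentKeys,
      Map<String, String> datumMeta) {
    Map<String, String> contentMeta = select(dbKeys, datumMeta);
    return select(contentKeys, contentMeta);
  }

  /** The proposed fourth list: crawl datum straight through to parse data. */
  static Map<String, String> viaFourthList(Set<String> passThroughKeys,
      Map<String, String> datumMeta) {
    return select(passThroughKeys, datumMeta);
  }

  public static void main(String[] args) {
    Map<String, String> datumMeta = new HashMap<>();
    datumMeta.put("topic", "news");
    datumMeta.put("depth", "3");

    Set<String> keys = Collections.singleton("topic");

    // Using the same keys in the first two lists gives the same result as a fourth list.
    System.out.println(viaTwoLists(keys, keys, datumMeta).equals(viaFourthList(keys, datumMeta)));
  }
}
```

Under these assumptions, naming a key in both of the first two lists carries it from the crawl datum to the parse data, which is all the fourth list would have added.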
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151515#comment-14151515 ] Ameer Tawfik Albaham commented on NUTCH-1403: - I have implemented it. I am currently writing unit tests. However, I made a separate list for each of the three cases. In my work, I encountered cases where I needed to pass information only from crawl datum to content, or only from parse data to outlink. So, I felt the separation was necessary. Do you have any thoughts regarding that?
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151487#comment-14151487 ] Julien Nioche commented on NUTCH-1403: -- Hi Ameer, the idea here is simply to move the handling of metadata away from the urlmeta plugin to the core of the code and get rid of urlmeta.
[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150747#comment-14150747 ] Ameer Tawfik Albaham commented on NUTCH-1403: - I would like to contribute to this issue. However, I am not quite sure what else is required in addition to what *urlmeta* is doing. Does this plugin have a different list of meta data for each of these cases?
- a list of meta data to pass from crawl datum to content.
- a list of meta data to pass from content to parse data.
- a list of meta data to pass from parse data to outlinks.
- a list of meta data to pass from crawl datum to content to parse data.