[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-02-01 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276667#comment-17276667
 ] 

Hudson commented on NUTCH-1403:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #24 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/24/])
fix for NUTCH-1403 contributed by aalbahem (ameer.albahem: 
[https://github.com/apache/nutch/commit/598bbc40a3d3438233813b607cb031a6bb0a2f84])
* (add) src/plugin/scoring-metadata/pom.xml
* (add) src/plugin/scoring-metadata/plugin.xml
* (add) src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
* (add) src/plugin/scoring-metadata/build.xml
* (add) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html
* (add) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/MetadataScoringFilter.java
* (add) src/plugin/scoring-metadata/ivy.xml
* (edit) build.xml
* (edit) src/plugin/build.xml
Improve fix for NUTCH-1403 (ameer.albahem: 
[https://github.com/apache/nutch/commit/cdb6b52b02958385497804ef7cd6a6b646616208])
* (edit) default.properties
* (delete) src/plugin/scoring-metadata/pom.xml
* (edit) conf/nutch-default.xml
* (delete) src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
* (edit) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html
* (add) src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/TestMetadataScoringFilter.java
* (edit) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/MetadataScoringFilter.java
Improve NUTCH-1403, add ASLv2 header (ameer.albahem: 
[https://github.com/apache/nutch/commit/93aa2ab41097511f3afe8d34c9c13cafd735cec9])
* (edit) src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/TestMetadataScoringFilter.java
* (edit) src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html


> Add default ScoringFilter for manipulating metadata 
> 
>
> Key: NUTCH-1403
> URL: https://issues.apache.org/jira/browse/NUTCH-1403
> Project: Nutch
> Issue Type: Improvement
> Reporter: Julien Nioche
> Priority: Major
> Fix For: 1.19
>
>
> This is currently done by the urlmeta plugin, which has too vague a name and 
> a redundant indexing filter now that we have the index-metadata plugin. This 
> scoring filter would help define which metadata to pass from: 
> - the crawl metadata to the content metadata
> - the content metadata to the parse metadata
> - the parse metadata to the crawldatum for the outlinks
> I'd make this scoring filter available by default, i.e. not in a separate 
> plugin, as its functionality is commonly used.
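
The three propagation steps listed in the issue description can be sketched in plain Java. This is only an illustration under stated assumptions: the class name, the method names and the use of `Map<String, String>` as a metadata container are hypothetical and are not the Nutch API; the real filter works on CrawlDatum, Content and ParseData objects.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Nutch code: metadata containers are modelled as plain maps.
public class MetadataPropagationSketch {

  // Copies only the listed keys from one metadata container to another.
  static void copyKeys(Map<String, String> from, Map<String, String> to,
      List<String> keys) {
    for (String key : keys) {
      String value = from.get(key);
      if (value != null) {
        to.put(key, value);
      }
    }
  }

  // Runs the three steps in order (crawl -> content -> parse -> outlinks)
  // and returns the metadata that would end up on the outlinks.
  static Map<String, String> propagate(Map<String, String> crawlMeta,
      List<String> dbKeys, List<String> contentKeys, List<String> parseKeys) {
    Map<String, String> contentMeta = new LinkedHashMap<>();
    Map<String, String> parseMeta = new LinkedHashMap<>();
    Map<String, String> outlinkMeta = new LinkedHashMap<>();
    copyKeys(crawlMeta, contentMeta, dbKeys);       // step 1
    copyKeys(contentMeta, parseMeta, contentKeys);  // step 2
    copyKeys(parseMeta, outlinkMeta, parseKeys);    // step 3
    return outlinkMeta;
  }

  public static void main(String[] args) {
    Map<String, String> crawlMeta = new LinkedHashMap<>();
    crawlMeta.put("category", "news");
    crawlMeta.put("depth", "2");
    // "depth" is passed on in step 1 only, so it never reaches the outlinks.
    Map<String, String> out = propagate(crawlMeta,
        Arrays.asList("category", "depth"),
        Arrays.asList("category"),
        Arrays.asList("category"));
    System.out.println(out); // prints {category=news}
  }
}
```

A key has to be listed at every step it should survive, which is why each step takes its own key list in this sketch.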



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276633#comment-17276633
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel merged pull request #458:
URL: https://github.com/apache/nutch/pull/458


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add default ScoringFilter for manipulating metadata 
> 
>
> Key: NUTCH-1403
> URL: https://issues.apache.org/jira/browse/NUTCH-1403
> Project: Nutch
> Issue Type: Improvement
> Reporter: Julien Nioche
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.19
>
>
> This is currently done by the urlmeta plugin, which has too vague a name and 
> a redundant indexing filter now that we have the index-metadata plugin. This 
> scoring filter would help define which metadata to pass from: 
> - the crawl metadata to the content metadata
> - the content metadata to the parse metadata
> - the parse metadata to the crawldatum for the outlinks
> I'd make this scoring filter available by default, i.e. not in a separate 
> plugin, as its functionality is commonly used.





[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276632#comment-17276632
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458:
URL: https://github.com/apache/nutch/pull/458#issuecomment-771132058


   +1 Excellent, @aalbahem!





[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-01-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273250#comment-17273250
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on pull request #458:
URL: https://github.com/apache/nutch/pull/458#issuecomment-768705589


   Anyone else able to review? @sebastian-nagel are you able to revisit your 
review? 





[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-01-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263621#comment-17263621
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on pull request #458:
URL: https://github.com/apache/nutch/pull/458#issuecomment-758877653


   @aalbahem please see my comments 
   





> Add default ScoringFilter for manipulating metadata 
> 
>
> Key: NUTCH-1403
> URL: https://issues.apache.org/jira/browse/NUTCH-1403
> Project: Nutch
> Issue Type: Improvement
> Reporter: Julien Nioche
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.18
>
>
> This is currently done by the urlmeta plugin, which has too vague a name and 
> a redundant indexing filter now that we have the index-metadata plugin. This 
> scoring filter would help define which metadata to pass from: 
> - the crawl metadata to the content metadata
> - the content metadata to the parse metadata
> - the parse metadata to the crawldatum for the outlinks
> I'd make this scoring filter available by default, i.e. not in a separate 
> plugin, as its functionality is commonly used.





[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-01-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263620#comment-17263620
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r556012181



##
File path: 
src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html
##
@@ -0,0 +1,17 @@
+

Review comment:
   Please include ASLv2 header

##
File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/TestMetadataScoringFilter.java
##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.scoring.ScoringFilterException;
+import org.apache.nutch.util.NutchConfiguration;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.util.HashMap;
+
+public class TestMetadataScoringFilter {
+
+  @Before

Review comment:
   This is redundant and can be removed. 

##
File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/TestMetadataScoringFilter.java
##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
   Please include ASLv2 header







[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-01-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262888#comment-17262888
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r555283893



##
File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
   Excellent @aalbahem I'll begin reviewing. Thank you







[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-01-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262312#comment-17262312
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

aalbahem commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r554635298



##
File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
   @lewismc I have addressed the comments. Let me know if you need anything 
else.







[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-01-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261001#comment-17261001
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r553744204



##
File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
   Wow that was quick. All the best with your Ph.D. defense. 







[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-01-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260995#comment-17260995
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

aalbahem commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r553741862



##
File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
   Yes, I can fix them. I was too busy with my PhD thesis, but I have 
more free time now, so I can do that.







[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2021-01-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260992#comment-17260992
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

lewismc commented on a change in pull request #458:
URL: https://github.com/apache/nutch/pull/458#discussion_r553739368



##
File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
##
@@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;

Review comment:
   @aalbahem are you able to update this patch? It would be a shame for it 
to sit forever. Great work, I can review it if you can address 
@sebastian-nagel's comments. 







[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-10-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941869#comment-16941869
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed 
by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330068083
 
 

 ##
 File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
 ##
 @@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;
 
 Review comment:
   Great! I'll move NUTCH-1403 to the Nutch 1.17 tasks, as I hope to get the 
1.16 release candidate ready today or tomorrow. No hurry!
 



> Add default ScoringFilter for manipulating metadata 
> 
>
> Key: NUTCH-1403
> URL: https://issues.apache.org/jira/browse/NUTCH-1403
> Project: Nutch
> Issue Type: Improvement
> Reporter: Julien Nioche
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
>
> This is currently done by the urlmeta plugin, which has too vague a name and 
> a redundant indexing filter now that we have the index-metadata plugin. This 
> scoring filter would help define which metadata to pass from: 
> - the crawl metadata to the content metadata
> - the content metadata to the parse metadata
> - the parse metadata to the crawldatum for the outlinks
> I'd make this scoring filter available by default, i.e. not in a separate 
> plugin, as its functionality is commonly used.





[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-10-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941851#comment-16941851
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

aalbahem commented on pull request #458: fix for NUTCH-1403 contributed by 
aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330059885
 
 

 ##
 File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
 ##
 @@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;
 
 Review comment:
   Thanks, Sebastian, I will address your comments. Do you have a time frame 
for when this should be done? If you plan to merge this into a release, then I can 
try to get it done by this weekend.
 



[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-10-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941784#comment-16941784
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed 
by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330030804
 
 

 ##
 File path: 
src/plugin/scoring-metadata/src/test/org/apache/nutch/scoring/metadata/MetadataScoringFilterTest.java
 ##
 @@ -0,0 +1,113 @@
+package org.apache.nutch.scoring.metadata;
 
 Review comment:
   Nice to have a test! However, it isn't executed when running ant: test 
classes must be named "Test*.java".
   
   To run only the scoring-metadata unit tests:
   ```
   ant test-plugin -Dplugin=scoring-metadata
   ```
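
To make the naming rule from the review comment concrete, here is a minimal Java sketch that checks file names against the "Test*.java" pattern. How the Nutch build actually selects test classes is not shown in this thread, so the simple prefix/suffix check and the class name `TestNameCheck` are assumptions made for illustration only.

```java
// Hypothetical helper, not part of Nutch or its build files.
public class TestNameCheck {

  // Returns true if the file name follows the "Test*.java" convention
  // quoted in the review comment.
  static boolean isPickedUp(String fileName) {
    return fileName.startsWith("Test") && fileName.endsWith(".java");
  }

  public static void main(String[] args) {
    // The original name ends in "Test", so it does not match the pattern.
    System.out.println(isPickedUp("MetadataScoringFilterTest.java")); // prints false
    // The renamed class starts with "Test" and matches.
    System.out.println(isPickedUp("TestMetadataScoringFilter.java")); // prints true
  }
}
```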
 



[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-10-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941787#comment-16941787
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed 
by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330025322
 
 

 ##
 File path: 
src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/package.html
 ##
 @@ -0,0 +1,10 @@
+
+  
+
+  Metadata Scoring Plugin
+
+
+  Propagates Meta data, injected from in db, parse data or content from 
one stage to another. 
 
 Review comment:
   This could be more detailed, cf. the [urlmeta package 
description](/apache/nutch/blob/master/src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/package.html).
 At the very least, the 3 steps (and, respectively, the 4 containers) should be listed in the right order: 
CrawlDb --> content --> parse data --> outlinks
 



[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-10-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941786#comment-16941786
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed 
by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330020972
 
 

 ##
 File path: 
src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/MetadataScoringFilter.java
 ##
 @@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.metadata;
+
+import java.util.Collection;
+import java.util.Map.Entry;
+import java.util.Iterator;
+import java.util.List;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.scoring.ScoringFilter;
+import org.apache.nutch.scoring.ScoringFilterException;
+
+
+/**
+ * For documentation:
+ * 
+ * {@link org.apache.nutch.scoring.metadata}
+ */
+public class MetadataScoringFilter extends Configured implements ScoringFilter 
{
 
 Review comment:
   Could inherit from 
[AbstractScoringFilter](/apache/nutch/blob/master/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java)
 which already implements the boilerplate methods.
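
The suggestion can be illustrated with a small, self-contained Java sketch of the pattern: an abstract base class supplies no-op defaults so that a concrete filter overrides only the hooks it needs. The interface below is a simplified, hypothetical stand-in; its method names echo the scoring hooks but the signatures are not those of Nutch's ScoringFilter or AbstractScoringFilter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "abstract base with no-op defaults" pattern.
public class AbstractFilterSketch {

  // Simplified stand-in for a filter interface with several hooks.
  interface SimpleScoringFilter {
    void passScoreBeforeParsing(Map<String, String> datumMeta,
        Map<String, String> contentMeta);
    void passScoreAfterParsing(Map<String, String> contentMeta,
        Map<String, String> parseMeta);
    float generatorSortValue(float initSort);
  }

  // Base class implementing every hook as a no-op (or pass-through).
  abstract static class AbstractSimpleScoringFilter
      implements SimpleScoringFilter {
    @Override
    public void passScoreBeforeParsing(Map<String, String> datumMeta,
        Map<String, String> contentMeta) {
    }

    @Override
    public void passScoreAfterParsing(Map<String, String> contentMeta,
        Map<String, String> parseMeta) {
    }

    @Override
    public float generatorSortValue(float initSort) {
      return initSort; // leave the sort value unchanged
    }
  }

  // Concrete filter: overrides only the single hook it cares about.
  static class CategoryFilter extends AbstractSimpleScoringFilter {
    @Override
    public void passScoreBeforeParsing(Map<String, String> datumMeta,
        Map<String, String> contentMeta) {
      String value = datumMeta.get("category");
      if (value != null) {
        contentMeta.put("category", value);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, String> datumMeta = new HashMap<>();
    datumMeta.put("category", "news");
    Map<String, String> contentMeta = new HashMap<>();
    CategoryFilter filter = new CategoryFilter();
    filter.passScoreBeforeParsing(datumMeta, contentMeta);
    System.out.println(contentMeta.get("category")); // prints news
    System.out.println(filter.generatorSortValue(1.5f)); // prints 1.5
  }
}
```

Compared with implementing the interface directly, the subclass stays short and does not have to repeat the untouched hooks.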
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add default ScoringFilter for manipulating metadata
>
> Key: NUTCH-1403
> URL: https://issues.apache.org/jira/browse/NUTCH-1403
> Project: Nutch
> Issue Type: Improvement
> Reporter: Julien Nioche
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
> This is currently done by the urlmeta plugin, which has too vague a name and 
> a redundant indexing filter now that we have the index-metadata plugin. This 
> scoring filter would help define which metadata to pass from:
> - the crawl metadata to the content metadata
> - the content metadata to the parse metadata
> - the parse metadata to the crawldatum for the outlinks
> I'd make this scoring filter available by default, i.e. not in a separate 
> plugin, as its functionality is commonly used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-10-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941785#comment-16941785
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed 
by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330021393
 
 

 ##
 File path: 
src/plugin/scoring-metadata/src/java/org/apache/nutch/scoring/metadata/MetadataScoringFilter.java
 ##
 @@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.metadata;
+
+import java.util.Collection;
+import java.util.Map.Entry;
+import java.util.Iterator;
+import java.util.List;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.scoring.ScoringFilter;
+import org.apache.nutch.scoring.ScoringFilterException;
+
+
+/**
+ * For documentation:
+ * 
+ * {@link org.apache.nutch.scoring.metadata}
+ */
+public class MetadataScoringFilter extends Configured implements ScoringFilter {
+
+  public static final String METADATA_DATUM   = "scoring.db.md";
 
 Review comment:
   The three new properties should be "documented" in `conf/nutch-default.xml`.
 



[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-10-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941788#comment-16941788
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

sebastian-nagel commented on pull request #458: fix for NUTCH-1403 contributed 
by aalbahem
URL: https://github.com/apache/nutch/pull/458#discussion_r330019665
 
 

 ##
 File path: src/plugin/scoring-metadata/pom.xml
 ##
 @@ -0,0 +1,38 @@
+
 
 Review comment:
   Why a pom.xml? Nutch is still built using ant. The file shouldn't be 
included, as it only confuses users, who then try to run "mvn" and wonder why 
the compilation fails or the packages are not bundled properly.
 



[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-06-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875817#comment-16875817
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

aalbahem commented on pull request #458: fix for NUTCH-1403 contributed by 
aalbahem
URL: https://github.com/apache/nutch/pull/458
 
 
   The following was done: 
   1. Copied much of urlmeta into a new scoring plugin (scoring-metadata) 
   2. Introduced **scoring.db.md** to copy metadata from a crawl datum and 
inject it into the content 
   3. Introduced **scoring.content.md** to copy metadata from the content and 
inject it into the parse data 
   4. Introduced **scoring.parse.md** to copy metadata from the parse data to 
the crawl datums of the outlinks as db metadata 
   5. Changed the necessary build files to include the plugin in the clean, 
test and deploy tasks
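
   The three hops listed above can be sketched with plain maps. This is a minimal illustration, not the plugin's code: the class and method names are hypothetical, metadata is modelled as `Map<String, String>` rather than Nutch's own types, and treating each property value as a comma-separated list of keys is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; names and data model are assumptions, not the plugin's API.
class MetadataPropagationSketch {

  // Split a property value (assumed comma-separated) into trimmed, non-empty keys.
  static List<String> parseKeys(String value) {
    if (value == null) {
      return List.of();
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(k -> !k.isEmpty())
        .collect(Collectors.toList());
  }

  // Copy only the listed keys that are present in source into target.
  static void copySelected(Map<String, String> source,
      Map<String, String> target, List<String> keys) {
    for (String key : keys) {
      String v = source.get(key);
      if (v != null) {
        target.put(key, v);
      }
    }
  }

  // Run the three hops in order and return the metadata that reaches the outlinks.
  static Map<String, String> propagate(Map<String, String> datum,
      String dbMd, String contentMd, String parseMd) {
    Map<String, String> content = new LinkedHashMap<>();
    copySelected(datum, content, parseKeys(dbMd));         // scoring.db.md
    Map<String, String> parse = new LinkedHashMap<>();
    copySelected(content, parse, parseKeys(contentMd));    // scoring.content.md
    Map<String, String> outlink = new LinkedHashMap<>();
    copySelected(parse, outlink, parseKeys(parseMd));      // scoring.parse.md
    return outlink;
  }
}
```

   In this sketch a key reaches the outlinks only if it is named at every hop, so each list can be kept as narrow as the use case requires. Naming the same key in the first two lists is enough to carry it from the crawl datum through to the parse data.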
 



[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-06-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875815#comment-16875815
 ] 

ASF GitHub Bot commented on NUTCH-1403:
---

aalbahem commented on pull request #458: fix for NUTCH-1403 contributed by 
aalbahem
URL: https://github.com/apache/nutch/pull/458
 
 
   The following was done: 
   1. Copied much of urlmeta into a new scoring plugin (scoring-metadata) 
   2. Introduced **scoring.db.md** to copy metadata from a crawl datum and 
inject it into the content 
   3. Introduced **scoring.content.md** to copy metadata from the content and 
inject it into the parse data 
   4. Introduced **scoring.parse.md** to copy metadata from the parse data to 
the crawl datums of the outlinks as db metadata 
   5. Changed the necessary build files to include the plugin in the clean, 
test and deploy tasks
 
 
   
 



[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2014-09-29 Thread Ameer Tawfik Albaham (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151545#comment-14151545
 ] 

Ameer Tawfik Albaham commented on NUTCH-1403:
-

Good. I also discarded the fourth one, as I reached the same conclusion you 
just mentioned. As I said, I have already implemented it. I will be able to 
finish the whole thing and prepare a patch by this weekend.




[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2014-09-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151537#comment-14151537
 ] 

Julien Nioche commented on NUTCH-1403:
--

You are right, the urlmeta plugin currently does not distinguish between the 
different levels and uses a single config 'urlmeta.tags'. Having 3 separate 
configs is probably a good idea.
Not sure the 4th one you suggested is really necessary ("a list of meta data to 
pass from crawl datum to content to parse data") as it is the same as using the 
same values for the first 2. Having it would probably generate some confusion.

Thanks 



[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2014-09-29 Thread Ameer Tawfik Albaham (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151515#comment-14151515
 ] 

Ameer Tawfik Albaham commented on NUTCH-1403:
-

I have implemented it and am currently writing unit tests. However, I made a 
separate list for each of the three cases. In my work, I encountered cases 
where I needed to pass information only from crawl datum to content, or only 
from parse data to outlink, so I felt the separation was necessary. Do you 
have any thoughts regarding that?



[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2014-09-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151487#comment-14151487
 ] 

Julien Nioche commented on NUTCH-1403:
--

Hi Ameer

The idea here is simply to move the handling of metadata away from the urlmeta 
plugin to the core of the code and get rid of urlmeta. 





[jira] [Commented] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2014-09-27 Thread Ameer Tawfik Albaham (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150747#comment-14150747
 ] 

Ameer Tawfik Albaham commented on NUTCH-1403:
-

I would like to contribute to this issue. However, I am not quite sure what 
else is required beyond what *urlmeta* already does. Would this plugin have a 
different list of metadata for each case:

- a list of metadata to pass from crawl datum to content.
- a list of metadata to pass from content to parse data.
- a list of metadata to pass from parse data to outlinks.
- a list of metadata to pass from crawl datum to content to parse data.

